SUMMARY: LOTS of issues with a solaris 2.5.1 box

From: Sweth Chandramouli (sweth@astaroth.nit.gwu.edu)
Date: Fri Feb 19 1999 - 18:25:10 CST


        In many ways, this turned out to be a comedy of errors, in which none
of the problems that I had were actually what I thought they were. Before
I explain, though, here's the summary of the advice that I received (which
would have all been very helpful had I really understood the problem and
been asking the right questions):
        Charlie Mengler and Al Hopper both suggested disconnecting the cdrom,
which they thought could have been flaking out and messing up the SCSI chain,
thus affecting the system disks. John White and Celeste Stokely both
suggested some interaction between the APC startup script and the bd processes,
while James Coby sent me a copy of a sunsolve summary of a similar problem with
ps messing up bd (which I have included at the end of this message); as far as
I can tell, it _was_ the APC script that unlinked bd.off, and following the
suggestion that James sent solved that problem. Celeste also suggested that
the random x-s on the terminal indicated a baud-rate mismatch between the console
port and the terminal, but was confused as to why this would happen midway
through the boot; my best guess is that the APC script was also causing the serial
port to reset speed, since the UPS that used to connect to this box was connected
via the serial port. Finally, Richard Smith suggested checking the SCSI id of
the cdrom to make sure there were no conflicts, and also recommended checking
the inittab to see if any errors there were affecting what scripts were run
during boot.
        The actual problem, as it turned out, was that this was a machine that
I inherited from another admin, and which I assumed was configured in a vaguely
sane way, so that I had never really taken a good look at how it was set up.
For one thing, upon opening the case, I discovered that the cdrom was not showing
up because it had already been disconnected! It appears that at some point before
leaving, the prior admin for this machine had needed a differential SCSI cable,
which this machine had, so he had "appropriated" it, and never told anyone;
reconnecting the cdrom and doing a reconfiguration boot (boot -r) solved that problem.
        The bd problem, as I mentioned, was solved by James Coby's solution, but
it, too, was an issue of a poorly configured server; the bd module provides
support for Sun's Buttons&Dials interface, which from what I can tell is not needed
for a server without any need for graphical interfaces, and thus should not have
even been installed on this machine. (I am currently building the successor to
this machine, so I didn't bother to clean up all of the extraneous graphics drivers
and such things that I also found on the machine--I'll take care of that later,
during a clean reinstall.)
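
        The "link pointing to nothing" symptom can be checked for directly. The
following is only a minimal sketch, assuming a POSIX shell (it has not been
verified against the stock 2.5.1 /bin/sh, whose test operators may differ); the
function name find_dangling_links is made up for illustration. It prints every
symbolic link in a directory whose target does not resolve.

```shell
#!/bin/sh
# Hypothetical helper: report symbolic links in a directory whose
# targets do not resolve (as /dev/bd.off did in this case).
# Example call: find_dangling_links /dev
find_dangling_links() {
    dir=$1
    for entry in "$dir"/*; do
        # -L: the entry itself is a symbolic link
        # ! -e: following the link does not reach an existing file
        if [ -L "$entry" ] && [ ! -e "$entry" ]; then
            echo "$entry"
        fi
    done
    return 0
}
```

Running it against /dev before and after the boot -r would show whether bd.off
(or anything else) is still left dangling.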
        Finally, according to the Sun engineer who happened to be in our machine
room that evening, the /etc/rc3.d problem where K scripts were being run appears to
be a bug/feature in Solaris (at least through version 2.6), although I haven't
found any official mention of this problem anywhere before. Basically, from what
I understand (and I have a ticket open with Sun for someone to explain it to me
in more detail if it _is_ actually a bug), init will run _any_ script in /etc/rc3.d,
regardless of initial letter; the logic that I was given was that since run-level 4
is not used, and K scripts in a given "rc<n>.d" directory are run only when Solaris
changes run-levels from run-level <n+1> to <n>, no K scripts would ever need to
be placed in rc3.d--the system never changes from run-level 4 to run-level 3. Thus,
init doesn't consider the possibility that a script in rc3.d might be a K script,
and instead assumes that all scripts there are S scripts. It sounds like a poor
coding decision to me, but the fact is that those particular K scripts _should_
never have been in rc3.d; moving them to rc2.d cleared up those issues as well.
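
        The cleanup described above (finding K-prefixed scripts in rc3.d and
moving them to rc2.d) can be sketched as a dry run. This is a sketch under
assumptions, not the exact commands that were used: the function names
list_k_scripts and suggest_moves are invented for illustration, a POSIX shell is
assumed, and the mv commands are only printed, never executed.

```shell
#!/bin/sh
# Hypothetical helper: list kill (K*) scripts found in a run-level directory.
list_k_scripts() {
    rcdir=$1
    for script in "$rcdir"/K*; do
        # Skip the unexpanded pattern when no K scripts exist.
        if [ -f "$script" ]; then
            echo "$script"
        fi
    done
    return 0
}

# Hypothetical helper: print (but do not run) the mv commands that
# would relocate those scripts to another run-level directory.
# Example call: suggest_moves /etc/rc3.d /etc/rc2.d
suggest_moves() {
    from=$1
    to=$2
    list_k_scripts "$from" | while read -r script; do
        echo "mv $script $to/"
    done
}
```

Printing the commands first leaves room to check for name clashes with scripts
already in the destination directory before anything is actually moved.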

        (Sorry about the long delay in this summary, but the reason a Sun engineer
_was_ in our machine room that night was that we were having lots of issues with
another, more important system here, and I've been spending way too much time
trying to help out with that machine for the last few weeks. I'll probably be
posting a few questions about that other machine shortly, in fact.)

        -- sweth.

On Fri, Feb 05, 1999 at 10:41:08PM -0500, Sweth Chandramouli wrote:
> i've got a 300mhz sparc ultra/2 running solaris 2.5.1 that up and
> crashed on me this evening; my first sign of something wrong was that it dropped
> off the network entirely. when i ran to the console (a dumb terminal attached
> to the serial port), everything appeared to be locked up, and there were two
> rows of gibberish (mostly "x" and that strange norwegian character that looks
> like an o with a slash through it) after the login prompt; the only other
> evidence of something amiss was that the yellow activity light for the cdrom
> drive (which was empty) was on continually. after trying unsuccessfully to get
> any sort of response out of the machine (it wouldn't even drop to the openboot
> prompt after a break), i bit the big one and did a hard reboot. it posted fine,
> but again wouldn't respond to a break; it _was_ accepting at least _some_ input
> from the dumb terminal, however, because it would (sometimes) print a carriage
> return if i hit the enter key. it started the kernel, and began going through
> the startup scripts; when it got to starting vold, however, it again started
> spewing lots of "x"s, and hung.
> i tried this a few more times, and then tried rebooting again with a
> different, known-good dumb terminal. this time, it would still hang while
> starting vold (or so it seemed), but i could use break to get to the ok prompt.
> a test-all showed no problems, but a probe-scsi only showed the two hard disks
> in the machine--not the cd-rom. i then rebooted into single-user mode, which
> worked fine. (the entire time that this was going on, the yellow activity light
> for the cdrom was on whenever the machine was powered up.)
> looking in the /etc/rc2.d directory i found that the machine was, after
> vold, trying to run a startup script for an apc powerchute backup power supply
> to which it had once been attached. (when i inherited responsibility for this
> machine, it was a real mess, and i have been planning on bringing up another
> machine to replace it for the last few months, so i kept telling myself that i
> didn't really need to go through it and clean up after the old sysadmin...)
> when i removed that file and rebooted, things seemed to start just fine, which
> led me to believe that it was that script, and not vold, that had been hanging.
>
> HOWEVER...
>
> things are still very strange:
>
> * probe-scsi (along with vold or anything else in the os, obviously)
> still does not acknowledge the cdrom, which still has its activity light
> constantly on.
>
> * the /etc/rc2.d/S89bdconfig script (of which i know nothing) returns
> the following error:
>
> /dev/bd.off: not a serial device.
> bdconfig: no serial device configured. Run bdconfig interactively (no args).
>
> and /dev/bd.off appears to be a link pointing to nothing, which i assume
> is not a good thing:
>
> # ls -ld /dev/bd.off
> lrwxrwxrwx 1 root root 0 Feb 5 21:29 /dev/bd.off ->
>
> and, strangest of all, it appears that during startup, ALL scripts in
> the /etc/rc3.d directory are being run--those starting with K as well as those
> starting with S.
>
> luckily(?), one of the other departments here is doing a large migration
> tonight, and paid for a sun engineer to come in and "be available" for part of
> it. he should be arriving in about 4 hours, so if it comes to that, i might be
> able to divert him to my machine for a while. i'd rather get this fixed before
> then, of course, and i also haven't had the best of experiences with the last
> few sun engineers i've had to deal with. so, does anyone else out there have
> any suggestions?
>
> (people receiving this via dc-sage, please e-mail me directly; i'll cc
> my sun-managers summary to dc-sage as well.)
>
> t a million times ia,
> sweth.
>
>
> --
> Sweth Chandramouli
> IS Coordinator, The George Washington University
> <sweth@gwu.edu> / (202) 994 - 8521 (V) / (202) 994 - 0458 (F)
> *

On Mon, Feb 08, 1999 at 01:09:44PM -0600, James Coby wrote:
>
> SRDB ID: 6553
>
> SYNOPSIS: ps hangs without output or header
>
> DETAIL DESCRIPTION:
>
>
> In Solaris 2.X:
>
> Sometimes, after an abrupt system crash, bd.off becomes a
> link to nowhere. When this occurs, the ps command hangs
> without displaying the header or the process data.
>
> @ 16: ls -l bd*
> lrwxrwxrwx 1 root root 11 Nov 9 10:59 bd.off ->
>
> Note:
>
> If ps is used in the .login script, or is part of the boot process,
> it can appear as though the system is hung.
>
> SOLUTION SUMMARY:
>
>
> The bd STREAMS module processes the byte streams generated
> by the SunButtons buttonbox and SunDials dialbox.
>
> If ps is hanging without output, go to /dev
> and remove bd.off. Then boot -r and the problem should be
> corrected.
>
> After the boot -r, the correct link will be re-established:
>
> @ 17: ls -l bd*
> lrwxrwxrwx 1 root root 11 Nov 9 10:59 bd.off -> /dev/term/b

-- 
Sweth Chandramouli
IS Coordinator, The George Washington University
<sweth@gwu.edu> / (202) 994 - 8521 (V) / (202) 994 - 0458 (F)
* <http://astaroth.nit.gwu.edu/~sweth/disc.html>


