SUMMARY: Random machine hangs - revisited

From: Greg Earle - Gainfully Unemployed (earle@elroy.Jpl.Nasa.Gov)
Date: Thu Dec 03 1992 - 09:53:52 CST


Yesterday I posted a query about machine hangs like the ones Mark Holm had
previously posted about; I was seeing the same symptoms. His original is too
long to regurgitate, but the gist was that after a main NFS/NIS & automount
export server rebooted, some clients would hang some time later: processes
run from shells would hang, and attempts to log in remotely would hang too.

The (presumed) actual problem was figured out with some gruntwork on my part,
but mostly thanks to helpful assistance from Brent Callaghan, author of
automount. My not-inconsiderable gratitude to Brent for his (quick!) help.

The problem was that I set up my customer's systems to automount /usr/share in
a replicated fashion (from the main server and an NIS slave server). The
trouble with this is that /usr/share is usually mounted (think about it: a
user logging on will undoubtedly trigger this with a reference to termcap),
but sometimes it can time out (e.g., no one on the machine) and be unmounted.

Well, the "automount" process can then come along and decide to generate a
timestamp, for which it calls localtime(3). That library routine will in turn
attempt to open "/usr/share/lib/zoneinfo/localtime" (this is 4.1.2 btw).
Guess what happens if /usr/share is automounted and not currently mounted?
Bingo, automounter deadlock: the open needs automount to mount /usr/share,
and automount is the very process blocked in that open. "automount" gets
stuck in "DW" state, and any new processes that tickle automount mount points
("ls -l /" will cause a stat(2) of "/home", for example - and hang) also hang
in "D" state. Brent says that the automounter man page now explicitly warns
against automounting /usr/share (I presume this is the 5.x man page; the
4.1.2 man page does not warn) because of this potential deadlock scenario.
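
The deadlock condition above boils down to "a file the automounter itself
needs lives under one of its own mount points". A minimal sketch of that check
in Python follows. The function names and the dependency list are hypothetical
and exist only for illustration; the one path listed is the one named in this
post, and a real automounter may well depend on other files too.

```python
import posixpath

# Files the automounter may open while running. Only this one path comes from
# the post; treat the list as an assumption and extend it for your own site.
DAEMON_DEPENDENCIES = ["/usr/share/lib/zoneinfo/localtime"]


def covering_automount(path, automount_points):
    """Return the automount point that contains `path`, or None if none does."""
    path = posixpath.normpath(path)
    for point in automount_points:
        point = posixpath.normpath(point)
        prefix = point.rstrip("/") + "/"
        if path == point or path.startswith(prefix):
            return point
    return None


def risky_dependencies(automount_points, dependencies=DAEMON_DEPENDENCIES):
    """Map each daemon dependency to the automount point that covers it."""
    risky = {}
    for dep in dependencies:
        point = covering_automount(dep, automount_points)
        if point is not None:
            risky[dep] = point
    return risky


if __name__ == "__main__":
    # Hypothetical set of automount points resembling the setup in the post.
    for dep, point in risky_dependencies(["/home", "/usr/share"]).items():
        print("%s is under automount point %s" % (dep, point))
```

Matching on whole path components (rather than a bare string prefix) keeps a
directory such as "/usr/sharefoo" from being mistaken for "/usr/share".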

Summary of suggestions I received:

(1) Enable savecore and provoke a crash dump if necessary; yes I was thinking
    about resorting to this (none of the machines have crash dumps enabled),
    but I could still get on afflicted machines via "rsh sick /usr/bin/csh -i"
    so I was looking for ways to solve it in "normal" mode. Brent pointed
    out that if I suspected the automounter, I could do a "ps" to get the PID
    and then do

        sick# /usr/bin/adb -k -w /vmunix /dev/mem
        physmem xxx

        0t<PID>$<setproc
        $C

    to get the stack backtrace, and then I could dump out the proc structure
    and the user structure for more info. The stack backtrace was all that
    was needed: one of the arguments to a vn_open() was the
    "/usr/share/lib/zoneinfo/localtime" path, which told Brent the problem.

(2) Some people suggested that perhaps automount was hanging on a bad NIS map.
    As I'd made some significant changes to auto.vol, I will re-make the NIS
    maps and re-init the slave server as a paranoiac step, when I get a chance.

(3) People suggested that the "panic: iinactive" could be caused by a bad disk
    (not in this case) or as a byproduct of stale mounts (more likely).

(4) A couple of people told me that I was supposed to be *answering* people's
    questions, not asking them (-: Although I agree (-: in my own defense I'm
    a little rusty doing "live" sys admin, and more importantly, this was one
    of those classic cases of people-under-stressful-project-deadlines and
    we-can't-afford-even-a-minute-of-downtime environments, and I thought that
    asking if others had encountered this (I had never seen anything like this
    even supporting 1300+ users at JPL) right away might prove quicker than
    simply trying to take the time to figure it out myself (nothing like some
    people with furrowed eyebrows and steam rising out of their skulls to make
    you look for a quick answer, instead of calmly deciding on a course of
    debugging action (-: )
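
Suggestion (1) starts with a "ps" to find the PID of the stuck process. As a
rough sketch, the Python below picks out processes whose state begins with "D"
from captured "ps" text. It assumes whitespace-separated columns under a
header containing "PID" and "STAT"; the function name and the sample listing
are hypothetical, not output from the machines discussed here.

```python
def blocked_processes(ps_output):
    """Return (pid, stat, command) for rows whose STAT starts with 'D'."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    blocked = []
    for line in lines[1:]:
        # Split into at most len(header) fields so the command stays whole.
        fields = line.split(None, len(header) - 1)
        if len(fields) <= max(pid_col, stat_col):
            continue  # short or malformed row; skip it
        stat = fields[stat_col]
        if stat.startswith("D"):
            command = fields[-1] if len(fields) == len(header) else ""
            blocked.append((int(fields[pid_col]), stat, command))
    return blocked


# Hypothetical listing, made up for illustration only.
SAMPLE = """\
  PID TT STAT  TIME COMMAND
  112 ?  DW    0:03 automount
  245 p0 D     0:00 ls -l /
  301 p1 S     0:01 -csh (csh)
"""

if __name__ == "__main__":
    for pid, stat, command in blocked_processes(SAMPLE):
        print("%d %s %s" % (pid, stat, command))
```

Any PID it reports is a candidate for the adb backtrace in suggestion (1).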

Anyone who wants to publicly flagellate me for my lack of Wizard's knowledge
can do so next Thursday afternoon at SUG in San Jose, where I'll be sitting on
Dinah McNutt's "System Administration Wizard's Panel". You can ask me why I
deserve to sit on it if I need to ask for help from sun-managers ... (-:

My thanks to the following respondents:

BROOME@ECCLES.NZDRI.Org.NZ
brent@terra.Eng.Sun.COM
c3314jcl@mercury.nwac.sea06.navy.mil
geertj@ica.philips.nl
robinson@Eng.Sun.COM
birger@vest.sdata.no
danny@ews7.dseg.ti.com
Perry_Hutchison.Portland@xerox.com
ups!upstage!glenn@fourx.Aus.Sun.COM
markh@analogy.com

--
	- Greg Earle
	  Itinerant Sun Consultant
	  earle@elroy.JPL.NASA.GOV
	  earle@cminet.UUCP-neighbor.visicom.VisiCom.COM


