SUMMARY: Server dies with BAD TRAP on nfsd ( LONG )

From: nicky@davinci.concordia.ca
Date: Wed Sep 04 1991 - 14:48:10 CDT


        First off... My apologies...
        Anyway, more than one person sent me a note saying "you can't
        get info if you don't give info".

        So here is the recap...
        My Sun 4/370 running SunOS 4.1.1 was crashing almost every 2 hours.
        I had no idea why. Here is the dmesg output from a typical panic...

BAD TRAP
pid 127, `nfsd': Data fault
kernel read fault at addr=0xe0000007, pme=0x70000080
Bus Error Reg 0
pc=0xf8006ba4, sp=0xf811efa0, psr=0x11401ec0, context=0x0
g1-g7: e0000007, e2000f94, ffffffff, 85, f82c1000, f8146400, f8146400
Begin traceback... sp = f811efa0
Called from f8109f88, fp=f82c0a60, args=600 114016e2 114010e2 f815ac8c ff06e1ac ff06e6e4
End traceback...
panic: Data fault
syncing file systems... [23] [20] [12] done

        Pierre Laplante was kind enough to mail me an NFS patch he had. The
        problem, however, turned out to be related to ethernet activity and
        showed up most often in connection with NFS activity.
        As stated in my follow-up to Sun-managers, "More info on BAD TRAP
        provided...", a few respondents said that I should provide a
        traceback. They directed me to the savecore(8) and crash(8) programs,
        and I have now localized the problem.
        I have noticed that nfsd is not the only process that hits the panic.
        Using "crash -d vmcore.1 -n vmunix.1" I was able to produce a
        traceback. I looked at about 4 crashes and noticed that the stack was
        always _panic, _trap, _iecustart.

        Chris Drake and Geert Jan de Groot told me that I was facing an
        ethernet interface problem. Not good on a gateway. I started using
        the ie interface heavily and the machine went down right away. As a
        test I placed the server's gateway responsibilities on my secondary
        NIS server, which happens to be a gateway as well. I brought down the
        ie interface and started using the secondary as the primary gateway.
        Heavy use seemed to work fine once the ie interface was taken out of
        the picture.

        Sun Service has been notified.

        Thanks a lot. I learned a great deal from this experience.
        Lots of neat info follows.

Thanks to all who responded.

Hal Stern - Consultant <stern@bitatron.Eng.Sun.COM>
Marc Phillips <phillips@athena.Qualcomm.COM>
Kevin Sheehan {Consulting Poster Child} <kevins@Aus.Sun.COM>
Greg Higgins <higgins@mp.cs.niu.edu>
Brent Alan Wiese <brent@curie.ssctr.bcm.tmc.edu>
Andy Stefancik 234-3049 <ajs6143@eerpf001.boeing.com>
Ricardo Uribe <uriber@oes.orst.edu>
Frank Kuiper <frankk@cwi.nl>
Pierre Laplante <laplante@iro.umontreal.ca>

And special thanks to those who more or less walked me through to the answer.

Chris Drake <Chris.Drake@corp.sun.com>
Geert Jan de Groot <geertj@ica.philips.nl>

Nicky
-----------------------+---------------------------
Nick Ayoub nicky@davinci.concordia.ca
Dept. of E&CE Concordia University
Montreal QC. H3G 1M8 Voice : (514) 848-3107
Canada Fax : (514) 848-2802
-----------------------+---------------------------

---------------------------------
From: Hal Stern - Consultant <stern@bitatron.Eng.Sun.COM>

a bad trap is a hardware event that the os didn't expect
and has no idea how to handle. this one -- a data read
fault @ address e0000007 -- is caused by something asking
the kernel to read from a bogus address. you can also
get data faults if you dereference null pointers in the
kernel, or if you have memory board h/w problems. this
looks like a possible NFS bug.

take the pc value (under the Bus Error reg 0) and the
"called from" addresses and feed them to adb to see
where your panic occurred:
        # adb /vmunix
        0xf8006ba4?ia
        f8109f88?ia
and so on -- this produces a symbolic traceback of where
the problem occurred.
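
As a rough illustration of the step above, the pc value and the "Called from" addresses can be pulled out of the panic text and turned into adb input lines. This is only a sketch: the helper name adb_commands and the PANIC_TEXT sample (copied from the dmesg output in this thread) are illustrative and not part of any Sun tool, and the 0x prefix is added to every address purely for consistency.

```python
import re

# Sample lines copied from the panic message quoted in this thread.
PANIC_TEXT = """\
pc=0xf8006ba4, sp=0xf811efa0, psr=0x11401ec0, context=0x0
Begin traceback... sp = f811efa0
Called from f8109f88, fp=f82c0a60, args=600 114016e2 114010e2 f815ac8c ff06e1ac
End traceback...
"""


def adb_commands(panic_text):
    """Return adb '?ia' lines for the pc and each 'Called from' address.

    Hypothetical helper: it only formats text to type into adb and does
    not run adb itself.
    """
    addrs = []
    # The faulting program counter, e.g. "pc=0xf8006ba4".
    pc = re.search(r"\bpc=(?:0x)?([0-9a-fA-F]+)", panic_text)
    if pc:
        addrs.append(pc.group(1))
    # Every caller address listed in the traceback section.
    addrs.extend(re.findall(r"Called from (?:0x)?([0-9a-fA-F]+)", panic_text))
    return ["0x%s?ia" % a.lower() for a in addrs]


for line in adb_commands(PANIC_TEXT):
    print(line)
```

The printed lines can then be typed at the adb prompt after starting "adb /vmunix" as described above.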

--hal stern
---------------------------------
From: Chris Drake <Chris.Drake@Corp.Sun.COM>

Basically, a data fault on a sun-4 is the kernel equivalent of a segmentation
violation: something was trying to reference an address which was invalid, for
lots of possible reasons. To isolate this one, it would greatly help to get
a stack trace, if you can, or save the core file generated when the machine
crashes.

What we can get from this so far:

        kernel read fault at addr=0xe0000007, pme=0x70000080

The page table entry (pme) has the 'valid' bit turned off, which means this
particular page has been marked as inaccessible for some reason. The address
(0xe0000007) is sort of low; below the PC, which doesn't look good. Could be
a wild pointer. (Kernel data normally follows text, and would thus have a
higher address).
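
The two observations above can be checked mechanically. The following is a minimal sketch that assumes the valid bit is the most significant bit of the 32-bit page map entry; that bit position, the PG_V name, and the describe_fault helper are assumptions made for illustration and are not taken from this thread.

```python
# Assumption: the valid bit is the top bit of the 32-bit page map entry.
PG_V = 0x80000000


def describe_fault(addr, pme, pc):
    """Summarize the two checks described above (hypothetical helper)."""
    return {
        # True if the page map entry marks the page as valid.
        "pme_valid": bool(pme & PG_V),
        # True if the faulting address lies below the program counter.
        "addr_below_pc": addr < pc,
    }


# Values taken from the panic message quoted in this thread.
info = describe_fault(addr=0xE0000007, pme=0x70000080, pc=0xF8006BA4)
print(info)
```

Under that assumption, a pme of 0x70000080 has the top bit clear, which matches the reading that the page was marked inaccessible.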

What is necessary to go further:
        - traceback!
        - get the symbolic address of the instruction which failed (ie,
          what routine was it in at pc=0xf8006ba4?)
        - what version of the system is this actually running?
        - have there been any other crashes? Have there been any
          hardware or software changes?
        - for the other similar panics: is the PC the same? The address?
        - what is running when the system dies, anything in particular?

[ and in a later note ...]

Since this is consistently in the ethernet stuff, I'd tentatively suggest it's
hardware (the ethernet interface croaked), especially if the trouble just
started and there was nothing changed in either hardware or software. The
place it crashes (assuming I'm looking in the right place) appears to be in
the ethernet driver where it's trying to scan a chain of control blocks (?)
and restart the chip, or at least tell it to start transmitting. (Not being
an ether expert, this is a rough guess). Anyway, I looked for bugs or patches
relating to this traceback and found nothing - which may or may not be
meaningful. Anyway, it's definitely related to ethernet activity (which would
explain the preponderance of nfsd's which were running when the crashes
occurred).

---------------------------------
From: Brent Alan Wiese <brent@curie.ssctr.bcm.tmc.edu>

     I saw a similar message when the Bus Grant 3 and Interrupt Acknowledge
were not jumpered properly for our 7053 controller in slot 7 of a 4/360.
The system would crash though. Sounds like a board problem to me, but
I'm no expert here.
     The value of "addr" from the following line ...

> kernel read fault at addr=0xe0000007, pme=0x70000080

should give you a hint about which board it is. Look at the autoconfig
lines that are displayed when the system boots. Reseating or removing
the board (make sure you install jumpers if you remove a board) might help.

---------------------------------
From: Geert Jan de Groot <geertj@ica.philips.nl>

 the addresses you show are useless without the namelist of the
corresponding kernel. I suggest that you enable savecores (check /etc/rc.local),
and await the next crash, then look into the problem using adb and
the files generated with savecore (in /var/crash).
What you can check here is described in the sysadmin manual (really..)

[and in a later note ...]

I have checked our patches-database (I have a copy of the README's of
all bugs for which patches exist), and as of last Sunday, I couldn't
find a patch which looked like your problem. The only patches
that want to replace ie.o (the ethernet driver, where the panic comes from)
are:
        100261-02 SunOS 4.1.1: Sun4-490,4-470 only,misaligned frames from ie controller during heavy tr
        100321-01 SunOS 4.1.1:system crashes with iesynccmd panic
neither of which seem to apply to your problem. The last patch has to do
with multicasting, the first patch just prevents some cruft being sent
on the network.




This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:06:19 CDT