SUMMARY: Famed Bus Error Reg 80

From: Carlo L. Tiana (carlo@vision.arc.nasa.gov)
Date: Thu Nov 14 1991 - 21:08:37 CST


Another late and still inconclusive summary, but I guessed no more useful
info would come in at this point.

Original Posting:
-----------------
> I know this has been discussed before, but I can't find the relevant
> stuff. I remember saving it in a special file because I knew I was
> going to need it.....
> Anyway. What is the wisdom about this "Bus Error Reg 80<INVALID>" ?
> I have now seen this on 3 different Sun4 machines (a 4/370, a 4/300 and
> a 4/110); I seem to get this every time I try to install hardware in the
> machines; I presently have 2 Sky Warriors in a 4/370 which won't run
> because every time one tries to use them they crash the host this way; and
> an interface card in a 4/110 that does the same. I believe I heard the
> problem is on the Sun side, and I guess this is backed up by the fact that
> the different hardware does the same thing to different machines. Any
> hints? Here is the traceback from the latest occurrence (line breaks are
> mine).
>
> vmunix: BAD TRAP
> vmunix: (unknown): Data fault
> vmunix: kernel read fault at addr=0x98, pme=0x70000000
> vmunix: Bus Error Reg 80<INVALID>
> vmunix: pc=0xf8036a9c, sp=0xf80b5c50, psr=0x400cc6, context=0x0
> vmunix: g1-g7: f80ad0b8, 0, ffffffff, 3a, f80b6c00, 40000000, 22
> vmunix: Begin traceback... sp = f80b5c50
>
> <numeric traceback in original posting omitted as many pointed out it is
> useless and counter to the sun-managers rules to post one; I know it's
> useless to me; I didn't realize it was useless to the gurus out there; one
> day I plan to find out how to send out a more useful version of a
> traceback>
>
> vmunix: End traceback...
> vmunix: panic: Data fault
>
> More than ever appreciative of help,
> Carlo Tiana
> NASA Ames Research Center - (415) 604-0001
>

Credits:
--------
From: dj@astro.lsa.umich.edu
From: "barking at airplanes" <rodney@snowhite.cis.uoguelph.ca>
From: Douglas W. Johnson <doug@aer.com>
From: Nigel Titley <ntitley@axion.bt.co.uk>
From: kirk@zabriskie.berkeley.edu (Kirk Thege)
From: Joe Angelo <angelo@enterprise.arc.nasa.gov>
From: todd@flex.Eng.McMaster.CA (Todd Pfaff)
From: Chris.Drake@Corp.Sun.COM (Chris Drake)
From: kevins@Aus.Sun.COM (Kevin Sheehan {Consulting Poster Child})
From: mikulska@ece.UCSD.EDU (Margaret Mikulska)
From: stern@sunne.East.Sun.COM (Hal Stern - NE Area Tactical Engineering)
From: stumpf@sun8.psychologie.uni-freiburg.de (Michael Stumpf)
From: execu!sequoia.execu.com!unisql!alfred@cs.utexas.edu (Alfred Correira)

Findings:
---------
As I suspected, this had been asked before, and a summary was posted by
"barking at airplanes" <rodney@snowhite.cis.uoguelph.ca>.
This summary was forwarded to me by dj@astro.lsa.umich.edu. It is not a
conclusive one, but I enclose it at the end of this message for
completeness.
Here are new comments:

-someone having the same problem with different hardware; works flawlessly
 in a 4/470, crashes a 4/370. Sun - you guessed it - suggested upgrading to
 a 4/470.

-Verbatim from:
        Chris Drake
        US Answer Center
        Sun Microsystems Software Support
 "Maybe hardware, maybe not.
 The "Bus Error Register" contains the bits which identify what kind of error
 occurred - in this case, a reference to an invalid page. This, according to
 the messages file, is a Sun-4, kernel unknown, but it tried to read from
 location 0x98 - an unusable page - probably from within the device driver
 which attempts to reference the hardware.
 This could be due to lots of things, but assuming the driver(s) are fairly
 well debugged, then it might well be configuration on the card (is it jumpered
 to correspond to what the config file for the kernel claims?) or backplane
 problems (jumpers removed or reinstalled?). You need to get more information
 in order to go much further..."
 Good suggestions. We are definitely looking into it further. The jumpers
 are probably ok (our guess) since the cards work in a 3/160. The drivers
 we have to assume are well debugged; to most of them we don't have
 source.

-"I seem to recall that BAD TRAP happens when the kernel receives a weird
 hardware interrupt which it has no idea how to handle."

-Verbatim from:
 stern@sunne.East.Sun.COM (Hal Stern - NE Area Tactical Engineering)
 "a "reg 80 invalid" usually means that you tried reading through a null
 pointer. in your case, the read fault was at address 0x98, which
 is invalid (the first page of memory is marked invalid). this was
 probably an offset through a null pointer."
 He also suggests that the driver could be buggy (agreed, though see above)
 or that the board could be faulty (same, though see above also). He
 suggested doing
 # echo "f803ba9c?ia" | adb -k /vmunix /dev/mem
 and that if the routine shown "...it's part of the sky board driver, it's
 a bug. if it's in the rest of the kernel, then you may have a race
 condition or be passing something strange back from the board."

-Someone says: "the only "solution" I know of is to run GENERIC kernels
 (which works for me!!!)." Well, good one... :-) I may add that leaving
 the machine powered down also does not exercise the problem. :-)

Comments:
---------
Our 4/300 class machines were purchased at different times. We recently
tried putting the Sky Warrior in the later-Rev machine, and it appears to
work (whereas it crashes the older machines invariably). I will draw no
conclusions from this, until further testing proves that it does work - let
me just say that it passes the preliminary tests; this seems to point to a
problem that was recognized by Sun at some point and fixed in hardware.
(Note that this rules out interaction of the Sky board with other boards,
as we did a CPU swap between machines; we did not install the Sky board in
the other machine.) Take-home message: the later the rev the better?

Someone whose name I am not sure I should mention has been extremely helpful
under the circumstances in forwarding to me an "unofficial" Sun patch for
this (?) problem. This patch was mentioned in the original summary.
It did the trick for him. We have not yet installed it (as I say, one of
the boards we had problems with now appears to work, the other is in a
machine that is down for other reasons right now). I am not sure what to do
with this patch if I get swamped with requests for it (yeah, right). If
just a few requests come in, I might pass it on. Of course I have no idea of
what the "legal" implications of this are, though I would like to think we
are all here to help each other. So let me just say "if it bombs your
machine don't blame me" or whatever the legalese is for that.
Carlo.

Previous Summary:
-----------------
From: dj@astro.lsa.umich.edu

From: "barking at airplanes" <rodney@snowhite.cis.uoguelph.ca>
Subject: SUMMARY: Bus Error Reg 80<INVALID>

SUMMARY: "Bus Error Reg 80<INVALID>","bus error reg 80<invalid>","BUS ERROR REG 80<INVALID>" <- to make life easier with grep.

        A few days ago I posted asking if others had more info on the
error "Bus Error Reg 80<INVALID>". This was/is occurring on a Sun 3/50
running 4.1 (no patches). I suspected hdwr as the cause, but from the
responses I've gotten it is the OS. The problem exists in 4.1 and 4.1.1.

 doug@aer.com (on a 3/60, OS4.1.1) said the problem is a memory page not
being allowed to write out to where it wants to go.

 ntitley@axion.bt.co.uk reported that Sun told him it
was "a known problem" (at least for his 4/330) and 4.1.1 would correct it.

 However, kirk@zabriskie.berkeley.edu who has a
4/280 running 4.1.1 said he had the same problem and that Sun had sent
him a patch to correct this. He's in the process of installing it, so he
couldn't say how well it worked. The patch has not been posted by Sun.
(But, gee it would be nice :) ). The patch is supposed to be applied to 'locore'.

Snatches from their e-mail are given below.

-Rodney
rodney@snowhite.cis.uoguelph.ca
Department of Computing and Information Science
University of Guelph
Guelph, Ontario ph#: (519)824-4120 x8136,x4297
CANADA fax: (519)837-0323
N1G 2W1

From: Douglas W. Johnson <doug@aer.com>
Douglas Johnson dwj@aer.com

>symptoms you described. My machine is a used 3/60 (SunOS 4.1.1, one
>141x60Mb shoebox, standalone) that I just recently bought from Apex.
>So far I've learned that the "Bus Error Reg 80<INVALID>" message is an
>error when the system tries to write to a memory page that it is not
>allowed to (this is what the tech engineer indicated). I changed out

From: Nigel Titley <ntitley@axion.bt.co.uk>

>We currently have this on our 4/330. It definitely isn't hardware, we've had
>everything possible changed. Sun now say that it is a known bug with 4.1 and
>will be fixed by upgrading to 4.1.1. I'll be trying this in a couple of weeks.

From: kirk@zabriskie.berkeley.edu (Kirk Thege)

>We're seeing this a lot on two of our 4/280's which we recently upgraded to
>SunOS4.1.1. Sun has sent a patch to locore which I have not yet installed
>(and is not yet a released patch). They might have a similar patch for
>your 3/50.


