SUMMARY: resetting ecc handling?

From: Jim Van Verth (jvverth@figaro.med.harvard.edu)
Date: Fri Mar 13 1992 - 13:27:21 CST


About a week and a half ago, I posed the question:

>I seem to have a problem that I haven't seen here before, and I can
>find no reference to in the manual. Every minute or so, I keep
>getting the error
>
>Feb 29 14:05:19 figaro vmunix: mem0: soft ecc addr 1b0a120 syn 26<S8,S1,S0>
>bit 10 U2014
>Feb 29 14:05:21 figaro vmunix: resetting ecc handling
>
>Other than the time, the message does not change.

The basic consensus was that it was a bad memory board. I called Sun on
Monday; they replaced the memory board, and that took care of it. Thanks to
all who replied. Responses follow:

--------------------------------------------------------------------------
From: geertj@ica.philips.nl

U2014 on your memory board is bad. Ask Sun for a new memory board.

--------------------------------------------------------------------------
From: B.Rea@csc.canterbury.ac.nz

I wouldn't swear to it, but I believe that you get this when you get
a recoverable error in the RAM. With parity memory you would most likely
get a system crash; with ECC memory the extra bits allow the error
to be caught and corrected. There are always occasional bit errors in
RAM, but if you're getting them every minute, I'd say you've got a bad chip
in there somewhere. All those delightful numbers are telling you where
the error was. Time to call for the fixit man, I would guess.

------------------------------------------------------------------------
From: birger@vest.sdata.no (Birger Wathne)

This error msg says the memory module at U2014 on your memory board
fails at bit 10. The ECC correction manages to salvage your data.
(That's why you pay for that ECC memory, isn't it?)

You should also see the LED marked CE light up (Correctable Error) on
the memory board.

Action? Have Sun fix the board, I guess. As long as the address is
always the same, the memory chip responsible should be easily
locatable. Look for U2014 on the PCB.
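
One way to check that the address really is always the same is to tally
the messages from the log. What follows is only a minimal sketch in Python.
It assumes the message format seen in the single sample quoted above, where
the line may wrap before "bit". The names SOFT_ECC, SAMPLE and
tally_soft_ecc are made up for illustration and are not part of any Sun
tool.

```python
import re
from collections import Counter

# Pattern inferred from the one sample message in this thread. This is an
# assumption, not a documented SunOS message format.
SOFT_ECC = re.compile(
    r"(?P<unit>mem\d+): soft ecc addr (?P<addr>[0-9a-f]+) "
    r"syn (?P<syn>[0-9a-f]+)<(?P<syn_bits>[^>]*)>\s+"
    r"bit (?P<bit>\d+) (?P<chip>U\d+)"
)

# The message from the original question, wrapped as it was posted.
SAMPLE = """\
Feb 29 14:05:19 figaro vmunix: mem0: soft ecc addr 1b0a120 syn 26<S8,S1,S0>
bit 10 U2014
Feb 29 14:05:21 figaro vmunix: resetting ecc handling
"""


def tally_soft_ecc(log_text):
    """Count soft ECC reports by (address, bit, chip location)."""
    # Collapse all whitespace so a message wrapped before "bit" still matches.
    flat = " ".join(log_text.split())
    counts = Counter()
    for m in SOFT_ECC.finditer(flat):
        key = (m.group("addr"), int(m.group("bit")), m.group("chip"))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    for (addr, bit, chip), n in tally_soft_ecc(SAMPLE).items():
        print(f"addr {addr} bit {bit} chip {chip}: {n} report(s)")
```

If every report lands on one (address, bit, chip) key, that points at a
single part, as in this case. Reports spread over many keys would suggest
looking beyond one chip.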

--------------------------------------------------------------------
From: Eckhard.Rueggeberg@ts.go.dlr.de

This seems to be a bad memory chip, and U2014 should be its location on the
memory board. I have never seen the original Sun 4/4x0 memory board, but if
you can shut down your server :-( have a look at which kind of chips you need.
I hope they are not soldered!

--------------------------------------------------------------------
From: anthony@irene.Jpl.Nasa.Gov (Anthony Martin)

The only time I encountered this problem was on a 3/280 with lots of memory
supplied by Sun. I went round and round with Sun tech maintenance,
and we ran the diagnostics till we were blue in the face. They refused
to replace the memory because they couldn't get their diags to point
the finger at the memory board. I borrowed a comparable memory
board from another group, swapped it in, and had no more problems.
Like I said, I only saw this once, and it turned out to be a memory
board failure. Hope this helps.


