Summary: Ecache parity problem on Sun Enterprise servers

From: Tony Tran <tonytran_at_contour.com>
Date: Tue Apr 23 2002 - 20:25:09 EDT

My thanks go to:

Octave Orgeron <unixconsole@yahoo.com>
Paul Keller <pkeller@cisco.com>
Kailashnath Rampure <kailash@tivo.com>
"Mike's List" <mikelist@sky.net>
Hichael Morton <mh1272@yahoo.com>
Serguei Borkov <sborkov@yahoo.com>
"Troy Abernathy" <tabernathy@r2-tech.com>
mike.salehi@kodak.com
Tim Chipman <chipman@ecopiabio.com>
"Miller Sutfin" <millersutfin@earthlink.net>
"Karl Vogel" <vogelke@dnaco.net>

Many respondents think this is a hardware issue related to a faulty design
of the faster chips (400 MHz or faster), aggravated by heat.
The "bad" combination seems to occur on servers with multiple 400/450 MHz
CPUs (Sun Cluster?) in a server room that is not very cool or not well
ventilated.
Replacing the defective CPU usually fixed the problem.

Octave Orgeron seems to summarize it best:

" ... Your question about the E-Cache Parity problem on the
400Mhz USII CPU's is a fun question to answer. The
problem came from a manufacturing error from
Solectron, who assembles the CPU module, TI makes the
CPU. They used some sub-standard cache modules from
IBM that caused issues and the thickness of the PCB
was not right. As a result, anything from heat to
radiation could cause the E-Cache error. There are two
paths to fixing this.. one is install the patch, that
disabled the E-Cache.. this causes *serious loss*  in
performance. The other path is to get a replacement
CPU module, make sure that it's built in Canada, those
are the good CPU modules.. it'll say "Made in Canada"
on the side:) "

Paul Keller <pkeller@cisco.com> indicates that Sun now has a
mirrored Ecache that addresses this particular Ecache problem:

 "... I feel your pain.

  Sun eventually came out with a 400MHz processor that came
  with mirrored eCache. That seems to have helped the 400s
  .... But, we've been seeing a lot of the same problems with
  the 440s that run on the Netra hosts. To my knowledge, they
  haven't dealt with that yet. "

Serguei Borkov <sborkov@yahoo.com>:

Had it before, done 2 things: replaced CPU, and
rearranged environment to run at about 35C on CPU.
Problem seemed to be gone.

And according to Troy Abernathy <tabernathy@r2-tech.com>, a Sun
reseller:

Based on my understanding from my engineers and clients, the revision
501-5661 and above for the 400 MHz processors eliminated the problems.
As far as using a patch, I am pretty sure that there is no such fix.  He
can either check all of his CPU's to see what their part numbers are, or
have them replaced.  That can be a very expensive task though.  I am a
reseller and I have the CPU's listed for $995.  If he has many to replace
that could be painful.  If he needs additional help or would like to speak
with one of my engineers let me know and I will hook them up.  Good luck.

Tim Chipman <chipman@ecopiabio.com> provides yet another perspective on
this problem: Sun will not replace a CPU until the same chip has crashed
a couple of times (now try that on a critical production Sun cluster
with many CPUs). There is no way to predict this in advance (even with
SunVTS diagnostics).

" ... The "solution" from sun (in my experience):  If a single CPU has
more than 2 hits of the e-cache error, the part is considered flawed and is
replaced. AFAIK, there is nothing in software that can be done.  The later
rev of the CPU is a model with "mirrored E-Cache", and isn't prone to this
fault. However, I don't think there is any fix for older (affected) parts
which are showing symptoms - other than replacement.

Even more fun, I'm not aware of any way to detect the potential for the
problem other than wait for it to strike.

We have a 3500 here (4 CPUs) and endured numerous e-cache related crashes,
because different CPUs were doing it. Sun wasn't willing to replace any
parts until a single specific CPU showed "multiple failures" ... I can't
imagine what somebody with an 8-cpu (16?) cpu system would do - simply wait
patiently until 16 or more e-cache related crashes happen, and then force
sun to replace all the parts en masse ... ?

Clearly, I am not so amused with the entire issue. If Sun was doing the
"right" thing, they should pro-actively replace all old chips at risk which
are still in service rather than waiting for people to suffer repeated
crashes and then get the replacement.  However, the chances of them doing
this ... seem minimal. "
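
To make the replacement rule Tim describes concrete (a part is swapped only
once a single CPU has logged more than 2 e-cache hits), here is a minimal
Python sketch. It is only an illustration: the function name, the threshold
constant and the crash history are made up for this example, and it does not
parse any real Solaris log format.

```python
from collections import Counter

# Assumption: threshold taken from the quoted description
# ("more than 2 hits" on a single CPU). Not an official Sun figure.
REPLACEMENT_THRESHOLD = 2


def cpus_to_replace(error_cpu_ids, threshold=REPLACEMENT_THRESHOLD):
    """Return the sorted CPU IDs whose e-cache error count exceeds threshold."""
    counts = Counter(error_cpu_ids)
    return sorted(cpu for cpu, hits in counts.items() if hits > threshold)


# Hypothetical crash history for a 4-CPU system: each entry is the ID of
# the CPU that reported an e-cache error at the time of a crash.
history = [0, 2, 1, 3, 0, 2, 1, 0]
print(cpus_to_replace(history))  # prints [0]
```

With this made-up history the system has crashed eight times, yet only CPU 0
qualifies for replacement, which is exactly the situation complained about
above.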

Surprisingly, a couple of people reported that Sun's "memory scrubber
patch" seems to work for some Ecache problems and some CPUs.

So far I have yet to see any FCO (field change order) from Sun.
This problem reminds me of the sticky head problem on the Pro Quantum
100 MB disk drive or the intermittent problem with the Vixel fiber module.

Tony
Received on Tue Apr 23 20:30:44 2002
