SUMMARY: write restore (disk sequencer error)

From: Patrick Lynn Shopbell (pls@pegasus.rice.edu)
Date: Fri Apr 10 1992 - 19:01:32 CDT


Thank you everyone for your prompt replies. A final solution has not
been reached, but I think it warrants a summary at this point.

The original question was:

> Hardware:
> SUN 3/280, Sun OS 4.1.1
> Xylogics controller - Fujitsu Double Eagle M2361 drive (580 Mby)
>
> Problem:
> About one week ago, we started getting periodic errors:
>
> xy0g: write restore (disk sequencer error) -- blk #346683 abs blk #486043
>
> The block numbers and partition appeared to be random in nature. As
> it got worse - probably 100 messages over 5 days - we brought the system
> down, reformatted, and attempted a backup. The reformatting went fine -
> no errors, but the above error started reappearing upon an attempted
> restore. The system could not be brought back up.
>
> Thus, we had the drive replaced. We repeated the process -> same
> result ! The formatting went fine, but a slew of about 8 of the above
> errors appeared when trying to restore. The computer then crashed with
> a bus error and locked up. An L1-A was required to regain any control.
>
> Our guess now is that it must be the controller board. We have
> added a SCSI board in the rack beside the drive board recently, although
> that was about a month ago, so it doesn't seem likely to be the source of
> the problem, since the disk errors only started last week.
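
For anyone who wants to check whether the block numbers and partitions
really are random, here is a minimal sketch in Python that tallies the
messages. It assumes every logged line has the same layout as the single
message quoted above; the names XY_ERROR, parse_xy_errors and summarize are
made up for illustration, and where the messages get logged on your system
is something you would need to check.

```python
import re
from collections import Counter

# Pattern modeled on the one message quoted in the original question.
# Assumption: other xy driver messages follow the same layout.
XY_ERROR = re.compile(
    r"(?P<unit>xy\d+)(?P<part>[a-h]):\s+(?P<what>.+?)\s+--\s+"
    r"blk #(?P<blk>\d+)\s+abs blk #(?P<abs_blk>\d+)"
)


def parse_xy_errors(lines):
    """Return (unit, partition, blk, abs_blk) tuples for matching lines."""
    hits = []
    for line in lines:
        m = XY_ERROR.search(line)
        if m:
            hits.append((m["unit"], m["part"], int(m["blk"]), int(m["abs_blk"])))
    return hits


def summarize(hits):
    """Count errors per partition and report the absolute block range."""
    per_partition = Counter(unit + part for unit, part, _, _ in hits)
    abs_blocks = [h[3] for h in hits]
    span = (min(abs_blocks), max(abs_blocks)) if abs_blocks else None
    return per_partition, span


if __name__ == "__main__":
    sample = [
        "xy0g: write restore (disk sequencer error) -- blk #346683 abs blk #486043",
    ]
    counts, span = summarize(parse_xy_errors(sample))
    print(counts, span)
```

Errors clustered in one partition or a narrow block range would point more
towards the media; errors spread across the whole disk fit better with the
controller, cable, or power suggestions below.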
 

Most suggestions pointed towards the controller board or the cables. Following
the procedures taken above, these items were also both replaced - with no
effect. The remaining applicable suggestions were to replace the motherboard,
which we probably won't do at this point, and an idea that when we added the
SCSI board the backplane voltage levels were pulled low enough to cause problems
with the Xylogics board. I find this unlikely only because we ran fine with
all boards in there for 2 or 3 weeks before experiencing any errors.

In the end, we brought the system back up, restored all files (receiving about
8 of the above errors), and are continuing operations. No errors have been
recorded since. If the problem re-surfaces, as it has for many who responded,
we will tackle it again.

More detailed responses are included below. Thanks again, especially to:
mp@allegra.att.com (Mark Plotnick)
yih@atom.cs.utah.edu (Benny Yih)
cgrady@eng.auburn.edu (Charles Grady)
aahvdl@eye.psych.umn.edu (Andrew Luebker)
jaapb@philfa.pfa.philips.de (Jaap Bottenberg)
Brian.Zavatsky@Corp.Sun.COM (Brian Zavatsky)
guyton%condor@rand.org (Jim Guyton)
kalli!kevin@fourx.Aus.Sun.COM (Kevin Sheehan {Consulting Poster Child})
ldavis!woden!gunn@snowbird.Central.Sun.COM (David Gunn)
mark@maui.Qualcomm.COM (Mark Erikson)
kross@analogous.com (Ken Ross)
dit@maths.aberdeen.ac.uk (David Tock)
derek@aivru.sheffield.ac.uk (Derek Jones)

Patrick

*----------------------------------------------------------------------------*
| Patrick Shopbell Department of Space Physics and Astronomy |
| pls@pegasus.rice.edu Rice University |
| P.O. Box 1892 |
| (713) 527-8750 x3640, x3511 Houston, TX 77251-1892 |
*----------------------------------------------------------------------------*

-----------------------------------------------------------------------------
From: mp@allegra.att.com (Mark Plotnick)

It is likely that it's a bad controller, or possibly
the backplane's power supply isn't putting out sufficient voltage.

-----------------------------------------------------------------------------
From: yih@atom.cs.utah.edu (Benny Yih)

        Wow, deja vu. I had about 5 rounds of the same thing (if I recall
correctly). Our olde 4/260 had a CDC & a Fujitsu slung off a Xylogics, all
from Sun. After replacing the Fujitsu about 4 times (including replacing both
w/ Fujitsus so both failed), the repairman finally gave us CDCs to replace
them. Apparently, there was a bad batch from the factory at some point
which eventually degassed volatiles into the drive housing & would fail.
The economics were such that it was cheaper to have customers tell them which
drives died than check their inventory explicitly. After reformatting, they
would work fine for a few days or weeks & die again.
        Perhaps more likely is to double check the cables. They sometimes
cause intermittent failures.

-----------------------------------------------------------------------------
From: Charles Grady <cgrady@eng.auburn.edu>

As an ex-Sun support engineer, I'd say your problem is the XY controller.

-----------------------------------------------------------------------------
From: "Andrew Luebker" <aahvdl@eye.psych.umn.edu>

We had a Sun-3/180 with the Double Eagle that was under Sun maintenance
for several years. The disk repeatedly failed at roughly one-year intervals,
but Sun just kept replacing either the whole drive or the HDA whenever the
problems occurred. I even received a memo from our field engineer, suggesting
that our problems were due to lack of humidity controls. (The computer was
usually failing during the dry winter months.)

When the machine died again, after the service contract expired, I posted
a message to sun-managers. At least two of the responses suggested that
the disk controller board was probably the culprit. Also, another person
told me to check whether the power supply voltage levels were okay.

Do your problems disappear if you remove the SCSI board? Maybe it is
pulling down your voltages enough to make the SMD controller goofy...

-----------------------------------------------------------------------------
From: jaapb@philfa.pfa.philips.de (Jaap Bottenberg)

Last year we had an analogous problem.
It was on the IPI bus: we got the same errors, replaced the disks, and after a
month the story repeated: errors, disk swap, a month O.K., and then disk errors
again. So it couldn't be the disk.

The problem was the connector. One of the clamped connectors on the flat cable
made a bad contact. Replacing the cable solved all the problems.

-----------------------------------------------------------------------------
From: Brian.Zavatsky@Corp.Sun.COM (Brian Zavatsky)

I would suspect the command cable from controller to
disk drive.

-----------------------------------------------------------------------------
From: Jim Guyton <guyton%condor@rand.org>

Yeah, we've had lots of xy450/xy451 failures over the years. Also
beware of cables! I particularly don't like the SMI "bulkhead"
stuff (those large DB cables) and personally prefer to bypass them
and just run those nice ribbon cables directly between the controller
board and the drives.

-----------------------------------------------------------------------------
From: kalli!kevin@fourx.Aus.Sun.COM (Kevin Sheehan {Consulting Poster Child})

In my experience, that is almost always the controller going. Replace
the controller and you should be fine.

-----------------------------------------------------------------------------
From: ldavis!woden!gunn@snowbird.Central.Sun.COM (David Gunn)

Last August we installed a SPARC2 at a customer site. The SPARC had 1
disk. In December we added a second disk. Very soon we started seeing
all sorts of panics, bad traps, "initiator detected error" messages. Of
course I suspected the new disk. But as we investigated it (getting
another new disk, swapping things in & out of the system) we found that
we started getting them when we were just running with the original disk &
configuration.

Finally we decided to swap out the mother board and guess what? The problem
went away and we haven't seen it since.

Perhaps your problem is somewhere else in the system and is affecting your
disk.

-----------------------------------------------------------------------------
From: mark@maui.Qualcomm.COM (Mark Erikson)

        We had a similar problem. We replaced the controller each time errors
    appeared and each time the messages would stay away for weeks to months.
    After almost a year of this (I was around for the last two controllers)
    the cable was replaced. Turned out that one pin on the cable was going
    bad and finally failed. Each time we plugged in the cable it fixed the
    problem until it was jarred. Thus, with each controller replacement, the
    errors went away.

-----------------------------------------------------------------------------
From: Ken Ross <kross@analogous.com>

We had some problems with our Eagles a month or so ago and, after suspecting
first the drives and then the controller card, we determined that the problem
was really with the cables that go from and between the drives and the
controller card. The error message that you are getting is not exactly the
same as ours, but apparently breaks in different cables and in different wires
within the cables result in different symptoms (at least according to the
consultant who came in, diagnosed our problem, and replaced the cables). His
explanation was that the cables have lots and lots of little wires packed into
them and, over time, the weight of the cables themselves pulling against the
connectors will cause some of the internal wires in the cabling to break.

So, I don't know if this is your problem or not, but it's one more component
to suspect as you try to debug - and the cables are cheaper than the controller
card if you just want to swap things until you get the problem fixed.

-----------------------------------------------------------------------------
From: David Tock <dit@maths.aberdeen.ac.uk>

This sounds very much like the problems we had with our 3/160, Xylogics
controller and 280Mbyte disks. The verdict was either loose cabling (not in
our case) or a dying disk drive. Sun replaced the disk and we were OK for
about a year. Then the other drive did the same. Sun replaced it. The
replacement lasted a couple of days, then the same again. The next replacement
lasted a couple of months, by which time we were off maintenance and we have
not done anything with it since. (Why bother, with 3.5" disks so cheap...)

The point to note is that if the disk was replaced on maintenance, it was
almost certainly a used disk that had been reconditioned, and possibly DOA.

You could try swapping the drives over (xy0 <-> xy1) to see if the problem
stays with the controller channel or the disk...

-----------------------------------------------------------------------------
From: derek@aivru.sheffield.ac.uk (Derek Jones)

Cables or controller

-----------------------------------------------------------------------------


