SUMMARY: IPI disk error messages

From: Daniel Hurtubise (daniel@CANR.Hydro.Qc.CA)
Date: Thu Jul 02 1992 - 22:20:32 CDT


Sun managers,

My original posting:

>I have been receiving the following error messages on
>my SparcServer 4/390 console lately from disk0 (IPI disk):
>
>id000g: block 1772844 (1985634 abs): read: Conditional Success. Data Retry Performed.
>
>This message is repeated many times for different blocks.
>
>(I also had the same message coming from partition id000a.)
>
>I had the disk replaced and the messages have come back for partition
>id000g only. Is this really a disk problem or could the error be
>caused by something else such as the controller?
>
>There are only two disks per controller, the cables seem to be
>attached properly and I can't really say that disk activity is
>really high as we have about 50 users divided equally on two
>servers where the other server is running fine with basically the
>same configuration.
>
>I've been told that a format will correct the problem 80% of the time
>but I'm not convinced that this is the solution since two different
>disks have yielded the same error.

Here is my summary.

There was no absolute answer to this problem. I ended up upgrading my
controller to the latest rev. and replacing and formatting the disk.

Since then, I've had no error messages.

By the way, even though Sun says in its open issues to disregard the
message, I strongly suggest keeping a close eye on your system if
the messages do appear. If the frequency of messages increases, it
is in your best interest to check the revision of your controller and
check your disk.
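
One way to watch the frequency is to tally the messages per partition from
saved console or syslog output. The following is a minimal sketch, not taken
from any of the replies. It assumes each message sits on a single line in the
format quoted in the original posting and that device names look like
"id000g". The function name and the second and third sample lines are made up
for illustration.

```python
import re
from collections import Counter

# Modeled on the one-line console message quoted in the original posting.
# Assumption: device names look like "id000g" (id + three hex digits + a-h).
MSG_RE = re.compile(
    r"(id[0-9a-f]{3}[a-h]): block \d+ \(\d+ abs\): "
    r"(?:read|write): Conditional Success"
)


def count_by_partition(lines):
    """Return a Counter mapping partition name to number of retry messages."""
    counts = Counter()
    for line in lines:
        match = MSG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The first line is from the posting; the other two are made up.
SAMPLE = [
    "id000g: block 1772844 (1985634 abs): read: Conditional Success. Data Retry Performed.",
    "id000a: block 1200 (1200 abs): read: Conditional Success. Data Retry Performed.",
    "id000g: block 1772900 (1985690 abs): read: Conditional Success. Data Retry Performed.",
]

for partition, count in sorted(count_by_partition(SAMPLE).items()):
    print(partition, count)
```

Running the same tally on each day's log and comparing the totals would show
whether the message rate is climbing.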

Some of the replies I have received follow.
Thanks to all who replied!

===============================================================================
The answer to this error message is to swap the controller. Either upgrade
your IPI controller ( if you have an older revision level ) to a rev level
of 8 or greater ( present rev is 09 ) or replace the present controller with
a new one ( if the rev is current ).

After you swap the controller you need to reformat the drive, run newfs
for each filesystem and restore the data. Theoretically, running newfs alone
should work, since all the corrupted inodes in the filesystem will be cleaned
and recreated, but doing a format at this stage costs you little more than
an extra hour. In return you will know whether there are any new defective
blocks.

Remember one thing: if there are too many errors on the disk, you may
have problems backing up the filesystems. I suggest you back up your data
and check that the backup is okay by restoring it on a spare partition.

Note: There are three different part numbers for IPI controllers. They are

         501-1313 ( up to rev 04 )
         501-1539 ( up to rev 09 )
         501-1855 ( up to rev 03 ).

 Although Sun claims that 501-1855 rev 02 is equivalent to 501-1539 rev 8 and
 1855 rev 03 is equivalent to 1539 rev 09, I suggest that you get part number
 501-1539 rev 09.

 Anil Katakam
 AT&T Bell Labs,
 67 Whippany Rd.,
 Room 2a-101
 Whippany,NJ 07981

==============================================================================
Did you reformat the new replacement disk when you got it? I suspect not. This
may sound ridiculous but it seems that Sun has shipped a good many of these
IPI's with a sub-optimal format. I don't know the details; this is just what
I've heard.

I just finished going through the hellish end of what you are just beginning
to see. The messages will start appearing on one partition for about a month
or so. Then you'll start seeing them on multiple partitions on that disk.
Then multiple disks. You'll have that controller replaced, only to find that
they start appearing on the other controller. Now you've got them by the
hundreds each day. It's difficult to schedule downtime to try to remedy the
problem because the machine is in a Medical/Clinical setting and direct
patient care is involved. Soon the "Conditional Success" messages turn into at
least 3 failures/day with some amount of data corruption (can you say "backups").
After several visits by the Sun techie ( actually we get a Bell Atlantic
flunkie contractor) who has no clue, you decide to reformat all 4 gigabytes of
IPI in one fell swoop. 22 hours and 3 Gb of restores later, you cross your
fingers and take the next day off to get some sleep.

I went through that from November '91 to May 9th. I haven't seen the messages
since.

Good Luck,

Phil Antoine (antoine@RadOnc.Duke.EDU)
Duke University Medical Center
Radiation Oncology Physics

===============================================================================
Page 104 of the SunOS 4.1.1 Release and Install Guide states the
following:

> Error Messages During Heavy Disk Activity (1036367) [4.1]

> During heavy disk activity, error messages similar to the one below
> may appear. They can be disregarded.

> Apr 9 13:43:46 muishu vmunix: id003h: block 849694 (849694 abs):
> write: Conditional Success. Data Retry Performed.

Although Sun says to ignore them, I only ignore them to a point. If you
see many errors in a specific area, I would recommend a read/write verify.
I've done this in the past and found errors on the disk that Sun told me
to 'disregard'.
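
To judge whether errors are concentrated "in a specific area", the absolute
block numbers in the messages can be grouped into fixed-size regions. This is
a minimal sketch under stated assumptions, not anyone's actual procedure: it
assumes "abs" is the block number counted from the start of the disk, so that
abs minus block gives the partition's starting offset. The region size of
10000 blocks, the function names, and the third sample line are hypothetical.
The second sample line is Sun's example joined onto one line.

```python
import re
from collections import Counter

# Captures the partition-relative block and the absolute block.
BLOCK_RE = re.compile(
    r"block (\d+) \((\d+) abs\): (?:read|write): Conditional Success"
)


def partition_offset(line):
    """Return abs - block for one message, or None if the line does not match.

    Assumption: this difference is where the partition starts on the disk.
    """
    match = BLOCK_RE.search(line)
    if match is None:
        return None
    return int(match.group(2)) - int(match.group(1))


def errors_by_region(lines, region_blocks=10000):
    """Count messages per region, keyed by the region's first absolute block."""
    regions = Counter()
    for line in lines:
        match = BLOCK_RE.search(line)
        if match:
            abs_block = int(match.group(2))
            regions[(abs_block // region_blocks) * region_blocks] += 1
    return regions


LINES = [
    "id000g: block 1772844 (1985634 abs): read: Conditional Success. Data Retry Performed.",
    "vmunix: id003h: block 849694 (849694 abs): write: Conditional Success. Data Retry Performed.",
    # Made-up line for illustration only.
    "id000g: block 1772900 (1985690 abs): read: Conditional Success. Data Retry Performed.",
]

for start, count in sorted(errors_by_region(LINES).items()):
    print(start, count)
```

A region whose count keeps growing would be a natural candidate for the
read/write verify described above.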

In addition, I can tell you that the problem is not fixed in 4.1.2.
My 4.1.2 690 with 24 IPI's still spits out about 10 a month. I schedule
disk maintenance quarterly and run read/write verify on all drives. I
usually find something.

Hope this helps
_____________________________________________________________
Steven Giuliano Voice: 215/241-4296
                               Fax: 215/241-4239
Independence Blue Cross
1901 Market Street UUNET: uunet!smw002!steve
Philadelphia, Penna 19103 Internet: steve@ibx.com
_____________________________________________________________

===============================================================================
I've had this problem on sixteen IPI disks and in all cases reformatting
the drive was all that was needed. At first I used analyze to find all the
problems and spare those sectors but after a while I had quite a few bad
blocks! Then I took a closer look at the signals from the IPI disks and
after talking to some people at Seagate I think I understand the problem.

IPI disks usually have multiple surfaces (one & sometimes two heads per
surface) with one surface acting as a servo (timing marks, etc.). At the
beginning of each track on a surface there is some additional servo info
to allow the head to settle after a head switch (heads are accessed one
at a time in a given cylinder).

The problem is that due to age, drive orientation, etc., the timing marks
at the start of a track, the sector gaps, and so on, which are written during
low-level format, become slightly displaced from where the head wants to be
(e.g. relative to servo stuff). The resulting tracking errors make the signal
marginal, and soft or hard errors result.

Seagate said the 6 Mbyte/sec dual head/surface eight inch drives are very
sensitive to this problem (this is the drive used in the old 4/470 pedestal).
The newer 5 1/4 inch drives are better but any drive can have this problem.

The solution is to extract the original defect list from the disk and then
do a low-level format. After the format, run a read/write/compare analyze
on the whole disk with at least 8 (non-random) patterns. The Seagate people
recommend reformatting a disk at least once a year as preventive maintenance
for these kinds of problems.
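
The idea behind a read/write/compare pass with fixed patterns can be sketched
as follows. This is only an illustration of the technique, not what the
format utility's analyze actually does. It runs against an ordinary scratch
file rather than a raw disk, so the operating system's cache may satisfy the
reads. The eight byte patterns, the 512-byte block size, and the function
names are all hypothetical choices. The test is destructive: it overwrites
every block of the file it is given.

```python
import os
import tempfile

BLOCK_SIZE = 512  # hypothetical block size for this sketch
# Eight fixed (non-random) byte patterns; hypothetical choices.
PATTERNS = [0x00, 0xFF, 0xAA, 0x55, 0xCC, 0x33, 0xF0, 0x0F]


def pattern_test(path, num_blocks):
    """Write each pattern to every block, read it back, and compare.

    Returns a list of (pattern, block_number) pairs that did not match.
    Destroys the previous contents of the first num_blocks blocks.
    """
    bad = []
    with open(path, "r+b") as f:
        for pattern in PATTERNS:
            expected = bytes([pattern]) * BLOCK_SIZE
            f.seek(0)
            for _ in range(num_blocks):
                f.write(expected)
            f.flush()
            os.fsync(f.fileno())
            f.seek(0)
            for block in range(num_blocks):
                if f.read(BLOCK_SIZE) != expected:
                    bad.append((pattern, block))
    return bad


def run_on_scratch_file(num_blocks=64):
    """Create a temporary file, run the pattern test on it, then delete it."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * BLOCK_SIZE * num_blocks)
        path = tmp.name
    try:
        return pattern_test(path, num_blocks)
    finally:
        os.remove(path)


print(len(run_on_scratch_file()), "mismatched blocks")
```

On a real drive the vendor's own format/analyze tool remains the right way
to do this, since only it can work below the filesystem and update the
defect list.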

As a note - when you get a new IPI (or SCSI) disk it's best to install it and
then let it run for a few days before formatting it to allow it to "age" in
the "as used" environment. This will limit the amount of "data creep" that
can occur.

---
Terry F Figurelle			Boeing Defense & Space Group
email: tff@plato.ds.boeing.com		PO BOX 3999, Mail Stop 8H-56
phone: 206-773-9987 fax:206-773-3542	Seattle, WA 98124-2499

==============================================================================