SUMMARY - DISK ERRORS

From: Everett Schell (schell@molbio.cbs.umn.edu)
Date: Sat Nov 05 1994 - 07:41:03 CST


   ORIGINAL QUESTION:

> It's happening again. When we reboot our 630MP SparcServer, the SCSI
> bus usually gives error messages for a few minutes, then settles
> down to a few messages a week. But two weeks ago and again today it
> never really cleared up. Two weeks ago this appeared to be related to
> the whole system locking up after two days. Here is a sample of the
> messages:
>Oct 28 11:53:41 molbio vmunix: sd3e: Error for command 'write'
>Oct 28 11:53:41 molbio vmunix: sd3e: Error Level: Retryable
>Oct 28 11:53:41 molbio vmunix: sd3e: Block 662208, Absolute Block: 1177800
>Oct 28 11:53:41 molbio vmunix: sd3e: Sense Key: Aborted Command
>Oct 28 11:53:41 molbio vmunix: sd3e: Vendor 'SEAGATE' error code: 0x47
>Oct 28 11:54:08 molbio vmunix: sd3g: Error for command 'write(10)'
>Oct 28 11:54:08 molbio vmunix: sd3g: Error Level: Retryable
>Oct 28 11:54:08 molbio vmunix: sd3g: Block 968176, Absolute Block: 5433868
>Oct 28 11:54:08 molbio vmunix: sd3g: Sense Key: Aborted Command
>Oct 28 11:54:08 molbio vmunix: sd3g: Vendor 'SEAGATE' error code: 0x47
> . . . . . .
>
> The blocks appear to be distributed over the whole disk and all three
> mounted partitions. I did format testing on one small partition and found
> no bad spots at all.
>
> Last time, after we rebooted following the lockup, it settled down for
> over a week, until we rebooted again this morning. This is our main
> server with hundreds of users, so it's not a simple matter to reboot on
> short notice. We're afraid it will lock up and crash the system in a day
> or two like before.
>
> Hardware details: We replaced the Sun CPU modules with ROSS RT625
> HyperSparcs a few months ago. The disk is a Seagate ST43400N 2.9 Gig
> which was new last Winter. It is in an external enclosure with a
> ST10800 Elite 9, plus an external Exabyte and a Sun CDROM. We also have
> two HP 3010 disks on a second SCSI bus Sun FSBE card.
> Perhaps the three external devices make the chain too close to the
> SCSI-2 limit? Yet this problem was very minor for several months until
> two weeks ago. Someone suggested updating the ROM chip on the disk,
> but the disk is less than a year old. What about some of the ROM chips
> on the motherboard? Some of them came with the ROSS modules.
>
> Any advice? I'm going to try and shorten the longest of the SCSI cables
> on this chain.
>
---------------- RESPONSE SUMMARY ----------------------------------

1. Shorten cables. [This is all I've done so far; it cut errors by 98%]

2. Get "impedance matched cables".

3. Check all connections involved with the SCSI chain. ["reseating" the
   connections is a good general procedure that has also helped solve RAM
   and motherboard problems in many machines]

4. Try replacing the terminator and/or the cables.

5. Use fsck with "certain" [unspecified] flags to make errors go away.
  
6. Replace third party external enclosure, as they sometimes do not work
   well with fast SCSI.

7. Replace the disk before it really goes bad. [hopefully, the other steps
   will show that the disk is actually healthy, when the error messages go
   away completely]

8. Bad blocks show that the disk needs to be repaired with the format command.
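
   For the cable-length suggestions, a rough length budget can be sketched
   as below. This is only an illustration: the 6 m and 3 m figures are the
   commonly cited single-ended SCSI-2 limits (slow and fast transfers), the
   function names are made up for this sketch, and every segment length
   other than the six-foot cable and its three-foot replacement is a
   hypothetical placeholder, not a measurement from this system.

```python
FEET_TO_METERS = 0.3048

# Commonly cited single-ended cable limits (assumption: check the controller
# and drive documentation for the figures that apply to your bus).
LIMIT_M = {"scsi2_5mhz": 6.0, "fast_scsi_10mhz": 3.0}


def bus_length_m(segments_ft):
    """Total bus length in meters from a list of segment lengths in feet."""
    return sum(segments_ft) * FEET_TO_METERS


def headroom_m(segments_ft, mode):
    """Remaining length budget in meters; negative means over the limit."""
    return LIMIT_M[mode] - bus_length_m(segments_ft)


# Hypothetical chain: the first entry is the cable that was swapped (6 ft,
# then 3 ft); the remaining external and in-enclosure lengths are guesses.
before = [6, 3, 3, 1, 1, 1]
after = [3, 3, 3, 1, 1, 1]

for label, chain in (("before", before), ("after", after)):
    for mode in LIMIT_M:
        print(f"{label:6s} {mode:16s} headroom {headroom_m(chain, mode):+.2f} m")
```

   Wiring inside each enclosure counts toward the total, so a real budget
   needs those lengths measured rather than guessed.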

    
   After replacing a six-foot cable with a three-footer, the errors dropped
  from 50 or 100 per hour to one or none. Kevin Sheehan's suggestion about
  impedance matching of the cables reminded me that the Exabyte drive
  was replaced a week before this started. The new drive must have different
  internal electrical (impedance) properties that pushed the previously
  functioning SCSI chain "over the edge". But the problem didn't show up until
  we rebooted: normally the bus seems to go through a self-adjusting phase and
  then settles down to few or no errors, and this time it never settled.
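
   One way to put numbers on a change like this is to tally the error
   reports in the system log per hour and per partition. The sketch below
   assumes the log lines look like the sample quoted in the original
   question; the function name tally_errors and the embedded sample are
   illustrative only, and a real run would read the lines from the actual
   log file instead.

```python
import re
from collections import Counter

# Matches the first line of each error report in the quoted sample, e.g.
# "Oct 28 11:53:41 molbio vmunix: sd3e: Error for command 'write'"
ERROR_LINE = re.compile(
    r"(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"\S+\s+vmunix:\s+(?P<part>sd\d+[a-h]):\s+Error for command"
)


def tally_errors(lines):
    """Count error reports per (month, day, hour) and per partition."""
    per_hour = Counter()
    per_partition = Counter()
    for line in lines:
        m = ERROR_LINE.search(line)
        if m is None:
            continue  # detail lines (Sense Key, Block, ...) are not counted
        per_hour[(m["month"], int(m["day"]), int(m["hour"]))] += 1
        per_partition[m["part"]] += 1
    return per_hour, per_partition


# A few lines copied from the sample in the original question.
SAMPLE = """\
Oct 28 11:53:41 molbio vmunix: sd3e: Error for command 'write'
Oct 28 11:53:41 molbio vmunix: sd3e: Error Level: Retryable
Oct 28 11:54:08 molbio vmunix: sd3g: Error for command 'write(10)'
Oct 28 11:54:08 molbio vmunix: sd3g: Sense Key: Aborted Command
""".splitlines()

per_hour, per_partition = tally_errors(SAMPLE)
for (month, day, hour), count in sorted(per_hour.items()):
    print(f"{month} {day:2d} {hour:02d}:00  {count} errors")
for part, count in sorted(per_partition.items()):
    print(f"{part}: {count} errors")
```

   Counting only the "Error for command" line keeps each report from being
   counted five times, once for every detail line that follows it.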

   So it's much better for the time being. I will try to shorten the cables
  by another foot or so, and try moving the Exabyte to the other SCSI bus,
  an obvious move that was too simple to think of.

  
 Thanks for the many responses. I may have lost one or two, but they came from:

Kevin.Sheehan@uniq.com.au
Chris Schmechel schmec@med.unc.edu
harishm@pcsdnfs1.eq.gs.com
perryh@pluto.rain.com
Zia iqbalz@cnt.gs.com
Dave.Curado@HK.Super.NET
vincent.everett@mrc-applied-psychology.cambridge.ac.uk
Bruce Harrell bharrell@digit.com
Haiquan Dai daili@csbnmr.health.ufl.edu
mike_raffety@il.us.swissbank.com
Gene Loriot epl@Kodak.COM
Bonnie Lucas lucas@Nadn.NAVY.MIL
Louis M. Brune louis@meg.meg.saic.com
Grant Bohnet grantb@physio.wa.com
John DiMarco <jdd@cdf.toronto.edu>

  Everett Schell, SysAdm, MBCC, Biological Sciences, U of Minnesota


