SMD drive problems, SUMMARY

From: fabrice cuq (fabrice@yosemite.ATMOS.Ucla.EDU)
Date: Tue Feb 04 1992 - 21:27:18 CST


Hi,

A while ago, I wrote:

>One of the SMD drives (Fujitsu 2382K) is giving a lot of errors like:
>xd1h: write retry (header not found) -- blk #1263541, abs blk #1263541
>xd1h: write retry (header not found) -- blk #1263539, abs blk #1263539
>xd1h: write failed (header not found) -- blk #1263539, abs blk #1263539
>I recreated the filesystem with newfs, but after a while the errors came
>again.
>Is this a symptom of a hardware problem with the disk? Does the disk need to
>be reformatted?

I reformatted the disk and so far have had no problems. If only a few blocks
are bad, you can "repair" the disk without reformatting everything (in my
case there were too many bad blocks), as described in the summary. A couple
of people pointed out that they eventually replaced their disks. Some said
this problem might occur when the temperature changes...

SUMMARY:

To identify which file a bad block belongs to:

icheck -b <bad_block#> /dev/r...
This lists (amongst other things) the i-node that refers to that block.
You can then do
ncheck -i nnnn /dev/r..
to find the file name(s) associated with that i-node.

Someone sent me this memo, which contains useful info:

                     What File Has The Disc Error?

                             by John Walker
                   Revision 0 -- December 21st, 1989

                                ABSTRACT
                                ========

        When a single block or contiguous area on a Sun (or
        other Unix) system's hard disc fails, one of the most
        obvious and immediately important questions that
        arises is "What file contains the error?". Amazingly,
        there is no simple, standard utility that answers this
        question, leaving the user knowing that some data have
        been destroyed, but not what. If backups are current,
        the user doesn't know what files to reload after the
        failed area is reassigned to an alternate track or
        made unavailable for allocation. This paper presents
        a cookbook procedure, based on information provided by
        Bob Elman, for determining which file contains a bad
        disc block.

                              INTRODUCTION
                              ============

When my hard disc presented me with its latest holiday surprise, I
ended up with 100% repeatable errors on a specific track, head, and
sector. Immediately after the error occurred, I ran an incremental
backup which, naturally, encountered read errors. At that point I had
a current set of backups from which I was perfectly willing to reload
or rebuild any files that occupied the area of the disc that had
failed, but I didn't know which files were involved. DUMP didn't tell
me, when it so kindly reported an error during the backup; even though
it clearly knows the INODE number it was dumping when the error
occurred, it didn't deign to print it.

Bob Elman explained the procedure one uses to find what file contains
a given disc block, and it worked just fine, telling me that the error
was in an executable file I could simply re-link after I'd fixed the
disc by reformatting the track that failed. Since the procedure is
less than obvious and nowhere explained in the Unix manuals I've seen,
I decided to write it down so I'd have it at hand the next time this
happened, and to help the next poor sucker victimised by a hard disc
failure. You might want to print this message on a piece of paper and
file it in your system administration manual--when you need it, you
may not be able to get it from a file on your disc.

                            FINDING THE FILE
                            ================

We start out knowing that a hard disc contains one or more bad blocks.
The first symptom that something is wrong is usually Unix console
messages reporting I/O errors on the drive. Most of these I/O error
messages give the block number that failed but since Unix reads and
writes large buffers, these numbers should be considered as giving
only the general area of the actual error. The first step, then, is
to identify the actual blocks that contain the errors.

What Blocks Are Bad?
--------------------

(Sun specific.) Initially, note the drive number from the disc error
message. In a typical message like:

xd1c: write failed (header not found) -- blk #1317140, abs blk #1317140

the drive name is "xd1c". To find out what file system this
corresponds to, type "df", which will print something like:

Filesystem kbytes used avail capacity Mounted on
/dev/xd0a 15502 1946 12005 14% /
/dev/xd0h 514106 430020 32675 93% /usr
/dev/xd1c 659242 569911 23406 96% /usr2
/dev/xd0g 42406 8554 29611 22% /var

In this case, you can see that "xd1c" is mounted as your /usr2
filesystem. (The currently mounted file systems are recorded in the
file /etc/mtab, which you can type; the default mounts are listed in
/etc/fstab.)
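
If you have many console messages to sift through, the lookup above can be
sketched in a few lines of Python. This is only an illustration: the pattern
is modelled on the one xd message quoted in this memo, the assumption that
read errors use the same layout is mine, and the function names are
hypothetical.

```python
import re

# Pattern modelled on the xd console message quoted above. Assumption:
# read errors use the same layout; other drivers may format theirs differently.
XD_ERROR = re.compile(
    r"^(?P<dev>[a-z]+\d+[a-h]): (?P<op>read|write) (?P<what>retry|failed) "
    r"\((?P<reason>[^)]*)\) -- blk #(?P<blk>\d+), abs blk #(?P<abs>\d+)$"
)


def parse_xd_error(line):
    """Return the fields of one console line as a dict, or None if no match."""
    m = XD_ERROR.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["blk"] = int(fields["blk"])
    fields["abs"] = int(fields["abs"])
    return fields


def mount_point(df_output, dev):
    """Return the "Mounted on" column for /dev/<dev> in df output, or None."""
    for row in df_output.splitlines()[1:]:   # skip the header row
        cols = row.split()
        if cols and cols[0] == "/dev/" + dev:
            return cols[-1]
    return None
```

Feeding it the message and the df listing shown above yields the drive name
"xd1c", block 1317140, and the mount point /usr2.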

Shut down your system and bring it up single user with "b -s". In
single user mode, run "format". When you fire up format, it asks you
to choose the disc you want to work on; pick the one from the error
message. For example:

throop# format
Searching for disks...done
 
AVAILABLE DISK SELECTIONS:
        0. xd0 at xdc0 slave 0
           xd0: <CDC 9720-850 cyl 1358 alt 2 hd 15 sec 66>
        1. xd1 at xdc0 slave 1
           xd1: <Fujitsu-M2372K cyl 743 alt 2 hd 27 sec 67>
Specify disk (enter its number): 1
selecting xd1: <Fujitsu-M2372K>
[disk formatted, defect list found]

Here, I've entered "1" to choose "xd1". (The "c" in the drive name from
the error message is a partition name, but at this level format is working
on the whole disc.)

Next, we want to get the physical disc address of the block number
reported in the error message. Enter the "show" command, and type in
the error block number:

format> show
Enter a disk block: 1317140
Disk block = 1317140 = 0x141914 = (728/2/54)

This tells us that the block where Unix encountered the error was on
track 728, head 2, sector 54. Since we don't know precisely where the
error was, we'll sniff around the two surrounding tracks for errors.
Enter the surface analysis command:

format> analyze

and then enter "setup" to specify the parameters for the analysis:

analyze> setup
Analyze entire disk [yes]? no
Enter starting block number [0, 0/0/0]: 727/0/0
Enter ending block number [1347704, 744/26/66]: 729/$/$
Loop continuously [no]?
Enter number of passes [2]: 1
Repair defective blocks [yes]? no <========= INCREDIBLY IMPORTANT!!!! <===
Stop after first error [no]?
Use random bit patterns [no]?
Enter number of blocks per transfer [126, 0/1/59]: 1
Verify media after formatting [yes]?
Enable extended messages [no]?
Restore defect list [yes]?
Restore disk label [yes]?

Here we've set up to scan from the start of track 727 through the end
of track 729 (the "$" means "the highest number valid in this field"),
reading single sectors. If we were to use larger transfers, the
precise location of the errors would be indeterminate. IT IS
ABSOLUTELY ESSENTIAL, SURPASSINGLY SO, THAT YOU ANSWER *NO* TO THE
"REPAIR DEFECTIVE BLOCKS" PROMPT. If you fail to do this, the so-called
"read-only" test will go ahead and "repair" blocks on your disc,
possibly causing loss of data in files. So much for reasonable
defaults!

Now select the read-only surface analysis:

analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? yes

This will scan the tracks you've specified. Since we're only looking
at a few tracks, the comment about taking a long time is another lie.
This command should report the individual sectors with errors. If it
doesn't, welcome to the world of transient disc errors. If it does,
note the track, head, and sector numbers of all failing sectors on
paper, then leave the analyse command:

analyze> q

You can then convert those addresses back to block numbers with the
"show" command:

format> show
Enter a disk block: 728/2/22
Disk block = 1317108 = 0x1418f4 = (728/2/22)
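
The arithmetic behind "show" is simple enough to check by hand or in a short
script. The sketch below assumes the triple printed by format is
cylinder/head/sector and takes the geometry from the xd1 banner above (hd 27,
sec 67); the function names are mine, not part of any Sun tool.

```python
# Geometry from the format banner above for xd1:
#   <Fujitsu-M2372K cyl 743 alt 2 hd 27 sec 67>
HEADS = 27     # "hd": heads (tracks per cylinder)
SECTORS = 67   # "sec": sectors per track


def chs_to_block(cyl, head, sec, heads=HEADS, sectors=SECTORS):
    """Convert a cylinder/head/sector address to an absolute block number."""
    return (cyl * heads + head) * sectors + sec


def block_to_chs(block, heads=HEADS, sectors=SECTORS):
    """Convert an absolute block number to (cylinder, head, sector)."""
    track, sec = divmod(block, sectors)
    cyl, head = divmod(track, heads)
    return cyl, head, sec


def scan_range(block, margin=1, heads=HEADS, sectors=SECTORS):
    """First and last block of the cylinders within `margin` of the block."""
    cyl, _, _ = block_to_chs(block, heads, sectors)
    first = chs_to_block(max(cyl - margin, 0), 0, 0, heads, sectors)
    last = chs_to_block(cyl + margin, heads - 1, sectors - 1, heads, sectors)
    return first, last


print(block_to_chs(1317140))      # (728, 2, 54), as "show" reported
print(chs_to_block(728, 2, 22))   # 1317108, as "show" reported
```

scan_range(1317140) gives the block numbers corresponding to 727/0/0 and
729/$/$, the range entered in "setup" above.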

Once you have the failing block numbers in hand, you're done with
format. This example has been for a disc with a single partition that
fills it entirely. If your disc has multiple partitions, you'll have
to convert these absolute block numbers to relative numbers based on
your partitioning of the disc. The partition/print command will show
the current partitioning, which you can use to bias the cylinder numbers
into their partition-relative addresses.
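
As a rough sketch of that conversion, assuming the partition starts on a
cylinder boundary and that you have read its starting cylinder and size in
blocks from partition/print: the starting cylinder 700 and size 77787 used
below are invented for illustration, not taken from any real label.

```python
def partition_relative(abs_block, start_cyl, nblocks, heads, sectors):
    """Convert an absolute block number to a partition-relative one.

    start_cyl and nblocks describe the partition (assumed to begin on a
    cylinder boundary); heads and sectors are the drive geometry.
    """
    offset = start_cyl * heads * sectors   # first absolute block of partition
    rel = abs_block - offset
    if not 0 <= rel < nblocks:
        raise ValueError("block %d is outside this partition" % abs_block)
    return rel


# Hypothetical partition: starts at cylinder 700 and is 43 cylinders long
# (43 * 27 * 67 = 77787 blocks) on the hd 27 / sec 67 drive shown earlier.
print(partition_relative(1317108, 700, 77787, 27, 67))   # 50808
```

For a partition that starts at cylinder 0, such as the whole-disc "c"
partition in the example, the relative and absolute numbers are the same.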

What I-Node Owns That Block?
-----------------------------

On Unix, there is no one-to-one mapping of file names to areas on the
disc, since "hard links" can result in a given disc area belonging to
any number of named files. The Unix object that most closely
corresponds to the notion of a file in most operating systems is
called an "I-Node", and it's expressed as a number. The utility
"icheck", which was part of the semi-automatic assault guru-driven
file recovery facilities of Unix later largely supplanted by "fsck",
has the ability to determine what I-Node points to a given block. If
you know, for example, that blocks 1317108 and 1317110 on disc "xd1c"
contain errors, use the command:

/usr/etc/icheck -b 1317108 1317110 /dev/rxd1c

Bizarre, isn't it? It just scans numbers until it hits the "/" at the
start of the disc name. We specified "rxd1c" because naming the "raw
device" makes icheck run faster.

Icheck will crunch for some time, and if the specified blocks are part
of a file, it will print a line that gives, among other things, the
I-node of the file(s) that contain the given blocks. Note the I-nodes
on your paper, next to the block numbers. If no I-nodes were reported
by this procedure, the error block is not part of any currently
existing file.
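
If you drive these utilities from a script, the argument order described
above (all the numbers first, then the device name) is easy to get wrong. A
minimal sketch that only builds the argument lists, using the paths and flags
exactly as they appear in this memo; the helper names are hypothetical, and
nothing here runs the commands.

```python
def icheck_argv(blocks, raw_device):
    """Argument list for: icheck -b <block> [<block> ...] <raw device>."""
    if not raw_device.startswith("/"):
        # icheck reads numbers until it meets the "/" of the device name
        raise ValueError("device name must start with '/'")
    return ["/usr/etc/icheck", "-b"] + [str(b) for b in blocks] + [raw_device]


def ncheck_argv(inodes, raw_device):
    """Argument list for: ncheck -i <inode> [<inode> ...] -a <raw device>."""
    if not raw_device.startswith("/"):
        raise ValueError("device name must start with '/'")
    return ["/usr/etc/ncheck", "-i"] + [str(i) for i in inodes] + ["-a", raw_device]


print(" ".join(icheck_argv([1317108, 1317110], "/dev/rxd1c")))
# /usr/etc/icheck -b 1317108 1317110 /dev/rxd1c
```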

What File Name(s) Correspond To That I-Node?
--------------------------------------------

With the I-Node number in hand, we can finally find out what file was
hit. If "icheck" has told us the error is in I-Node 87055, we use the
command:

/usr/etc/ncheck -i 87055 -a /dev/rxd1c

to find the file name. After a while, this will print something like:

/dev/rxd1c:
87055 /usr2/kelvin/acadexe/acad

and at last, the inscrutable is unscrewed! The error was in the
AutoCAD executable file, which I can simply re-link. If the file
hadn't been one so easily recreated, it would have to have been
reloaded from the most recent valid backup. Note that if a backup was
made after the error occurred, and that file was present on the
backup, an earlier backup should be used since the copy on the
post-error backup is almost certainly bad.

You can use "ncheck" to search for multiple I-nodes on one pass. For
example:

/usr/etc/ncheck -i 4142 4131 4102 -a /dev/rxd0g
/dev/rxd0g:
4102 /tmp/vm_fonts-n0
4131 /tmp/tty.txt.a00444
4142 /tmp/rmail
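
When many I-nodes are involved, the listing can be turned into a lookup
table. The sketch below assumes the output looks like the sample above (a
device header line followed by "inode path" lines); since hard links can give
one I-node several names, each I-node maps to a list. The function name is
mine.

```python
def parse_ncheck(output):
    """Map each I-node number to the list of path names ncheck printed."""
    names = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue   # skips the "/dev/rxd0g:" header and blank lines
        names.setdefault(int(parts[0]), []).append(parts[1].strip())
    return names


sample = """/dev/rxd0g:
4102 /tmp/vm_fonts-n0
4131 /tmp/tty.txt.a00444
4142 /tmp/rmail
"""
print(parse_ncheck(sample)[4131])   # ['/tmp/tty.txt.a00444']
```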

Repairing And Reloading
-----------------------

After the location and scope of the damage are established, you should
repair the disc errors and restore the damaged files. Since repair
procedures are highly system-dependent and, even on Sun systems,
differ depending on the type of disc controller and drive installed,
you must refer to the hardware documentation for your system for the
appropriate procedures.

Note that the Sun documentation talks about "repairing" sectors with
errors. Nobody I know can say for sure precisely what this means:
whether it's a process of assigning that sector's address to another
sector on an alternate track, clearing its availability bit in the
current bad spot list, marking it in the original defect list, or
what. In addition, the problems I encounter most frequently on hard
discs are destroyed headers due to failed writes (for example, when
the power fails during a write), which are best fixed by reformatting
the area containing the errors rather than discarding sectors which
have no physical defects.

In any case, after you've repaired the problem with the disc, you need
to delete all the files containing destroyed data and reload them from
their most recent backups. As noted above, don't use any backups of
error-containing files made after the error occurred, as they probably
contain the same errors as the disc controller was complaining about.

*********end of memo *********************

--------------------------------------------------------------------
It is time to reformat the disk; the magnetic signal on the disk is getting
weak and/or some spots are going bad.
--------------------------------------------------------------------
I just went through this with a 2372. I ended up declaring the disk
Dead, and replacing it. There are a couple of things you can try:

Unmount any partitions on that disk and run the "format" command.
Choose "analyze" from the main menu, and then choose "read" under the
analyze menu. That will scan the disk, without disturbing the data on
it, and (try to) repair any bad sectors it finds. If that succeeds,
you may be back in business.

If analyze/read doesn't do the trick, you might try reformatting,
though if read can't repair a problem, I don't know if reformatting
will repair it either.
--------------------------------------------------------------------
        This is a problem that I have been fighting for a long time!
Sun has replaced drives for me several times, but the problem
always comes back! For us, the drives eventually become inaccessible.
Sun finally removed some baffling in the drive pedestal, and we pulled
the unit out away from the wall. The problem is under control now, but
if the temperature in the lab where the unit resides goes over 74 deg. F,
we get the errors. Sun has no idea what to do at this point. If the
errors come back again, I am going to have Sun replace the pedestal. They
have worked on this unit several times over the last 6 months.

        I have heard of other people with 4/370's having the same problems,
but have never heard of a resolution. I think this is a chronic problem
with this system!
--------------------------------------------------------------------
It sounds like you may be starting to get a problem spot on the disk.
I suggest that you use the format program to analyze (with repair) the
area around the reported abs blk (the block # reported in the message
is approximate, so use a reasonably wide range). If you catch the
problem now and stop using the area (format will arrange spares for
you automatically), then your problems *may* be solved.

With the "setup" command in analyze, you can specify the range of
blocks to be checked and the granularity of each test. Make it small
(this is slow, but if you limit the range to be checked, it won't be
too bad), and set it to automatically repair any problems.

Use the "test" mode (pattern testing) of the analyze command which
does require that the disk be unmounted (all partitions), but won't
harm the data that is there (unless the errors are unrecoverable, in
which case the data is already lost).

You should of course backup the disk first, but you probably won't
lose any data if you give the right commands to the format program.
--------------------------------------------------------------------
Reformatting will help. If that's too much bother, do a backup of
the partition(s) containing the flagged block(s), and then do a
5-pass surface analysis of those cylinders.
--------------------------------------------------------------------
1. You will probably have to reformat the disk to map out the bad blocks.
2. You will quite likely have to replace the disk in the near future.
Every time I saw these errors crop up on Eagle (2351) or Double Eagle
(2361) drives, it preceded their demise. Keep your backups up-to-date :)
--------------------------------------------------------------------
I have encountered the same problem on occasion. My solution was to
reformat the drive. That has cured the problem for over 1 year.
Sun's solution (hardware support) is to replace the drive.
--------------------------------------------------------------------
Yes, this would indicate that at least a few good analyze passes may be in
order. You may not need to format. You can probably repair these using the
format program.
One thing to be certain of is that you have given format the CORRECT number
of bytes/sector or bytes/track. Without the correct info, the proper block
cannot be located, and you will not repair the right sectors.

This is a symptom of new bad blocks found due to a disk hit. You may
want to reformat the drive, but first try to repair it by going into the
analyze menu of format. Select the setup option to tell format to test only
blocks 1260000 to 1270000. Then run the read test. It should find the
bad blocks and fix them.
___________________________________________________________________________
We had this problem too, about half a year ago, on a 4/260.
We reformatted the disk and have had no problems since then.
But just before the problem came up, the temperature in our computer room
was rather high. Now we have air conditioning, and we think that is
the reason why it works better now.
___________________________________________________________________________

Thanks to the numerous people who answered.

fabrice
fabrice@yosemite.atmos.ucla.edu


