SUMMARY: StorEdge A1000 drive replacement

From: Josh Glover <jmglov_at_incogen.com>
Date: Fri Oct 18 2002 - 14:43:42 EDT
We had a drive fail in our StorEdge A1000 (running RAID 5), and my question
was, how do I replace it?

Thanks to Julie Firmin, who told me that it was as simple as pulling the
failed drive and replacing it with the new one.

Thanks to Tony Walsh, who told me the same thing as Julie, but went into
excellent detail on some of the problems that might be encountered. His
advice that helped me the most was:

"remove the 1,1 failed drive from the array, wait 10-15 seconds, and then
 replace with the new drive. You should then wait approx 30 seconds before
 you physically do anything else with the array (ie. don't remove the new
 drive before it has had a chance to spin up and be integrated into the
 array). You need to wait this long for the "dacstore" area on the drive to
 be updated with the existing array configuration. You need to do this
 operation with power still applied to the array so that the current DAC
 information is applied to the new drive."


He also explained how I could do this using the RAID Manager 6 GUI, but I do
not believe in X11 on servers, so that was not an option for me. (In case
someone out there is reading this message in the list archives and would
prefer to use the GUI, the long and short of it is, start the GUI and use the
Recovery option, which walks you through the process, Wizard-style.)

After the replacement, he continues:

"As a result of either of these actions, you should see the LEDs for the
 drives in 2,5 (my hot spare) and 1,1 (my failed drive) flashing fairly
 constantly for some time (2-3 hours or longer is quite possible for a 36GB
 drive). The process happening at this point is the hot spare is being
 released by the process of copying all the data on 2,5 back to 1,1. (FYI You
 could still have lost one more drive in this configuration without losing
 any data as the RAID 5 layout will run in a degraded mode without having a
 hot spare to swap to and the data will remain good)."
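
The copy-back can be watched from the command line instead of by the LEDs. What
follows is only a minimal sketch, under several assumptions: a POSIX shell (ksh
or /usr/xpg4/bin/sh rather than the legacy Solaris /bin/sh), an awk that
accepts -v (nawk on Solaris), and that the replaced drive reports "Optimal"
again once the hot spare has been released. The only status strings confirmed
in this message are the ones in the drivutil -i listing further down. The names
drive_status and wait_until_optimal are hypothetical helpers, not part of RM6.

```shell
#!/bin/sh
# drive_status LOCATION
# Reads `drivutil -i` output on stdin and prints the Status column for the
# drive at LOCATION, written exactly as in the listing (for example "[1,1]").
# Prints nothing if the location is not found.
drive_status() {
    # On Solaris, substitute nawk if the default awk rejects -v.
    awk -v loc="$1" '$1 == loc { print $3 }'
}

# wait_until_optimal MODULE LOCATION [INTERVAL_SECONDS]
# Polls `drivutil -i MODULE` until the drive at LOCATION reports Optimal.
# ASSUMPTION: "Optimal" is the status shown after the copy-back completes.
wait_until_optimal() {
    module=$1
    loc=$2
    interval=${3:-300}
    while :; do
        status=$(drivutil -i "$module" | drive_status "$loc")
        echo "$(date) $loc: ${status:-unknown}"
        [ "$status" = "Optimal" ] && break
        sleep "$interval"
    done
}

# Example (not run here): wait_until_optimal fd026_002 '[1,1]' 600
```

Polling every few minutes is a deliberate choice: with a rebuild expected to
take hours, there is little to gain from checking more often.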


Finally, he suggests applying some patches (which I had already done prior to
this issue):

"As a further recommendation (after you have fixed this problem), I would
 advise you to upgrade your RM6 version to 6.22.1 with the appropriate patch
 112126-05 (for Solaris 8 or 9) or 112125-04 (for Solaris 2.6 or 7). When you
 do this, make sure you perform the firmware flash upgrade and the NVSRAM
 upgrade on the array as soon as you can (Use the RM6 gui for the best
 results). The NVSRAM upgrade file is called "sie3240c.dl" and should be
 found in /usr/lib/osa/fw/ after RM6.22.1 has been installed."
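
To check whether a host already carries the patch revision Tony mentions, the
output of showrev -p can be filtered. This is a rough sketch, assuming that
showrev -p prints one line per patch beginning with "Patch: NNNNNN-RR", that a
POSIX shell is used (ksh or /usr/xpg4/bin/sh rather than the legacy Solaris
/bin/sh), and that awk accepts -v (nawk on Solaris). The function names
installed_rev and patch_at_least are hypothetical.

```shell
#!/bin/sh
# installed_rev BASE
# Reads `showrev -p` output on stdin and prints the highest installed
# revision of patch BASE as a plain integer (5 for 112126-05).
# Prints nothing if the patch is not installed.
installed_rev() {
    # On Solaris, substitute nawk if the default awk rejects -v.
    awk -v base="$1" '
        $1 == "Patch:" {
            split($2, id, "-")          # id[1] = base ID, id[2] = revision
            if (id[1] == base && (!found || id[2] + 0 > best + 0)) {
                best = id[2]
                found = 1
            }
        }
        END { if (found) print best + 0 }
    '
}

# patch_at_least BASE MINREV
# Reads `showrev -p` output on stdin; succeeds if BASE is installed at
# revision MINREV or later.
patch_at_least() {
    rev=$(installed_rev "$1")
    [ -n "$rev" ] && [ "$rev" -ge "$2" ]
}

# Example (not run here), for Solaris 8 or 9:
#   showrev -p | patch_at_least 112126 5 && echo "RM6 patch is current"
```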


With Tony and Julie's great advice, the replacement went off without a hitch.
Thanks, guys!


My original message follows:
-------------------------------------------------------------------------------

Our StorEdge A1000 recently lost a drive. Luckily, we had set it up to use a
hot spare, and the spare took over, allowing the RAID to stay up and
functioning.

Sun is sending a replacement drive, which should be here in a day or two. My
question is, when said drive arrives, what is involved with replacing it?

The A1000 has hot-swappable SCSI drives, so we can definitely physically
replace the bad drive while the array is up. From what I am reading, once the
new drive is in there, we just need to unfail the drive (unless that is
automatic?), and the array should reconstruct the data.

We are using RAID Manager 6.22, and here is the output of
drivutil -i fd026_002:


Drive Information for fd026_002


Location  Capacity   Status         Vendor  Product          Firmware
            (MB)                              ID             Version
[1,0]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[2,0]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[1,1]     34732      Failed         FUJITSU MAN3367M SUN36G  1502
[2,1]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[1,2]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[2,2]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[1,3]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[2,3]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[1,4]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[2,4]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[1,5]     34732      Optimal        FUJITSU MAN3367M SUN36G  1502
[2,5]     34732      Spare[1,1]     FUJITSU MAN3367M SUN36G  1502

drivutil succeeded!
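
On a larger array it can help to pull out just the drives that need attention.
A minimal sketch, assuming the column layout shown in the listing above
(location, capacity, status, then vendor details); list_unhealthy is a
hypothetical helper name, not an RM6 command.

```shell
#!/bin/sh
# list_unhealthy
# Reads `drivutil -i` output on stdin and prints "LOCATION STATUS" for every
# drive row whose Status column is anything other than "Optimal".
list_unhealthy() {
    awk '
        # Drive rows begin with a bracketed location such as [1,1];
        # headers, blank lines and the trailing status message are skipped.
        /^\[[0-9]+,[0-9]+\]/ {
            if ($3 != "Optimal") print $1, $3
        }
    '
}

# Example (not run here): drivutil -i fd026_002 | list_unhealthy
```

Fed the listing above, this would report the failed drive at [1,1] and the
hot spare at [2,5] that is standing in for it.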


If I am reading the man page right, all we should have to do after replacing
the failed drive ([1,1]) is to run the command:

drivutil -U 11 fd026_002

This should, according to the man page, unfail the drive and reconstruct the
data (we are running this as RAID 5).

If anyone has done this before, I would appreciate some feedback. Please do
tell me if I need to take the array offline, back up data, anything like that.


--
Josh Glover <jmglov@incogen.com>

Associate Systems Administrator
INCOGEN, Inc.
http://www.incogen.com/

GPG keyID 0x62386967 (7479 1A7A 46E6 041D 67AE  2546 A867 DBB1 6238 6967)
gpg --keyserver pgp.mit.edu --recv-keys 62386967

_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers