SUMMARY: SPARCstation LX no longer boots - bad magic number

From: Tilman Sommer soft (sommer@vsun02.ag01.Kodak.com)
Date: Tue Jan 31 1995 - 20:50:22 CST


Hi,

Many thanks for all the responses I got! We replaced the workstation
with another one in the meantime, and somebody else is trying from time
to time (no more pressure) to repair the defective one. As it looks
today, the disk was probably damaged when the workstation was switched
off without being shut down first. When I know the real cause
of the problem, I'll let you know. I thought it made more sense to post
this SUMMARY now instead of waiting any longer for the explanation to
show up.

The answers I got contained helpful suggestions and information on what
to try to regain access to the disk (which has failed so far in our case).
I have attached the answers I got; however, I removed some lines from
most mails (like signatures, ...) so that this summary doesn't
become too lengthy.

Most suggestions were to boot from CD; various methods of then
accessing the disk were mentioned.

Thanks again for the prompt help!
-Tilman

The problem reported was:
After a power-cycle of a hung workstation, one of our standalone
LXs does not boot anymore:

         /iommu/sbus/espdma@4,8400000/esp@4,8800000/sd@3,0
         Bad magic number
         Can't open disk label package
         Can't open boot device.

----------------------------------------------------------------------------
#### ###### Tilman C. Sommer, OI-P+E, Software&Integration Group (SIG)
### ######## Kodak AG, Breitwiesen, D-73347 Muehlhausen/Gruib., Germany
## ########## Mailcode: 5023, Building 213/OG
# ### KODAK ## Fon : ++49(7335)12-7677 Fax: x7766
## ########## KNET Fon: 631-7677 Fax: x7766 KMX: 631-7677
### ######## PROFS : (K2057)KOKAG or : (SOMMER)EKSMTP (Internet)
#### ###### Internet: sommer@vsun02.ag01.kodak.com
----------------------------------------------------------------------------

-------------------------------------------------------------------------------
>From: vuppala@cps.msu.edu (Vasu)
try removing the disk and shaking/tapping a bit.
i have run into this problem but always got out of it. it generally
happens if you power off the machine.
-------------------------------------------------------------------------------
>From: yamaguch@cqt.com (Bob Yamaguchi)
You may have corrupted /vmunix or /boot. If that is the case you can restore
them by first booting from cd-rom. From there, mount the disk's / partition
and copy over the files. Here are the steps from the FAQ:

        1) mount /dev/sd?a /mnt
        9) # additional root setup
                #
                cp /mnt/usr/kvm/stand/vmunix /mnt
                chmod go-x /mnt/vmunix
                cp /mnt/usr/kvm/stand/boot.sun4 /mnt/boot
        10) # run installboot
                # I suggest reading the man page, note paths below, take care
                # to run the installboot binary of the running OS, but specify
                # device and paths for the new boot disk. You can even do this
                # from a Sun3.
                #
                /usr/kvm/mdec/installboot -ltv /mnt/boot \
                                 /mnt/usr/kvm/mdec/bootsd /dev/rsd?a
 
                # if you are making a sun4c boot disk and running on a 4 or 4m
                # machine:
                /usr/kvm/mdec/installboot -ltvh /mnt/boot \
                                /mnt/usr/kvm/mdec/bootsd /dev/rsd?a
 
                # if you are making a sun4 boot disk on a 4c machine, you need
                # to use a 4c installboot, not the one on your sun4.
-------------------------------------------------------------------------------
>From: Greg Coleman <gtech@cnj.digex.net>
The best I can tell you is that this is the exact message I
get if I try to boot off a device I haven't run 'installboot' on
yet.
Which one in show-devs? Was it one that you previously booted
off? I usually have two separate roots. My backup root I made
with ufsdump | ufsrestore and then installboot. At the prom level
I run devalias to say .... ok> devalias disk5 /iommu/.......
then ok> setenv boot-device disk5 AND most important, do a ok> reset.
If I just run boot at that point it still tries to boot off whatever
boot-device was originally set to. I remember that message because
I would still get it after all that prom stuff. Once I remembered
to run installboot, everything was cool.
Anyway, the 2 things I can think of are,
1) you never ran installboot on your backup, or you did run it
and your bootblocks are simply trashed.
2) your backup is fine but when at the prom level you are still
somehow booting off the primary (possibly munged) root device.
Oh yeah, one other; are both devices off the same controller?
Maybe the disks are fine but the controller is hosed. boot can't
get out there to read the bootblocks????
[..]
-------------------------------------------------------------------------------
>From: bukys@cs.rochester.edu
Here's what I would do:

(1) do "probe-scsi" from the PROM Monitor to determine whether the
    drive is talking to the SCSI bus. If not -> dead controller on
    the drive (or cabling problem).
(2) if the drive reports itself present to the SCSI bus, verify that
    it actually spins up, by putting your hand on it during power-up.
    If not, it could be stiction, which means you either throw the drive
    away, or whack it on the side with a hammer to get it going (which
    may or may not destroy the drive).
(3) Otherwise, perhaps it's just the data on the drive that got smashed.
    Start the system some other way (diskless boot, cd-rom, external disk)
    and see what the format program says when you try to re-format and
    re-label the drive.
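
A minimal sketch of the triage order above, assuming the operator supplies
the two observations by hand. The function name `triage` and its "yes"/"no"
arguments are illustrative only; nothing here probes the hardware.

```shell
#!/bin/sh
# Encode the three-step triage as a decision function. Arguments are the
# operator's own observations ("yes" or "no"); nothing is probed automatically.
triage() {
    seen_on_scsi=$1   # did probe-scsi at the PROM list the drive?
    spins_up=$2       # could you feel the drive spin up at power-on?
    if [ "$seen_on_scsi" != yes ]; then
        echo "check cabling; suspect the controller on the drive"
    elif [ "$spins_up" != yes ]; then
        echo "possible stiction"
    else
        echo "boot from other media and inspect the disk with format"
    fi
}

# Example: drive answers on the bus and spins up
triage yes yes
```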
-------------------------------------------------------------------------------
>From: miket@ice9.hq.ileaf.com (Michael Thibodeau x3716)
At first glance, the boot block is damaged.
boot it up off of a cdrom, and then run installboot to put a new bootblock onto
the root filesystem.
-------------------------------------------------------------------------------
>From: Gary.Richardson@proteon.com (Gary Richardson)
If it's Solaris 1.x, it could be something with the files vmunix or
boot, or the bootblock itself. Boot up under cdrom (miniunix) and
try to reinstall the bootblock. It's something like this:

- boot off cdrom
- mkdir /mnt
- mount /dev/sd0a /mnt
- cd /usr/mdec
- installboot /mnt/boot bootsd /dev/rsd0a

Then try and reboot. It might be that you'll have to restore the vmunix
and/or boot files. If you restore the boot file, you'll need to do
the procedure above.

If it's Solaris 2.x, then I'm not sure what it could be. I haven't
been overly exposed to that stuff yet.
[..]
-------------------------------------------------------------------------------
>From: "David Beary" <"beary david"@a1.meoc02.sno.mts.dec.com>
This sounds like the classic 'boot-block' failure. You can often get this error
when the boot block which loads the kernel is corrupt. The best way to find out
would be to boot the O.S. from CDROM and re-install the boot-block (installboot
with SunOS 4.1.x - not sure about SunOS 5.x).
Then try rebooting from disk and see what happens...
The superblock may be corrupt - to correct this you can boot from CDROM,
enter 'format', then run 'backup superblock' from the format menu.
-------------------------------------------------------------------------------
>From: ken@cpatl.com (0000-Admin(0000))
You have lost the information on the disk that describes itself. If you know
the disk type and partitioning as it existed ( exactly ) before the crash
you can go into format and supply this information. Then fsck the raw device
for each partition on the disk.
To get to the disk boot off the network or off a CD.
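
A minimal sketch for checking whether block 0 still carries a label,
assuming the classic 512-byte Sun disk label with the 16-bit magic 0xDABE
stored at byte offset 508. The function name `check_label_magic` is made
up for illustration, and `od -An -tx1` is the POSIX form; the od on an old
miniroot may need different flags.

```shell
#!/bin/sh
# Report whether a disk (or an image file of it) has the Sun label magic
# in its first 512-byte block. Assumes magic bytes 0xDA 0xBE at offset 508.
check_label_magic() {
    dev=$1
    # Read one whole sector first (raw devices want sector-sized reads),
    # then pick out the two magic bytes and print them as hex.
    magic=$(dd if="$dev" bs=512 count=1 2>/dev/null \
            | dd bs=1 skip=508 count=2 2>/dev/null \
            | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "dabe" ]; then
        echo "label magic present"
    else
        echo "bad magic number (found: ${magic:-nothing})"
        return 1
    fi
}
```

It would be called with the whole-disk raw device from the rescue system,
e.g. `check_label_magic /dev/rsd0c` (the device name is an assumption). It
only reads the disk, so it is safe to try before relabeling.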
-------------------------------------------------------------------------------
>From: zshouben@PCS.CNU.EDU (Shouben Zhou)
        1) Boot miniroot from cdrom or tape.
        2) Type format command, if you can see that disk, you
        are halfway lucky.
        3) Use fsck to check all the file systems in that disk.
        If no error has been reported, go ahead and install the bootblock
        into your / partition. You are all set; reboot your system!
otherwise let me know what kind of error message you have got.
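
A small sketch for step 3, assuming SunOS 4.x style raw device names
(/dev/rsd0a and so on). The function `fsck_commands` is hypothetical and
only prints the commands, so the list can be reviewed before anything is
run; which partitions actually hold file systems is an assumption you have
to supply.

```shell
#!/bin/sh
# Print one fsck command per partition of a disk (dry run only).
# $1 is the disk name, the remaining arguments are partition letters.
fsck_commands() {
    disk=$1
    shift
    for part in "$@"; do
        echo "fsck /dev/r${disk}${part}"
    done
}

# Example: disk sd0 with file systems on partitions a, g and h
fsck_commands sd0 a g h
```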
-------------------------------------------------------------------------------
>From: "Michael (M.A.) Meystel" <MEYSTMA@DUVM.BITNET>
Could be the boot block. run installboot (the proper procedure is in
the manual) and see if that doesn't fix it.

Boot from cdrom and mount some partition(s) from the disk to see if
data is still there. Run fsck on the disk.
-------------------------------------------------------------------------------
>From: jim@biz.trib.com (Jim Overeem 571)
if you try to do a probe-scsi do you get a good revision number?
if you get something like rev 4.01r the controller is toasted.
most times you can power down and reboot and the system will fsck
itself. or try booting single user.
-------------------------------------------------------------------------------
>From: Dan Stromberg - OAC-CSG <strombrg@hydra.acs.uci.edu>
It might be dead - but you should try running installboot on the disk,
after booting from some external media.

...that is, you should do that, if "probe-scsi" at the prom sees the
disk. If not, it's probably toast.
-------------------------------------------------------------------------------
>From: paulo@dcc.unicamp.br (Paulo Licio de Geus)
Booting from CDROM and running installboot might help. Also, if you
haven't power-cycled the machine, do a reset at the monitor prompt.
-------------------------------------------------------------------------------
>From: Markus Storm <storm@uni-paderborn.de>
| /iommu/sbus/espdma@4,8400000/esp@4,8800000/sd@3,0
| Bad magic number
| Can't open disk label package
| Can't open boot device.

[..translated into English, original was in German..]

The above message is also displayed if you dump a disk but don't explicitly install
the boot block. The crash might have caused the corruption of the boot block?

One can recreate the boot block by using "installboot".
SunOS 4.1.X: /usr/kvm/mdec/installboot. See "man installboot(8S)".
Solaris 2.X: /usr/sbin/installboot. See "man installboot(1M)".
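
A minimal sketch that picks between the two paths quoted above from an OS
release string such as the one `uname -r` prints. The function name
`installboot_path` is made up for illustration; the mapping of 4.x and 5.x
release numbers to SunOS 4.1.X and Solaris 2.X is the usual one.

```shell
#!/bin/sh
# Map an OS release string to the installboot path for that release.
# Anything other than 4.x or 5.x is reported as unknown.
installboot_path() {
    case "$1" in
        4.*) echo /usr/kvm/mdec/installboot ;;   # SunOS 4.1.X
        5.*) echo /usr/sbin/installboot ;;       # Solaris 2.X (SunOS 5.x)
        *)   echo "unknown release: $1" >&2; return 1 ;;
    esac
}

# Example: Solaris 2.3 reports release 5.3
installboot_path 5.3
```

On the rescue system itself this would be `installboot_path "$(uname -r)"`.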
-------------------------------------------------------------------------------
>From: Ruben Ruiz <rruiz@amadeus.spin.com.mx>
    Maybe you should try using another superblock for checking
your disk. I have had this kind of problem in the past and I have
solved it by doing the following:

   # newfs -Nv /dev/rdsk/your-disk
  
   This won't create a new filesystem, since you are giving the -N
option. It will only tell you the parameters of your disk as if you were
going to create a filesystem. You have to write down 2 or 3 of the
superblocks that this command tells you are in your filesystem.
Then, you should use fsck telling it to use one of those superblocks. It
seems like the first superblock you have is wrong, so give it a try with
the second or third. The fsck command should be as follows:

  # fsck -F ufs -o b=superblock-number /dev/rdsk/your-disk

   This works in Solaris 2.3; I am not sure if it works in SunOS 4.x.
Please take a look at the manual pages for newfs and fsck. Good Luck!
Ruben Ruiz, Inttelmex
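
A minimal sketch for collecting the backup superblock numbers instead of
writing them down by hand. It assumes the `newfs -N` output lists them on
the lines following one that contains "super-block backups"; check this
against what your release actually prints. The function name
`backup_superblocks` and the `DISK` placeholder are illustrative.

```shell
#!/bin/sh
# Extract backup superblock numbers from `newfs -N` output read on stdin.
# Assumes the numbers follow a line containing "super-block backups".
backup_superblocks() {
    sed -n '/super-block backups/,$p' | sed '1d' | tr -cs '0-9' '\n' | sed '/^$/d'
}

# Hypothetical usage on the rescue system (not run here); DISK stands for
# the raw device of the damaged file system:
#   for sb in $(newfs -Nv "$DISK" | backup_superblocks); do
#       fsck -F ufs -o b="$sb" "$DISK" && break
#   done
```

The commented loop stops at the first alternate superblock that fsck
accepts, which matches the advice to try the second or third one.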
-------------------------------------------------------------------------------
>From: johnm@sse.com (John Malick)
Try to boot from the cdrom into single user mode and run installboot. If that
does not resolve the problem try to copy a new boot file to the root directory.

This means that while booted in single-user mode from the cd you can
at least mount the root and/or user partitions.
-------------------------------------------------------------------------------
>From: Jamie Grant <gw3874@nomura.co.uk>
The answer to this problem is that the disk label has been corrupted.
To get around this problem, run format from cdrom. Select the drive
and then select backup; this will search for a back-up disk label. If
this doesn't work, then you will have to type select and choose the
partition map for your disk, but beware: ONLY do this as a LAST resort
and after trying the backup command, as it will DESTROY data on the
disk.
-------------------------------------------------------------------------------
>From: "Henry Katz" <hkatz@lehman.com>
Have you tried booting the LX off a local CDROM and mounting the local file
systems and rewriting the disk label with the format program? or even fsck'ing
the local file systems in this situation?
-------------------------------------------------------------------------------
>From: Dave Woodruff <dcw@WLV.IIPO.GTEGSC.COM>
The only straightforward way to deal with this is to attach the
CDROM drive, put in the system CD and boot from that. Then the boot
sequence will tell you if it can find any response from the old disk's
SCSI target id and what it sees there. Depending on that result, you
can then use the <format> program off the CD to either re-label or
re-format the disk. If re-labelling seems to be called for, remember
to look for back-up labels on the disk in case the actual labelling
did not match the default from the format.dat on the CD!
     (If this was all already obvious to you, kindly ignore it; some of
the posters here seem to feel that <format> is entirely magic!)
-------------------------------------------------------------------------------
>From: yves@suntech.abcomp.be (Yves Hardy)
I already encountered the same problem with Solaris 2.3. In fact,
when the system is booting, it gives this message and the boot fails:

bad magic number on disk label
..............................

Solution :
--------

          Load and boot Munix and go into format. Use the "backup" menu option
to search for a backup label. If one is found, it restores the label. If one is
not found, verify the partition map and the defect information and then relabel
the disk.
It is vital that the partition map and defect list are correct before labeling.
After that, you can reboot your system and I think it will work.

Solution 2 :
----------

           If you made a backup of the root and /usr partitions, you can restore
them via the cdrom and then create a new boot block like this :

ok boot cdrom
.......
.......
.......
<Exit Install>

# newfs /dev/rdsk/c0t3d0s0 <-- disk target 3 (internal disk)
# fsck /dev/rdsk/c0t3d0s0
# mount /dev/dsk/c0t3d0s0 /a
# cd /a
# ufsrestore rvf /dev/rmt/0
# rm restoresymtable
# cd /
# umount /a
# fsck /dev/rdsk/c0t3d0s0

Re-install the boot block :

# cd /usr/lib/fs/ufs
# installboot bootblk /dev/rdsk/c0t3d0s0
# reboot
-------------------------------------------------------------------------------
>From: clive@inteleq.com (Clive Beddall )
Try to have a look at this disk from cdrom...miniroot etc.
There is magic there as you can create a new bootable partition on that very
disk...or not. Plus you can run format from cdrom to see if labels are intact
etc.
-------------------------------------------------------------------------------
END OF SUMMARY


