SUMMARY: Filesystem-Partition Strange Behavior

From: Nikos Manousakis (nmanou@leon.nrcps.ariadne-t.gr)
Date: Fri May 05 1995 - 01:47:01 CDT


Hi Sun Managers,

I apologize for taking so long to summarize, but after solving
my problem I was too busy, and then I had my Easter vacation.
Anyway, here it is now.

MY ORIGINAL REQUEST:
====================

>Hello there,
>
>I am using the following machine configuration:
>
>#uname -a
>SunOS naxos 4.1.3_U1 3 sun4m
>#dmesg
>...
>sd0 at esp0 target 3 lun 0
>sd0: <SUN0535 cyl 1866 alt 2 hd 7 sec 80>
>sd1 at esp0 target 1 lun 0
>sd1: <SEAGATE ST31200N cyl 2724 alt 2 hd 9 sec 84>
>...
>
>Two days ago (Mar 29) I ran ls -lg on the root directory and, among other entries, got:
>
>#ls -lg
>...
>drwxr-xr-x 11 root wheel 512 Mar 30 18:33 usr1/
>-r-S-----T18516 21364 29285 1634956090 Mar 29 2029 usr2
>...
>
>The /usr1 and /usr2 were originally designed to be two filesystems mounted
>from partitions /dev/sd1a and /dev/sd1b respectively from my external hard
>disk (supplied by Artecon, type: Seagate ST31200N, capacity: 2061108 blocks
>(512 Byte), i.e. 1.055GB).
>
>Examining the ls output, you can see that usr2 is no longer a directory; rather,
>it has been transformed into a file. Thus, my subdirectories there are
>inaccessible. Moreover, the size of the file is reported as 1.6GB, and any
>attempt to read any portion of it produces panic messages and causes the
>system to reboot. I have already checked all the daemon-produced messages and
>nothing strange has been recorded in their files. Only the audit daemon
>returns a message because it can't find the dir entry in the audit_control
>file. This is logical since /etc/security/audit is linked to /usr2/audit.
>
>So, I have two questions:
>1) Why did this happen?
>2) How can I fix it?
>
>Regarding the first question, I include some output I got running several
>commands (my comments are in parentheses):
>
>#ls -lg (the bad situation)
>...
>drwxr-xr-x 11 root wheel 512 Mar 30 18:33 usr1/
>-r-S-----T18516 21364 29285 1634956090 Mar 29 2029 usr2
>...
>
>#df (the numbers of /usr2 are OK!!)
>Filesystem kbytes used avail capacity Mounted on
>/dev/sd0a 13423 8879 3202 73% /
>/dev/sd0g 320582 218291 70233 76% /usr
>/dev/sd0h 93503 44344 39809 53% /home
>/dev/sd1a 482782 25841 408663 6% /usr1
>/dev/sd1b 483926 153878 281656 35% /usr2
>
>#umount /usr2
>#mkdir /usr3
>#ls -lg (usr2 as a directory)
>...
>drwxr-xr-x 11 root wheel 512 Mar 30 18:33 usr1/
>drwxr-xr-x 2 root daemon 512 Feb 20 11:54 usr2/
>drwxr-xr-x 2 root daemon 512 Mar 30 14:00 usr3/
>...
>#mount /dev/sd1b /usr3 (try another directory-work with usr3 from here onwards)
>#ls -lg (the same bad situation)
>...
>drwxr-xr-x 11 root wheel 512 Mar 30 18:33 usr1/
>drwxr-xr-x 2 root daemon 512 Feb 20 11:54 usr2/
>-r-S-----T18516 21364 29285 1634956090 Mar 29 2029 usr3
>...
>
>#df (/usr3 looks OK!)
>Filesystem kbytes used avail capacity Mounted on
>/dev/sd0a 13423 8888 3193 74% /
>/dev/sd0g 320582 218291 70233 76% /usr
>/dev/sd0h 93503 44344 39809 53% /home
>/dev/sd1a 482782 25841 408663 6% /usr1
>/dev/sd1b 483926 153878 281656 35% /usr3
>
>#chmod 666 usr3 (this doesn't reboot the system)
>#ls -lg
>...
>drwxr-xr-x 11 root wheel 512 Mar 31 17:00 usr1/
>drwxr-xr-x 2 root daemon 512 Feb 20 11:54 usr2/
>-rw-rw-rw-18516 21364 29285 1634956090 Mar 29 2029 usr3
>...
>
>#file usr3
>usr3: strange file with mode=30666
>
>#du -a usr3 (in kilobytes!!)
>2503866 usr3
>
>#tar cvf /home/usr3.tar usr3 (try to make an archive)
>tar: usr3 is not a file. Not dumped
>
>#dump 0f /usr1/usr3.dump /dev/sd1b
> DUMP: Date of this level 0 dump: Fri Mar 31 16:40:48 1995
> DUMP: Date of last level 0 dump: the epoch
> DUMP: Dumping /dev/sd1b to /usr1/usr3.dump
> DUMP: mapping (Pass I) [regular files]
> DUMP: mapping (Pass II) [directories]
> DUMP: estimated 487300478 blocks (237939.69MB) on 1067.23 tape(s).
> DUMP: dumping (Pass III) [directories]
> DUMP: DUMP: bread: lseek fails
> DUMP: (This should not happen)bread from /dev/sd1b [block -1867998568]: count=8192, got=0
> DUMP: (This should not happen)bread from /dev/sd1b [block -1732343652]: count=6144, got=0
> DUMP: bread: lseek fails
>(... The last four messages are repeatedly occurring ...)
> DUMP: NEEDS ATTENTION: Do you want to abort dump?: ("yes" or "no")
>(... until I give "yes" to the previous question ....)
> DUMP: The ENTIRE dump is aborted.
>(... and dump is aborted.)
>
>
>
>Regarding the second question, I haven't run fsck or format yet. So, please
>give any valuable suggestions to overcome my problem.
>
>Two notes before closing: a) the hard disk hasn't been removed or disconnected
>by any means, and b) no power failure has happened in the last few days (we have
>a UPS installed anyway).
>
>Thanks in advance to all of you for your help and valuable cooperation,
>
>nmanou (nmanou@leon.nrcps.ariadne-t.gr)
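
The figures quoted above can be cross-checked with a little arithmetic. The
sketch below assumes a modern bash (not the SunOS 4.1.3 /bin/sh). It compares
the size that ls reports for the entry with the capacity that df reports for
the partition, and extracts the file-type bits from the mode printed by file.
The function and variable names are illustrative assumptions, not part of the
original session.

```shell
#!/usr/bin/env bash
# Figures copied from the df, ls and file output quoted above.
part_kbytes=483926        # df: total size of /dev/sd1b in kilobytes
file_bytes=1634956090     # ls -lg: size reported for the "file" usr3
file_mode=030666          # file: mode after chmod 666 (leading 0 = octal)

# Succeeds when one entry claims more bytes than the whole partition holds.
size_is_impossible() {
    local bytes=$1 kbytes=$2
    (( bytes > kbytes * 1024 ))
}

# Print the file-type bits of a mode (mask 0170000, i.e. S_IFMT) in octal.
type_bits() {
    printf '%o\n' $(( $1 & 0170000 ))
}

if size_is_impossible "$file_bytes" "$part_kbytes"; then
    echo "size $file_bytes exceeds partition capacity $(( part_kbytes * 1024 ))"
fi
echo "type bits: $(type_bits "$file_mode") (a directory would be 40000)"
```

The reported size is larger than the partition that is supposed to contain
it, and the type bits come out as 30000, which matches none of the standard
values (40000 for a directory, 100000 for a regular file, and so on). Both
are consistent with file calling the entry a "strange file", and with the
inode holding garbage rather than a merely wrong file type.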

MY THANKS GO TO:
================
gdonl@gv.ssi1.com (Don Lewis)
kevin@uniq.com.au (Kevin Sheehan)
strombrg@hydra.acs.uci.edu (Dan Stromberg)
wtilford@lgc.com (Tilford Wayne)
bader@nadc.nadc.navy.mil (K. Bader)

SUGGESTED SOLUTIONS:
====================
1.
gdonl@gv.ssi1.com

Don suggested: ''Reboot single user, run fsck on /dev/sd1b and pray''.
He also added:
''You might dd /dev/rsd1b off to tape as insurance.
        dd if=/dev/rsd1b of=/dev/rstwhatever bs=56k
It looks like the root inode on sd1b got corrupted somehow. Ordinarily
when a directory is corrupted like this, fsck will stick its contents in
the lost+found directory, but since the root directory is corrupted, the
lost+found directory isn't available, so I'm not sure what fsck will do.
What you want it to do is to nuke the inode for the root directory,
re-create the root directory and lost+found, then stick all the files
and directories from the old root directory back in lost+found. If you
can get to that point, you should be able to decipher the contents of
lost+found and mv them back to their proper names under the root of the
filesystem.

If fsck trashes the disk, then you can run dd in the reverse direction
to restore it and try again. The next strategy would be to attack the
disk with a binary editor and create an empty root directory before
running fsck to re-connect the rest of the filesystem.''
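
Don's insurance step amounts to taking a raw image before fsck is allowed to
modify anything, and copying it back if the repair goes wrong. A minimal
sketch of that round trip follows, assuming bash and using scratch files as
stand-ins for /dev/rsd1b and the tape device. copy_image is a hypothetical
helper, and the cmp verification is an addition not found in Don's reply; it
is practical only where the copy can be read back, as with an image file.

```shell
#!/usr/bin/env bash
# copy_image SRC DST: raw copy with dd, then verify the two are byte-identical.
# Hypothetical helper; SRC and DST are ordinary paths here, not real devices.
copy_image() {
    local src=$1 dst=$2
    dd if="$src" of="$dst" bs=56k 2>/dev/null || return 1
    cmp -s "$src" "$dst"
}

# Demonstration on scratch files only.
work=$(mktemp -d)
head -c 200000 /dev/urandom > "$work/partition.img"   # stand-in for /dev/rsd1b

# 1. Take the insurance copy before attempting any repair.
copy_image "$work/partition.img" "$work/backup.img" && echo "backup verified"

# 2. Simulate a repair attempt that makes things worse.
printf 'damage' | dd of="$work/partition.img" conv=notrunc 2>/dev/null

# 3. Run the copy in the reverse direction to get back to the starting point.
copy_image "$work/backup.img" "$work/partition.img" && echo "restore verified"

rm -rf "$work"
```

On the real system the filesystem would have to be unmounted, or the machine
in single-user mode as Don says, so that the image is taken from a partition
that is not changing underneath dd.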

2.
kevin@uniq.com.au
 
Kevin thought that it ''Looks like the inode entry for that got corrupted.
I'd say run an fsck and expect some unknown file types..''

3.
strombrg@hydra.acs.uci.edu

Dan suggested the following:
''Try backing up the filesystem with a sun or gnu tar or cpio command,
that skips your eerie file.
Then try fsck.
You may have a bad block, something may have written to the
filesystem's device file mistakenly, or there may (off chance) be a
filesystem bug that caused it.''
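
One cheap way to probe Dan's second hypothesis is to check whether the garbage
fields in the ls listing decode to printable text. The sketch below assumes
bash, big-endian byte order (the machine is a SPARC), and field widths of 16
bits for the link count, uid and gid and 32 bits for the size. These widths
are assumptions made for illustration, and to_ascii is a hypothetical helper.

```shell
#!/usr/bin/env bash
# to_ascii VALUE NBYTES: print VALUE as NBYTES big-endian bytes, showing
# printable ASCII characters as-is and every other byte as '.'.
to_ascii() {
    local value=$1 nbytes=$2 out="" bits byte
    for (( bits = (nbytes - 1) * 8; bits >= 0; bits -= 8 )); do
        byte=$(( (value >> bits) & 0xFF ))
        if (( byte >= 32 && byte < 127 )); then
            # %b expands the \0NNN octal escape into the character itself.
            out+=$(printf '%b' "\\0$(printf '%03o' "$byte")")
        else
            out+="."
        fi
    done
    printf '%s\n' "$out"
}

# Values taken from the corrupted entry in the ls -lg output.
echo "links 18516      -> $(to_ascii 18516 2)"        # HT
echo "uid   21364      -> $(to_ascii 21364 2)"        # St
echo "gid   29285      -> $(to_ascii 29285 2)"        # re
echo "size  1634956090 -> $(to_ascii 1634956090 4)"   # ass:
```

Under those assumptions every one of these fields decodes to printable
characters. That may be coincidence, and it proves nothing about the cause,
but it is at least compatible with text having been written over the root
inode rather than with a random hardware fault.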

4.
wtilford@lgc.com

Wayne responded with the following:
''Nikos,
it's time to get out the backup tape my friend. Try to run fsck.
I think that you are going to make a new filesystem on that partition
and reload the data. That may be your only fix for this. But you might
try to use fsck -b and try to use a backup superblock, although I
have never had this help me yet. Use format and see if using the
backup disk label helps. As far as to why it happened, I can't tell
from the info you gave, but it's clear that the filesystem is corrupted''

5.
bader@nadc.nadc.navy.mil

Kim mentioned the following (which I had checked in the first place, but
thanks anyway):
''The file mode looks like swap space. Are you sure you are not swapping on
it anywhere. It's just a thought. Someone else may be able to help you
better. Good Luck.''

CONCLUSION
==========
As you can see, nearly everyone suggested running fsck, so that is
exactly what I did. After answering "yes" to a long series of repair
questions, I had my filesystem back in its original state:
its size was OK, and no loss of data or misbehavior has been observed
since then. As for what caused the problem, perhaps I will find the
reason as I learn more about system administration.

Thanks again to everyone who helped, and sorry for leaving you without
a summary for so long.

nmanou@leon.nrcps.ariadne-t.gr (Nikos Manoussakis)


