SUMMARY: top hangs and can't be killed -9

From: Christian Haul (haul@dvs1.informatik.tu-darmstadt.de)
Date: Thu Mar 23 2000 - 12:14:41 CST


Original Question:
> Hi sun-managers,
>
> Yesterday I replaced two disks on one of our servers (SS10MP, Solaris
> 2.5.1): one was frequently reporting SCSI errors (it contains /, /usr,
> swap, /opt and /cache) and the other needed massive physical help
> (punches to the disk case) to spin up (it contains swap and user
> files). Beforehand I copied the disks using tar cpf - | tar xpf -.
>
> Now some strange behaviour occurs: when running "top", the command
> hangs and cannot be terminated by ^C, kill, or kill -9. Apart from that
> the system runs as usual, but some other programs (e.g. Netscape
> Enterprise 3.5x) also stop during startup, although they can be
> terminated by ^C. Interestingly enough, an Informix database server
> doesn't seem to be affected.
>
> Swapping the old boot disk back in doesn't help. This leads me to the
> conclusion that the cause of the problem existed before the disk swap
> and surfaced only because of the reboot. Unfortunately, I am not aware
> that anything has changed on that particular machine, and I have no
> idea what to look for.
>
> For the initiated among you, these are the last lines of a truss -f top:
>
> 1262: open("/usr/lib/libc.so.1", O_RDONLY) = 4
> 1262: fstat(4, 0xEFFFF21C) = 0
> 1262: mmap(0xEF7B0000, 4096, PROT_READ|PROT_EXEC, MAP_SHARED|MAP_FIXED, 4, 0) = 0xEF7B0000
> 1262: mmap(0x00000000, 622592, PROT_READ|PROT_EXEC, MAP_PRIVATE, 4, 0) = 0xEF680000
> 1262: munmap(0xEF6FF000, 61440) = 0
> 1262: mmap(0xEF70E000, 28448, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 4, 516096) = 0xEF70E000
> 1262: mmap(0xEF715000, 8552, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xEF715000
> 1262: close(4) = 0
> 1262: open("/usr/dt/lib/libdl.so.1", O_RDONLY) Err#2 ENOENT
> 1262: open("/tools/pd/wien/x11r6/lib/libdl.so.1", O_RDONLY) Err#2 ENOENT
> 1262: open("/tools/com/paris/objectstore/sunpro/lib/libdl.so.1", O_RDONLY) Err#2 ENOENT
> 1262: open("/usr/lib/libdl.so.1", O_RDONLY) = 4
> 1262: fstat(4, 0xEFFFF21C) = 0
> 1262: mmap(0xEF7B0000, 4096, PROT_READ|PROT_EXEC, MAP_SHARED|MAP_FIXED, 4, 0) = 0xEF7B0000
> 1262: close(4) = 0
> 1262: open("/usr/platform/SUNW,SPARCstation-10/lib/libc_psr.so.1", O_RDONLY) Err#2 ENOENT
> 1262: close(3) = 0
> 1262: open("/usr/platform/SUNW,SPARCstation-10/lib/libkvm_psr.so.1", O_RDONLY) Err#2 ENOENT
> 1262: brk(0x000325B0) = 0
> 1262: brk(0x000345B0)
>
> TIA,
>
> Christian Haul
>

I got replies from
  Gabriel Rosenkoetter
  Jarrett Carver
  Leif H Ericksen
  Russ Poffenberger

who were also kind enough to discuss the topic in private mail. Thanks
a lot.

Despite much time spent on trial and error, I was not able to solve
the problem. One reply suggested that tar might have been the cause,
but the problem occurs even with a copy restored from tape
(ufsrestore). This copy is believed to reflect a good system state.

To make the issue even more interesting, I found today that the
problems vanish about 15 minutes after a reboot. This is not really
good news, but I might be able to live with it.

I'd like to comment on an article I found on sun-managers about a
remotely related issue. It stated that it is not possible to change
the SCSI ID of the boot disk without reinstalling the system. I found
that it does work, provided the device entries for the new SCSI ID are
created before the change and the vfstab entries are changed
beforehand. At least it did for me with Solaris 2.5.1. I should add
that creating the missing devices from another system with
"drvconfig -r /path/to/mounted/boot/disk"
"devlinks -r /path/to/mounted/boot/disk"
"disks -r /path/to/mounted/boot/disk"
did not work for me. Using (cd /; tar cpf - devices dev) | (cd
/path/to/mounted/boot/disk/; tar xpf -) did work.
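
For anyone who wants to repeat the two steps above, here is a rough
sketch as shell functions. It is only an illustration, not the exact
commands I typed: the function names and the controller/target names
(c0t3d0, c0t1d0) are made up for the example, and the sed pattern
assumes the usual /dev/dsk/cXtYdZsN and /dev/rdsk/cXtYdZsN entries in
vfstab.

```shell
#!/bin/sh
# Sketch only: helper names and disk names are hypothetical.

# Copy the devices and dev trees from one root onto another mounted root,
# using the same tar pipe as in the text above.
copy_device_trees() {
    src_root=$1
    dest_root=$2
    [ -d "$src_root/devices" ] && [ -d "$src_root/dev" ] || {
        echo "missing devices or dev under $src_root" >&2
        return 1
    }
    [ -d "$dest_root" ] || {
        echo "no such directory: $dest_root" >&2
        return 1
    }
    (cd "$src_root" && tar cpf - devices dev) | (cd "$dest_root" && tar xpf -)
}

# Print a copy of a vfstab file with every slice of the old disk name
# replaced by the new one, in both the dsk and rdsk columns.
# The original file is not modified.
retarget_vfstab() {
    vfstab_file=$1
    old_disk=$2
    new_disk=$3
    sed "s|dsk/${old_disk}s|dsk/${new_disk}s|g" "$vfstab_file"
}
```

Typical use would be something like
copy_device_trees / /path/to/mounted/boot/disk
followed by
retarget_vfstab /path/to/mounted/boot/disk/etc/vfstab c0t3d0 c0t1d0
with the output checked by eye before it replaces the real vfstab.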

        Christian


