Summary : system keeps hanging once in a while ....

From: Kam Tim Chan (tim@prelude.tcs.com)
Date: Thu Dec 19 1991 - 17:58:27 CST


Hi everyone,

        I am sorry this summary is so late, but it's the same old story ...

Back in Oct/Nov, I asked for assistance with the hanging problem :

>Hi everyone,
>
> I have a SS2, w/ 40M RAM, 1 SBus prestoserve board, 1 internal
>Fujitsu 520MB disk, 3 external Maxtor P-12S 1.2G disk drives. Lately
>for these 2 weeks or more, we have been experiencing system hanging problem.
>Occasionally, the system will hang and not responding to nothing, the
>only way out is to reboot. It's running SunOS 4.1.1b.
> Therefore, I go to the "new" command mode, and hit "sync" trying
>to sync the disk and reboot, every time it'll say "give up" after display
>a whole bunch of same numbers. And then at the end of the dumping
>process, it'll give me a couple of console messages :
>
>sd1: SCSI transport failed: reason 'reset': retrying command
>sd2: SCSI transport failed: reason 'reset': retrying command
>sd3: SCSI transport failed: reason 'reset': retrying command
>esp0: Target 0 now Synchronous at 4.0 mb/s max transmit rate
>esp0: Target 1 now Synchronous at 4.0 mb/s max transmit rate
>esp0: Target 2 now Synchronous at 4.0 mb/s max transmit rate
>
>so, I thought that it must be the SCSI bus, since we added "sd1" 2 weeks
>ago, and we may have a bad cable or bad drive, or may be even a defective
>SCSI host adaptor. Therefore, I replaced the CPU, the internal disk,
>and "sd1" with brand new cable. By the way, the external SCSI cables are
>6' + 2' + 2', so I don't think cable length is a problem. The hard part is
>that everytime it reboots, it'll come back fine and no problem, and it seems
>to hang at random. It is basically a file server, although it also has
>"sybase" and "adc" (configuration management software) running, but that's
>about it.
>
>I've saved the core file and did some "ps akx vmunix.1 vmcore.1" to see
>what's running at the time system hangs and almost everytime when it hangs
>swapper and pagedaemon are both "runable" and all nfsd are either in
>"disk wait" or "runable" states .... :
>

Thanks to all the responses.

Some of you suggested checking the cable length and termination. I had
already replaced all cables and terminators before I sent out that help
message, and I checked them again anyway, but that was not the problem.
Others pointed to a tcp/ip loopback bug that shows up when using
"sybase"; the fix is patch id 100159-01, whose README Synopsis is :

SunOS 4.1, 4.1.1:system hangs using sockets in local loopback tcp-ip

so I thought, way to go, and installed the patch. Unfortunately the
problem didn't go away. We also installed some other patches that I
received from Sun's Tech support, all related to SCSI problems ..., but
none of them helped.

Then I talked to our vendor AnDataCo again. They mentioned that some of
their other clients have had similar problems when running Sybase on
this kind of disk, the newer Maxtor Panther P-12S with revision JB21.
So I moved all of the Sybase stuff over to another server, and the
problem seemed to go away !!! I still received some "data transfer
overrun" warnings from time to time, but the system didn't hang.
AnDataCo then told me to turn off the read-ahead cache on those Maxtor
disks, which I did, and the system is as healthy as it can be now :-),
well, at least no warnings and no hangs :-).

So my conclusion is that an enabled read-ahead cache on a Maxtor Panther
P-12S will cause data transfer overruns, and then if Sybase is running,
its processes (dataserver, maybe) are not able to recover from the SCSI
error and block others from accessing the disk ... or something like
that. So make sure you turn off the read-ahead cache; AnDataCo said
they don't think it will cost much in performance.
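
To check whether the warnings really stop after a change like this, the
console messages can be tallied per disk. What follows is only a minimal
sketch in Python, assuming the messages have been saved as lines of text.
The function name tally_scsi_messages is hypothetical, the "transport
failed" pattern is taken from the messages quoted above, and the match on
the phrase "data transfer overrun" is an assumption, since the exact
wording of that warning is not shown here.

```python
import re
from collections import Counter

# Pattern for the message format quoted in the original post, e.g.
#   sd1: SCSI transport failed: reason 'reset': retrying command
# search() is used so a leading timestamp or host prefix does not matter.
TRANSPORT_RE = re.compile(r"\b(sd\d+): SCSI transport failed: reason '([^']*)'")

# Assumption: overrun warnings contain this phrase somewhere in the line.
# The exact console wording is not given in the post.
OVERRUN_PHRASE = "data transfer overrun"


def tally_scsi_messages(lines):
    """Count transport failures per (disk, reason) and overrun warnings."""
    failures = Counter()
    overruns = 0
    for line in lines:
        match = TRANSPORT_RE.search(line)
        if match:
            failures[(match.group(1), match.group(2))] += 1
        elif OVERRUN_PHRASE in line.lower():
            overruns += 1
    return failures, overruns


if __name__ == "__main__":
    sample = [
        "sd1: SCSI transport failed: reason 'reset': retrying command",
        "sd2: SCSI transport failed: reason 'reset': retrying command",
        "sd3: SCSI transport failed: reason 'reset': retrying command",
        "esp0: Target 0 now Synchronous at 4.0 mb/s max transmit rate",
    ]
    failures, overruns = tally_scsi_messages(sample)
    for (disk, reason), count in sorted(failures.items()):
        print(f"{disk}: {count} transport failure(s), reason '{reason}'")
    print(f"overrun warnings: {overruns}")
```

Running the same tally on messages collected before and after turning off
the read-ahead cache would show whether one disk stands out and whether
the overrun count actually drops to zero.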

Thanks again for all of your responses ....

                                                Tim Chan
================================ Address ===============================
Tim Chan, System Engineer, Teknekron Communications Systems
        (510)-649-3645 2121, Allston Way, Berkeley, CA94704
Internet : tim@tcs.com uucp : uunet!tcs!tim


