SUMMARY: Mysterious 4/470 crash

From: Danielle Druery (fmrco!moby!dani@uunet.uu.net)
Date: Fri Feb 15 1991 - 16:45:49 CST


My sincere gratitude to all who responded. We are back
up and running 24 hours/day without any crashes and the management
is happy about Suns again.
 
I apologize for the delay in summarizing but I wanted
to wait until enough time had passed to make sure the fix we installed
was truly successful.

Hal Stern hit the nail on the head. Since we are running intense
Sybase applications locally on the Sun via telnet we encountered
a bug (bugid #1039406/1046009) with the TCP/IP loopback code. We installed
the patch (sun patch 100159-01) and the mysterious crash went away.
Actually, the crash started to occur almost daily whenever the users
were producing a ton of reports using Sybase. This led us to believe
that perhaps the software might be causing the problem instead of hardware as
we originally suspected (particularly since we pretty much replaced all
the hardware except the casing!).

I've decided to include all the responses in full (even though it makes this
message real long) just in case some other poor soul has a mysterious
problem and finds the advice applicable.

Again, thanks to all. And for those of you that were concerned about
your investments at Fidelity - don't worry... everything is safe and sound!

Danielle Druery => fmrco!moby!dani@uunet.uu.net
Fidelity Investments- ZQ1
82 Devonshire St.
Boston, MA 02109
(617) 439-1854

======================================================================
================== ORIGINAL REQUEST ==================================

>Greetings sun-managers.....
>
        Here's a problem I hope you can all sink your teeth into !!!!
        This is an URGENT request for help.......

System: Sun 4/470
O/S: SunOS 4.1 PSR_A (no patches)
Specs:
        - Sun 4400 CPU board
        - Sun 32-Mbyte ECC Memory Board
        - Sun-3 SCSI Controller
        - ISP-80 Disk Controller
        - 2 * 1.2GB CDC IPI disks
        - 1 External 8mm Exabyte Tape Drive

Major Software Application: Sybase SQL Server, users telnet from PCs

The systems are located in a controlled regulation computer room.

THE PROBLEM:
-----------

In recent months we have experienced mysterious intermittent failures
on two of our (newly installed) Sun 4/470 servers.
Apparently, the systems get into a "hung" state such that any
current processes or attempted logins are halted. The monitor freezes,
the keyboard is ignored (L1-A doesn't even work), and all perfmeters
display "RIP". "Ping" is the only successful command. The CPU LEDs cycle
very slowly and the CPU light on the memory board is solid (instead of
blinking as usual). We've been able to recover the systems only by hitting
the RESET switch on the CPU board. We once waited about an hour to see
if the system would come back to life on its own (it didn't) but mostly we've
had to reset immediately for our production applications.

This situation has occurred at least six times in the last
few months. There is no established pattern nor can we reproduce
the "hung state" at will. There are no messages left in /var/adm/messages
nor is the core dumped. In short, it is a complete mystery !!!!

ACTION to DATE:
--------------
We are approaching this problem from every angle... hardware,
software, and network ....

        1. began a "not so scientific" tedious process of swapping out
        all the boards. Problem stopped occurring on one system after
        the last memory board was swapped but it still occurs on
        our PRODUCTION system after swapping memory and cpu boards.
        All boards have been reseated.

        2. attached lanalyzer to monitor any bizarre network traffic

        3. setup a power meter to determine if spikes (up or down)
        coincide with the failures even though the room is equipped with UPS

        4. run a script which logs the output of ps -auxww & ps -alxww
        every 30 seconds
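
A minimal sketch of what such a logging script might look like follows. The
original script was not posted, so the names LOG, INTERVAL, COUNT and
snapshot are assumptions made for illustration only. Each snapshot is
preceded by a timestamp header and followed by a sync, so that the most
recent entries have a better chance of surviving a hard reset.

```shell
#!/bin/sh
# Hypothetical reconstruction: LOG, INTERVAL, COUNT and snapshot() are
# illustrative names, not taken from the original post.
LOG=${LOG:-/tmp/ps-snapshots.log}
INTERVAL=${INTERVAL:-30}   # seconds between snapshots
COUNT=${COUNT:-1}          # number of snapshots to take before exiting

snapshot() {
    # Append one timestamped pair of process listings to the log.
    {
        echo "=== `date` ==="
        ps -auxww
        ps -alxww
    } >> "$LOG" 2>&1
    sync   # push the log to disk in case the machine hangs next
}

i=0
while [ "$i" -lt "$COUNT" ]; do
    snapshot
    i=`expr $i + 1`
    # Sleep only between snapshots, not after the last one.
    [ "$i" -lt "$COUNT" ] && sleep "$INTERVAL"
done
```

Run with a large COUNT (for example COUNT=100000 INTERVAL=30) to keep it
going in the background; the default takes a single snapshot and exits.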

Has anyone seen something like this before ???
I would greatly appreciate any diagnostic programs, procedures, tools,
insights, ideas, or any related information. I will summarize for the
net or anyone who requests.

=======================================================================
========================== RESPONSES ==================================
=======================================================================
From: uunet!East.Sun.COM!stern (Hal Stern - Consultant)

have you installed the patch for bugid #1039406/1046009 (sun patch 100159-01)?

there is a bug in the TCP/IP loopback code that can cause a machine to hang
with pretty much the symptoms you've described. sybase is one of the things
that can trigger it, although it has to be local sybase usage (maybe doing
a database dump or update on the server?)

in either case, a patch is available from sun (or from our local office).
you should give it a try and see if it cures the problems.

==========================================================================
From: uunet!bit!markm (Mark Morrissey)

I think you may be seeing a manifestation of the nfsd/biod thrashing
problem. The way to be sure is let the system stay hung for a long
time (I think we let ours go for about 2.5 hours before it came back).

Also, just to be safe, get the 4.1 NFS jumbo patch and see if things
don't get better. I am not sure if it is rolled into 4.1.1, so it
is hard to say if getting the jumbo patch or going to 4.1.1 is a
better step.

I am not positive that this will help, but it won't hurt to get those
patches into the system.

--mark
=========================================================================

From: uunet!cs.Buffalo.EDU!kensmith (Ken Smith)

My vote goes for finding your problem with the lanalyzer. I've seen
servers go into the state you described. At first I was mystified
too but finally found that when the server went into this state there
was a client machine somewhere on the network pounding the h*ll out
of the server with network traffic. Rebooting the client that was
pounding on the server made the server happy again. I later figured
out what the cause of the problem was : we switched an executable file
on the server and the client was running that executable. When the
client needs to page in a piece of that executable that it had allowed
to slip out of main memory (not copying it to swap space, assuming that
the executable's file would always be there while the executable was
running which is a valid assumption on non-NFS systems...) it sends
an NFS request to the server asking for that chunk of the executable
back. The server generates an error since the file isn't there any
more but the client doesn't accept that as an answer and makes the
request again (and again and again and ...).

I don't have any 4/470's around but I know a SparcStation can hammer on
a 3/280 so hard that the 3/280 is completely crippled. I don't know if
anything you could have as a client can hammer hard enough to cripple
a 4/470 that bad or not... Basically your server is spending all its
time in the kernel servicing the network traffic from the disgruntled
client and doesn't have any time left to do anything else...

I'm curious to know what the real problem is when you find it...

        ken
=====================================================================
From: uunet!ASC.SLB.COM!holle

Are you, by chance, running a screenblank or screensaver
program? I had the same problem when I had screenblank
running in /etc/rc.local.

-Kathy Holle

(EDITORS NOTE: yes, we were running screenblank out of rc.local but removed
it when we received this message.... I wish the solution could have been
that easy!)

=======================================================================
From: uunet!ucsd.edu!mrwallen (Mark R. Wallen)

We had a similar problem with a 4/370 used for instruction.
Under heavy loads caused by LISP, the machine would wedge
and only a power cycle would get it to come back to life.
(We have an ordinary terminal as the console, and BREAK would
not get into the PROM monitor as it should (equivalent to L1-A).)

After swapping virtually everything, we finally tried another CPU
of a higher revision and the problem went away. I believe that
the original CPU was something like rev 17 and the one that worked
was rev 24. (Don't hold me to the actual numbers, though
I can probably get them if you wish). The first CPU swap we
tried was about the same rev level as our failing one and it
too failed. It seemed to require a revision quite a bit higher
to fix the problem.

Mark Wallen

========================================================================
From: uunet!ica.philips.nl!geertj (Geert Jan de Groot)

Don't have the slightest idea yet why your 4/470's misbehave. Can you
tell me:
a. the average load;
b. do you do anything weird with adding swapspace? this can kill a machine
c. I'm really interested in the output by 'ps' etc just before a machine
   goes down.
d. Feed a sun 3/60 from the same power plug as your 4/470's.
   3/60 have a bad power supply and will go first if power trouble.
(EDITORS NOTE: is this like bringing a canary into the coal mine ?!)
e. Junk on your console port?
f. Start using kadb, then start vmunix (b id()kadb), and try to
   get to kadb when the machine dies
g. I don't like sunscreens for console. disconnect keyboard and screen,
   and hook up a vt100 on ttya. see if that reacts when the machine dies.
   you need to reset (k2) the CPU to recognize the terminal as console.
i. Have you added boards? Double check jumpers on the backplane!
j. Are you using NFS? is the machine serving clients?
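
For item c, one way to pull out the last 'ps' output before a hang is
sketched below. It assumes a periodic ps log in which every snapshot begins
with a header line starting with "=== "; that log format, the function name
last_snapshot and the SAMPLE file are assumptions for illustration, not
something described in the thread.

```shell
#!/bin/sh
# last_snapshot FILE: print only the final snapshot from a periodic ps log.
# Assumes (hypothetically) that each snapshot starts with a "=== " header.
last_snapshot() {
    awk '/^=== / { buf = "" }        # new header: discard the earlier snapshot
         { buf = buf $0 "\n" }       # accumulate lines of the current snapshot
         END { printf "%s", buf }' "$1"
}

# Demonstration on a tiny made-up log (not real data from the 4/470).
SAMPLE=${SAMPLE:-/tmp/ps-sample.log}
printf '%s\n' '=== t1 ===' 'first' '=== t2 ===' 'second' > "$SAMPLE"
last_snapshot "$SAMPLE"
```

Only the text from the final header onward is printed, which is the listing
closest to the moment the machine stopped responding.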

Please keep me posted! Maybe I can think of something with more info.

Good luck,

Geert Jan

==========================================================================

From: "Scot Gardner" <uunet!sun1.ise.ufl.edu!scot>

I'm not a Sun genius, however, I have had the "mysterious" hang
problem on my suns for quite a while. I've been told it's a
NFS bug where the machine gets a process in disk wait and
hangs after a minute or so. I forgot the exact problem, but
I hear there is a patch for Sun OS 4.1. If you use NFS quite
a bit, ask SUN about this.
Scot Gardner
==========================================================================

From: uunet!wiau.medical-biophysics.manchester.ac.uk!rick

When your system crashes and you reset, you get the boot prompt ">".

Then type g0; this will put the contents of memory into the swap space.

Once the system is up, run "savecore"; this will give you vmunix.0 and
vmcore.0 files.

You can then use utilities like ps (with the -k option) and adb (see the adb
user's guide; there is a section on kernel crashes).

This will tell you what was going on when you reset the system, and it's a
load less effort than ps every 30 seconds.

If you find a software problem, Sun (or whoever) can take these files and
look at them more thoroughly to determine the problem.

Hope this helps.

RICK DIPPER, Wolfson Image Analysis Unit, rick@uk.ac.man.mb.wiau
Department of Medical Bio-Physics, University of Manchester 061-275-5158

======================================================================

From: uunet!edsr!jcn (Jim Niemann)

Are you sure you are not running the SUN DBE software?
Here is patch info that seems to describe your problem
very nicely.

Patch-ID# 100119-01
Keywords: asynch i/o system hangs SunDBE Sybase SQL
Synopsis: SunDBE patches.system hangs with more than 100 concurrent asynch calls

Date: 17-Oct-1990
 
SunOS RELEASE: 4.1 4.1PSR-A
 
Topic: SunDBE asynchronous i/o fixes
 
BugID's fixed with this patch: 1044422 1044424
 
Architectures for which this patch is available: sun4, sun4c, sun4/490
 
Obsolete By:
 
Problem Description:

        1. 1044422

        When more than 100 asynchronous i/o requests are outstanding, the
        system will hang. The problem will most likely occur when SunDBE is
        running with several Sybase SQL servers, although it can also occur
        with one SQL Server and one or more other applications that are also
        using asynch i/o. It can also happen when no Sybase SQL Servers are
        running, but several applications are doing large numbers of
        simultaneous asynch i/o's. The symptoms are that the system hangs and
        the following message is printed to the console:

        SUNDBE: asynch i/o in use - this is not an error

        The problem is caused by the number of outstanding asynchronous i/o's
        being greater than a fixed limit (100).

        2. 1044424

        The Sybase SQL Server hangs with status D (when displayed with ps(1))
        and the SQL Server cannot be killed. This error occurs very rarely,
        on 4/260 and 4/280 machines, and only when the system is paging.
        The problem can be avoided by not using Unix files for the database
        and having only the SQL Server running on the machine (this avoids
        paging which caused the problem).
 
INSTALL:

SunDBE 1.0 must be installed before installing this patch. Do not install this
patch if you are not using SunDBE 1.0. This patch is for the following
machines running a SunDBE kernel:

   patch
   sub-directory   machine

   sun4            Sun-4/260, Sun-4/280, SPARCserver 330, SPARCserver 370
   sun4c           SPARCserver 1, SPARCserver 1+
   sun4_490        SPARCserver 390, SPARCserver 470, SPARCserver 490

As root do the following:

    1. copy patch files to appropriate directories

           # cd <to appropriate patch sub-directory -- see above>
           # cp dbe_asynch.h /usr/sys/sys
           # cp dbe_asynch.o /usr/sys/sun4{c}/OBJ

                   (where sun4{c} = sun4 or sun4c)

    2. A custom SunDBE kernel must be made and installed in order for the
       changes to take effect.

       make/remake and re/install a SunDBE custom kernel as explained in
       Chapter 3 of the Sun Database Excelerator Release Manual

=======================================================================

From: uunet!umiacs.UMD.EDU!steve (Steve D. Miller)

   Just because the system doesn't dump core on its own doesn't mean that
you can't make it dump core. Uncomment the savecore stuff in /etc/rc.local
(or rc.whatever, I can never keep straight which one has savecore in it,
and I don't have a Sun handy right now to check). Then check out the
PROM manuals and figure out how to stuff something nasty (0 or -1/ffffffff)
into the PC. I'd be very surprised if there isn't some way to do this
on a SPARC machine; I know that you can use the "r" PROM command on Sun-3s
for this purpose, 'cause I've done so many a time. I just haven't yet
needed to do this on a SPARCstation.

        -Steve

===========================================================================
From: uunet!megatek!felixw (Felix Wisgo CAE account)

Danielle/Rick--

        When you replaced memory and cpu boards, did you also
replace the SIMMs on the boards??? It seems to me, I had a problem
similar to yours a few years ago and it boiled down to a few bad
memory chips. The architecture was different (Sun-3s), but it
may be worthwhile to try swapping out memory chips. Whatever
the problem, I'd be interested in hearing your solution.

========================================================================

From: Doug Peterson <uunet!USAN.consult.com!doug>

Danielle -

I have seen a similar problem before, when I had intermixed SPARCs and
68020s. Two of the SPARCs were YP slave servers, and would take over the
YP queries (now DNS), and eventually hang, because they couldn't get updated
information from the non-SPARC master.

Also, I've seen some discussion about the P-MEG problem severely impairing
performance when installed RAM went above 16MB. However, I thought that this
was fixed in 4.1.

I doubt that your problem is electrical/electronic. The power supplies used
in Suns are excellent power conditioners.

=============================================================================

From: uunet!einstein.eds.com!wazir (Deborah Wazir)

A few months ago our server crashed due to a power failure, and wouldn't
boot. It would get to the file system checks and then exhibit the symptoms
described in Danielle's posting to Sun-Managers. We swapped out everything,
but the problem kept coming back.

Finally, we figured out what the problem was. When the server crashed, the
other machines on the network started sending packets to the server to see
if it was up yet. They did this because each machine on the net NFS-mounts
about 12 filesystems from the server, and the server NFS-mounts the /home
partition from each machine on the net. We have about 30 machines altogether.
So the longer the server was down, the more frequently the other machines
sent out their broadcasts. This is referred to as a "broadcast storm".
When the server tried to boot, it was fine until it got to the "ifconfig"
line. As soon as it initialized its ethernet interface, and gave it an
IP address, it was able to start answering the other machines, and that's what
was bogging it down. It was like all the machines were trying to re-establish
NFS connections at the same time, and the server was so busy trying to fend
them off that it never got to finish booting.

We noticed that if we disconnected the ethernet cable from the server, it would
finish booting. The lights on the back would speed up immediately.

What we ended up doing was halting all the machines on the network; we were
then able to boot our server easily. After that, we started up the other
machines one at a time.

Let me know if this helps you. If it turns out to be another problem, I'd
be interested in hearing about that, too.

Good luck --
Deborah S. Wazir
===========================================================================

>>>>> that's all for now..... -danielle


