SUMMARY: 4/280 hangs inexplicably

From: Andrew Bohonis (andrewb@cs.adelaide.edu.au)
Date: Mon May 27 1991 - 18:53:30 CDT


THE ORIGINAL PROBLEM:

> We have a sun 4/280 running 4.1.1 (unpatched) which becomes catatonic
> every one or two days. When this happens, the only thing the machine
> responds to is a break at the console. We have produced a crash dump
> which shows 40 or more processes running when examined with ps.
>
> I seem to recall a similar problem being described in this group recently,
> unfortunately I didn't save it. If there is a summary for this problem,
> I would appreciate it if someone would mail it to me.
Just to provide some more background, the machine has 5 diskless clients
(3/50s and 3/60s) and in addition is an NFS server to a couple of
SPARCstation 1s and IPCs.

As it turns out, I had imagined the summary or confused it with another,
similar problem. Quite a few people requested a copy of this summary, so
I thought I'd post this one after making sure the installed fix really worked.

THE SOLUTION:

         The solution (from Sven Ole Skrivervik <svenole@sdata.no>)
         was to install the rpc.lockd jumbo patch (patchid# 100075-06)
         which is available from princeton.edu:/pub/sun-fixes/sunos4.1.1
         The machine that used to toss its cookies about once a day has now
         been running healthily for about a week.

THANKS
to all the helpful people who replied. I have appended the messages to
this summary since, although it makes for a long posting, they may contain
useful advice for others experiencing a similar mysterious problem.

A FEW OTHER SUGGESTIONS:

antonson@software.org (Todd S. Antonson) suggested checking whether the
partitioning of the disks was correct.

jnapier@UCSD.EDU (Jim Napier)
and kwthomas@nsslsun.gcn.uoknor.edu (Kevin W. Thomas)
independently mentioned it might be a hardware-specific problem.
However, in these cases the computer would not even respond to a break or
L1-A at the console.

fmrco!moby!dani@murtoa.cs.mu.oz@uunet.uu.net (Danielle Druery)
told of a similar problem she encountered with a 4/470 (4.1) that was
cured by installing the tcp/ip loopback patch (patchid# 100159-01). She
also kindly forwarded the summary of responses to her problem.

-- 
Andrew Bohonis       
Department of Computer Science
University of Adelaide        
andrewb@cs.adelaide.edu.au
THE RESPONSES
From: antonson@software.org (Todd S. Antonson)
I don't know if this will help or not, but we have
a sun4/280 running 4.0.3 and it used to hang a lot.
I reformatted a disk that seemed to always have a
spot on it that fsck couldn't handle and it hasn't
hung since.  Check your disks to make sure the
partitioning is correct.
Good luck,   I'm looking forward to a summary as
I will be upgrading to 4.1.1 shortly.
--
=======================================================================
Todd S. Antonson                    |
Software Productivity Consortium    |UUCP : ..!uunet!software!antonson
SPC Building -- 2214 Rock Hill Road |CSNET: antonson@software.org
Herndon, Virginia 22070             |
=======================================================================
From: stern@sunne.East.Sun.COM (Hal Stern - Consultant)
what is the machine doing?  running as an NFS server?  what
are the processes doing?  what processes are they?  can
you send output of crash/ps to us?
--hal
From: jnapier@UCSD.EDU (Jim Napier)
Andrew-
I haven't seen any trouble with 4/280's running 4.1.1. We have one that's
been humming along like a champ without even so much as a hiccup since we
installed 4.1.1 about 2 months ago. I know there are other 4/280s on our
campus running 4.1.1 also without problems. It sounds like you have some
hardware specific problem (CPU going bad??). A little more detail on your
environment would help this discussion (e.g. does it serve a lot of diskless
clients, etc.). I don't remember anything like this being discussed recently
in sun-managers but if you get a copy of a summary on this topic I'd like
one also so I can be prepared. Thanks.
Jim Napier
Programmer/Analyst
Applied Mechanics & Engineering Sciences Dept.
U.C. San Diego
(619)534-5414
napier@ames.ucsd.edu, jnapier@ucsd.edu, ...!ucsd!jnapier
From: fmrco!moby!dani@murtoa.cs.mu.oz@uunet.uu.net (Danielle Druery)
Hi Andrew....
I had a similar "hanging" problem with a 4/470 (4.1).  I've enclosed
the summary which lists all the ideas people sent.  Perhaps one will help.
Our problem was solved by installing the tcp/ip loopback patch.
Hope this helps!!!
-danielle
***********************************************************************
To: sun-managers@eecs.nwu.edu
Subject: SUMMARY: 4/470 Mysterious Crash
My sincere gratitude to all who responded.  We are back
up and running 24 hours/day without any crashes and the management
is happy about Suns again.
I apologize for the delay in summarizing but I wanted
to wait until enough time had passed to make sure the fix we installed
was truly successful.
Hal Stern hit the nail on the head.  Since we are running intense
Sybase applications locally on the Sun via telnet we encountered
a bug (bugid #1039406/1046009) with the TCP/IP loopback code.  We installed
the patch  (sun patch 100159-01) and the mysterious crash went away.
Actually, the crash started to occur almost daily whenever the users
were producing a ton of reports using Sybase. This led us to believe
that perhaps the software might be causing the problem instead of hardware as
we originally suspected (particularly since we pretty much replaced all
the hardware except the casing!).
I've decided to include all the responses in full (even tho it makes this
message real long) just in case some other
poor soul has a mysterious problem and finds the advice applicable.
Again, thanks to all.  And for those of you that were concerned about
your investments at Fidelity - don't worry... everything is safe and sound!
Danielle Druery   => fmrco!moby!dani@uunet.uu.net
Fidelity Investments- ZQ1
82 Devonshire St.
Boston, MA 02109
(617) 439-1854
=======================================================================
========================== RESPONSES ==================================
=======================================================================
From: uunet!East.Sun.COM!stern (Hal Stern - Consultant)
have you installed the patch for bugid #1039406/1046009 (sun patch 100159-01)?
there is a bug in the TCP/IP loopback code that can cause a machine to hang
with pretty much the symptoms you've described.  sybase is one of the things
that can trigger it, although it has to be local sybase usage (maybe doing
a database dump or update on the server?)
in either case, a patch is available from sun (or from our local office).
you should give it a try and see if it cures the problems.
==========================================================================
From: uunet!bit!markm (Mark Morrissey)
I think you may be seeing a manifestation of the nfsd/biod thrashing
problem.  The way to be sure is let the system stay hung for a long
time (I think we let ours go for about 2.5 hours before it came back).
Also, just to be safe, get the 4.1 NFS jumbo patch and see if things
don't get better.  I am not sure if it is rolled into 4.1.1, so it
is hard to say if getting the jumbo patch or going to 4.1.1 is a
better step.
I am not positive that this will help, but it won't hurt to get those
patches into the system.
--mark
=========================================================================
From: uunet!cs.Buffalo.EDU!kensmith (Ken Smith)
My vote goes for finding your problem with the lanalyzer.  I've seen
servers go into the state you described.  At first I was mystified
too but finally found that when the server went into this state there
was a client machine somewhere on the network pounding the h*ll out
of the server with network traffic.  Rebooting the client that was
pounding on the server made the server happy again.  I later figured
out what the cause of the problem was : we switched an executable file
on the server and the client was running that executable.  When the
client needs to page in a piece of that executable that it had allowed
to slip out of main memory (not copying it to swap space, assuming that
the executable's file would always be there while the executable was
running which is a valid assumption on non-NFS systems...) it sends
an NFS request to the server asking for that chunk of the executable
back.  The server generates an error since the file isn't there any
more but the client doesn't accept that as an answer and makes the
request again (and again and again and ...).
I don't have any 4/470's around but I know a SparcStation can hammer on
a 3/280 so hard that the 3/280 is completely crippled.  I don't know if
anything you could have as a client can hammer hard enough to cripple
a 4/470 that bad or not...  Basically your server is spending all its
time in the kernel servicing the network traffic from the disgruntled
client and doesn't have any time left to do anything else...
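The endless retry Ken describes can be contrasted with a bounded retry in a short shell sketch. This is only an illustration and not the NFS client's actual code: `retry_bounded` and its arguments are hypothetical names, and doubling the pause between attempts is an assumption about one reasonable way to back off.

```shell
#!/bin/sh
# Hypothetical helper: run a command at most MAX times, doubling the pause
# between attempts, and give up instead of hammering the server forever.
# usage: retry_bounded MAX FIRST_DELAY_SECONDS command [args...]
retry_bounded() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0            # the request succeeded
        fi
        if [ "$attempt" -lt "$max" ]; then
            sleep "$delay"      # wait before asking again
            delay=$((delay * 2))
        fi
        attempt=$((attempt + 1))
    done
    return 1                    # report the error rather than retrying again
}
```

For example, `retry_bounded 5 1 some_command` would try at most five times, pausing 1, 2, 4 and 8 seconds between attempts, and then return failure to the caller.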
I'm curious to know what the real problem is when you find it...
        ken
=====================================================================
From: uunet!ASC.SLB.COM!holle
Are you, by chance, running a screenblank or screensaver
program?  I had the same problem when I had screenblank
running in /etc/rc.local.
-Kathy Holle
(EDITOR'S NOTE: yes, we were running screenblank out of rc.local but removed
it when we received this message.... I wish the solution
could have been that easy!)
=======================================================================
From: uunet!ucsd.edu!mrwallen (Mark R. Wallen)
We had a similar problem with a 4/370 used for instruction.
Under heavy loads caused by LISP, the machine would wedge
and only a power cycle would get it to come back to life.
(We have an ordinary terminal as the console, and BREAK would
not get into the PROM monitor as it should (equivalent to L1-A).)
After swapping virtually everything, we finally tried another CPU
of a higher revision and the problem went away.  I believe that
the original CPU was something like rev 17 and the one that worked
was rev 24.  (Don't hold me to the actual numbers, though
I can probably get them if you wish).  The first CPU swap we
tried was about the same rev level as our failing one and it
too failed.  It seemed to require a revision quite a bit higher
to fix the problem.
Mark Wallen
========================================================================
From: uunet!ica.philips.nl!geertj (Geert Jan de Groot)
Don't have the slightest idea yet why your 4/470's misbehave. Can you
tell me:
a. the average load;
b. do you do anything weird with adding swapspace? this can kill a machine
c. I'm really interested in the output by 'ps' etc just before a machine
   goes down.
d. Feed a Sun 3/60 from the same power plug as your 4/470's.
   3/60s have a bad power supply and will go first if there is power trouble.
(EDITOR'S NOTE: is this like bringing a canary into the coal mine ?!)
e. Junk on your console port?
f. Start using kadb, then start vmunix (b id()kadb), and try to
   get to kadb when the machine dies
g. I don't like sunscreens for console. disconnect keyboard and screen,
   and hook up a vt100 on ttya. see if that reacts when the machine dies.
   you need to reset (k2) the CPU to recognize the terminal as console.
i. Have you added boards? Double check jumpers on the backplane!
j. Are you using NFS? is the machine serving clients?
Please keep me posted! Maybe I can think of something with more info.
Good luck,
Geert Jan
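Item c above asks for ps output from just before the machine goes down. One way to collect it might be a small logging loop like the sketch below. The function name `log_ps`, the log file argument and the `PS_CMD` variable are all hypothetical, and the default `ps -aux` flags are an assumption that may need adjusting for your system.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped process listing to a log file
# every INTERVAL seconds, COUNT times, so that the last snapshot taken
# before a hang is still on disk afterwards.
# PS_CMD is an assumption; adjust the ps flags for your system.
PS_CMD=${PS_CMD:-"ps -aux"}

log_ps() {
    logfile=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        {
            echo "=== $(date) ==="
            $PS_CMD
        } >> "$logfile"
        sync                    # push the log to disk before the next wait
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Running `log_ps /var/tmp/ps.log 30 2880` would take a snapshot every 30 seconds for roughly 24 hours; the log path here is only an example.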
==========================================================================
From: "Scot Gardner" <uunet!sun1.ise.ufl.edu!scot>
I'm not a Sun genius, however, I have had the "mysterious" hang
problem on my suns for quite a while. I've been told it's a
NFS bug where the machine gets a process in disk wait and
hangs after a minute or so. I forgot the exact problem, but
I hear there is a patch for Sun OS 4.1. If you use NFS quite
a bit, ask SUN about this.
Scot Gardner
==========================================================================
From: uunet!wiau.medical-biophysics.manchester.ac.uk!rick
When your system crashes and you reset, you get the boot prompt ">".
Then type g0; this will put the contents of memory into the swap space.
Once the system is up, run "savecore"; this will give you vmunix.0 and
vmcore.0 files.
You can then use utilities like ps (with the -k option) and adb (see the adb
user's guide; there is a section on kernel crashes).
This will tell you what was going on when you reset the system, and it's a load
less effort than ps every 30 seconds.
If you find a software problem, Sun (or whoever) can take these files and
look at them more thoroughly to determine the problem.
Hope this helps.
RICK DIPPER, Wolfson Image Analysis Unit,                rick@uk.ac.man.mb.wiau
Department of Medical Bio-Physics, University of Manchester        061-275-5158
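The post-reboot steps in Rick's message could be written down as a small helper that only prints the commands for a given dump number, so they can be checked before being typed. This is a sketch under assumptions: `show_dump_commands` is a hypothetical name, the crash directory is whatever you pass in, and the exact argument order for savecore, ps and adb should be confirmed against the man pages on your release.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the commands for saving and then
# examining crash dump number N in directory DIR.
# The vmunix.N / vmcore.N naming follows the message above; the savecore,
# ps and adb argument order is an assumption, so check the man pages.
show_dump_commands() {
    dir=$1
    n=$2
    echo "savecore $dir"
    echo "cd $dir"
    echo "ps -axk vmunix.$n vmcore.$n"
    echo "adb -k vmunix.$n vmcore.$n"
}
```

Calling `show_dump_commands /var/crash 0` prints the four commands for the first dump; the directory name is only an example.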
======================================================================
From: uunet!edsr!jcn (Jim Niemann)
Are you sure you are not running the SunDBE software?
Here is patch info that seems to describe your problem
very nicely.
Patch-ID# 100119-01
Keywords: asynch i/o system hangs SunDBE Sybase SQL
Synopsis: SunDBE patches. System hangs with more than 100 concurrent asynch calls
Date: 17-Oct-1990
SunOS RELEASE: 4.1 4.1PSR-A
Topic: SunDBE asynchronous i/o fixes
BugID's fixed with this patch: 1044422 1044424
Architectures for which this patch is available: sun4, sun4c, sun4/490
Obsolete By:
Problem Description:
        1. 1044422
        When more than 100 asynchronous i/o requests are outstanding, the
        system will hang. The problem will most likely occur when SunDBE is
        running with several Sybase SQL servers, although it can also occur
        with one SQL Server and one or more other applications that are also
        using asynch i/o. It can also happen when no Sybase SQL Servers are
        running, but several applications are doing large numbers of
        simultaneous asynch i/o's. The symptoms are that the system hangs and
        the following message is printed to the console:
        SUNDBE: asynch i/o in use - this is not an error
        The problem is caused by the number of outstanding asynchronous i/o's
        being greater than a fixed limit (100).
        2. 1044424
        The Sybase SQL Server hangs with status D (when displayed with ps(1))
        and the SQL Server cannot be killed. This error occurs very rarely,
        on 4/260 and 4/280 machines, and only when the system is paging.
        The problem can be avoided by not using Unix files for the database
        and having only the SQL Server running on the machine (this avoids
        paging which caused the problem).
INSTALL:
SunDBE 1.0 must be installed before installing this patch. Do not install this
patch if you are not using SunDBE 1.0. This patch is for the following
machines running a SunDBE kernel:
   patch sub-directory   machine
   sun4                  Sun-4/260, Sun-4/280, SPARCserver 330, SPARCserver 370
   sun4c                 SPARCserver 1, SPARCserver 1+
   sun4_490              SPARCserver 390, SPARCserver 470, SPARCserver 490
As root do the following:
    1. copy patch files to appropriate directories
           # cd <to appropriate patch sub-directory -- see above>
           # cp dbe_asynch.h /usr/sys/sys
           # cp dbe_asynch.o /usr/sys/sun4{c}/OBJ
                   (where sun4{c} = sun4 or sun4c)
    2. A custom SunDBE kernel must be made and installed in order for the
       changes to take effect.
       make/remake and re/install a SunDBE custom kernel as explained in
       Chapter 3 of the Sun Database Excelerator Release Manual
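Choosing the right patch sub-directory from the machine list in the INSTALL section above can be expressed as a small lookup. The sketch below is illustrative only: `patch_subdir` is a hypothetical name that is not part of the patch, and the model strings must be spelled exactly as in that list.

```shell
#!/bin/sh
# Hypothetical helper: map a machine model to the patch sub-directory
# listed in the INSTALL section above. Model strings must match the
# spelling used there; anything else is reported as unknown.
patch_subdir() {
    case "$1" in
        "Sun-4/260"|"Sun-4/280"|"SPARCserver 330"|"SPARCserver 370")
            echo sun4 ;;
        "SPARCserver 1"|"SPARCserver 1+")
            echo sun4c ;;
        "SPARCserver 390"|"SPARCserver 470"|"SPARCserver 490")
            echo sun4_490 ;;
        *)
            echo "unknown model: $1" >&2
            return 1 ;;
    esac
}
```

For the Sun-4/280 discussed in this thread, `patch_subdir "Sun-4/280"` prints `sun4`, which is the directory to cd into before copying the patch files.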
=======================================================================
From: uunet!umiacs.UMD.EDU!steve (Steve D. Miller)
   Just because the system doesn't dump core on its own doesn't mean that
you can't make it dump core.  Uncomment the savecore stuff in /etc/rc.local
(or rc.whatever, I can never keep straight which one has savecore in it,
and I don't have a Sun handy right now to check).  Then check out the
PROM manuals and figure out how to stuff something nasty (0 or -1/ffffffff)
into the PC.  I'd be very surprised if there isn't some way to do this
on a SPARC machine; I know that you can use the "r" PROM command on Sun-3s
for this purpose, 'cause I've done so many a time.  I just haven't yet
needed to do this on a SPARCstation.
       -Steve
===========================================================================
From: uunet!megatek!felixw (Felix Wisgo CAE account)
Danielle/Rick--
        When you replaced memory and cpu boards, did you also
replace the SIMMs on the boards??? It seems to me, I had a problem
similar to yours a few years ago and it boiled down to a few bad
memory chips.  The architecture was different (Sun-3s), but it
may be worthwhile to try swapping out memory chips.  Whatever
the problem, I'd be interested in hearing your solution.
========================================================================
From: Doug Peterson <uunet!USAN.consult.com!doug>
Danielle -
I have seen a similar problem before, when I had intermixed SPARCs and
68020s. Two of the SPARCs were YP slave servers, and would take over the
YP queries (now DNS), and eventually hang, because they couldn't get updated
information from the non-SPARC master.
Also, I've seen some discussion about the P-MEG problem severely impairing
performance when installed RAM went above 16MB. However, I thought that this
was fixed in 4.1.
I doubt that your problem is electrical/electronic. The power supplies used in
Suns are excellent power conditioners.
=============================================================================
From: uunet!einstein.eds.com!wazir (Deborah Wazir)
A few months ago our server crashed due to a power failure, and wouldn't
boot.  It would get to the file system checks and then exhibit the symptoms
described in Danielle's posting to Sun-Managers.  We swapped out everything,
but the problem kept coming back.
Finally, we figured out what the problem was.  When the server crashed, the
other machines on the network started sending packets to the server to see
if it was up yet.  They did this because each machine on the net NFS-mounts
about 12 filesystems from the server, and the server NFS-mounts the /home
partition from each machine on the net.  We have about 30 machines altogether.
So the longer the server was down, the more frequently the other machines
sent out their broadcasts.  This is referred to as a "broadcast storm".
When the server tried to boot,  it was fine until it got to the "ifconfig"
line.  As soon as it initialized its ethernet interface, and gave it an
IP address, it was able to start answering the other machines, and that's what
was bogging it down.  It was like all the machines were trying to re-establish
NFS connections at the same time, and the server was so busy trying to fend
them off that it never got to finish booting.
We noticed that if we disconnected the ethernet cable from the server, it would
finish booting.  The lights on the back would speed up immediately.
What we ended up doing was halting all the machines on the network; we were
then able to boot our server easily.  After that, we started up the other
machines one at a time.
Let me know if this helps you.  If it turns out to be another problem, I'd
be interested in hearing about that, too.
Good luck --
Deborah S. Wazir
From: kwthomas@nsslsun.gcn.uoknor.edu (Kevin W. Thomas)
I was the one that described the problem, and thought it was an OS bug.
Unfortunately, about a day or so later, it started all over again.  I was
running X (MIT version, patchlevel 18), Motif 1.1, and an application that
was developed elsewhere.  Anywhere from a few seconds to a few minutes after
starting the application, the system would hang.  Running a "top" at one
second intervals shows nothing unusual.  I even had a telnet session from
another workstation, and set my priority above everything running, and that
froze too.
On one of my attempted crash dumps, I got a couple of characters followed
by an infinite loop of the message "Spurious interrupt at processor level 14".
No amount of L1-A's or disconnecting the keyboard cable would interrupt those
messages.  I had to turn the cpu power off and on to clear it.  At that stage,
I wouldn't let anyone run X on that system.  I also put in a service call.
Since the cpu board was replaced, I haven't had any problems.
        Kevin W. Thomas
-- that's the end --


This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:06:14 CDT