SUMMARY: 4/280 hangs inexplicably

From: Andrew Bohonis (andrewb@cs.adelaide.edu.au)
Date: Mon May 27 1991 - 18:53:30 CDT


THE ORIGINAL PROBLEM:

> We have a sun 4/280 running 4.1.1 (unpatched) which becomes catatonic
> every one or two days. When this happens, the only thing the machine
> responds to is a break at the console. We have produced a crash dump
> which shows 40 or more processes running when examined with ps.
>
> I seem to recall a similar problem being described in this group recently,
> unfortunately I didn't save it. If there is a summary for this problem,
> I would appreciate it if someone would mail it to me.
Just to provide some more background: the machine has 5 diskless clients
(3/50s and 3/60s) and is also an NFS server to a couple of
SPARCstation 1s and IPCs.

As it turns out I had imagined the summary or confused it with another,
similar problem. Quite a few people requested a copy of this summary, so
I thought I'd post this one after making sure the installed fix really worked.

THE SOLUTION:

         The solution (from Sven Ole Skrivervik <svenole@sdata.no>)
         was to install the rpc.lockd jumbo patch (patchid# 100075-06)
         which is available from princeton.edu:/pub/sun-fixes/sunos4.1.1
         The machine that used to toss its cookies about once a day has now
         been running healthily for about a week.

THANKS
to all the helpful people who replied. I have appended the messages to
this summary since, although it makes for a long posting, they may contain
useful advice for others experiencing a similar mysterious problem.

A FEW OTHER SUGGESTIONS:

antonson@software.org (Todd S. Antonson) suggested checking whether the
disks were partitioned correctly.

jnapier@UCSD.EDU (Jim Napier)
and kwthomas@nsslsun.gcn.uoknor.edu (Kevin W. Thomas)
independently mentioned it might be a hardware-specific problem.
However, in these cases the computer would not even respond to a break or
L1-A at the console.

fmrco!moby!dani@murtoa.cs.mu.oz@uunet.uu.net (Danielle Druery)
told of a similar problem she encountered with a 4/470 (4.1) that was
cured by installing the tcp/ip loopback patch (patchid# 100159-01). She
also kindly forwarded the summary of responses to her problem.

-- 
Andrew Bohonis       
Department of Computer Science
University of Adelaide        
andrewb@cs.adelaide.edu.au

THE RESPONSES

From: antonson@software.org (Todd S. Antonson)

I don't know if this will help or not, but we have a sun4/280 running 4.0.3 and it used to hang a lot.

I reformatted a disk that seemed to always have a spot on it that fsck couldn't handle and it hasn't hung since. Check your disks to make sure the partitioning is correct.

Good luck, I'm looking forward to a summary as I will be upgrading to 4.1.1 shortly.

--
=======================================================================
Todd S. Antonson                     |
Software Productivity Consortium     |UUCP : ..!uunet!software!antonson
SPC Building -- 2214 Rock Hill Road  |CSNET: antonson@software.org
Herndon, Virginia 22070              |
=======================================================================

From: stern@sunne.East.Sun.COM (Hal Stern - Consultant)

what is the machine doing? running as an NFS server? what are the processes doing? what processes are they? can you send output of crash/ps to us?

--hal

From: jnapier@UCSD.EDU (Jim Napier)

Andrew-

I haven't seen any trouble with 4/280's running 4.1.1. We have one that's been humming along like a champ without even so much as a hiccup since we installed 4.1.1 about 2 months ago. I know there are other 4/280s on our campus running 4.1.1 also without problems. It sounds like you have some hardware-specific problem (CPU going bad??). A little more detail on your environment would help this discussion (e.g. does it serve a lot of diskless clients, etc.). I don't remember anything like this being discussed recently in sun-managers, but if you get a copy of a summary on this topic I'd like one also so I can be prepared. Thanks.

Jim Napier
Programmer/Analyst
Applied Mechanics & Engineering Sciences Dept.
U.C. San Diego
(619)534-5414
napier@ames.ucsd.edu, jnapier@ucsd.edu, ...!ucsd!jnapier

From: fmrco!moby!dani@murtoa.cs.mu.oz@uunet.uu.net (Danielle Druery)

Hi Andrew....

I had a similar "hanging" problem with a 4/470 (4.1). I've enclosed the summary which lists all the ideas people sent. Perhaps one will help. Our problem was solved by installing the tcp/ip loopback patch. Hope this helps!!!

-danielle

***********************************************************************
To: sun-managers@eecs.nwu.edu
Subject: SUMMARY: 4/470 Mysterious Crash

My sincere gratitude to all who responded. We are back up and running 24 hours/day without any crashes and the management is happy about Suns again.

I apologize for the delay in summarizing but I wanted to wait until enough time had passed to make sure the fix we installed was truly successful.

Hal Stern hit the nail on the head. Since we are running intense Sybase applications locally on the Sun via telnet we encountered a bug (bugid #1039406/1046009) with the TCP/IP loopback code. We installed the patch (sun patch 100159-01) and the mysterious crash went away. Actually, the crash started to occur almost daily whenever the users were producing a ton of reports using Sybase. This led us to believe that perhaps the software might be causing the problem instead of hardware as we originally suspected (particularly since we pretty much replaced all the hardware except the casing!).

I've decided to include all the responses in full (even tho it makes this message real long) just in case some other poor soul has a mysterious problem and finds the advice applicable.

Again, thanks to all. And for those of you that were concerned about your investments at Fidelity - don't worry... everything is safe and sound!

Danielle Druery => fmrco!moby!dani@uunet.uu.net
Fidelity Investments- ZQ1
82 Devonshire St.
Boston, MA 02109
(617) 439-1854

=======================================================================
========================== RESPONSES ==================================
=======================================================================

From: uunet!East.Sun.COM!stern (Hal Stern - Consultant)

have you installed the patch for bugid #1039406/1046009 (sun patch 100159-01)?

there is a bug in the TCP/IP loopback code that can cause a machine to hang with pretty much the symptoms you've described. sybase is one of the things that can trigger it, although it has to be local sybase usage (maybe doing a database dump or update on the server?)

in either case, a patch is available from sun (or from our local office). you should give it a try and see if it cures the problems.

==========================================================================

From: uunet!bit!markm (Mark Morrissey)

I think you may be seeing a manifestation of the nfsd/biod thrashing problem. The way to be sure is to let the system stay hung for a long time (I think we let ours go for about 2.5 hours before it came back).

Also, just to be safe, get the 4.1 NFS jumbo patch and see if things don't get better. I am not sure if it is rolled into 4.1.1, so it is hard to say if getting the jumbo patch or going to 4.1.1 is a better step.

I am not positive that this will help, but it won't hurt to get those patches into the system.

--mark

=========================================================================

From: uunet!cs.Buffalo.EDU!kensmith (Ken Smith)

My vote goes for finding your problem with the lanalyzer. I've seen servers go into the state you described. At first I was mystified too, but finally found that when the server went into this state there was a client machine somewhere on the network pounding the h*ll out of the server with network traffic. Rebooting the client that was pounding on the server made the server happy again.

I later figured out what the cause of the problem was: we switched an executable file on the server and the client was running that executable. When the client needs to page in a piece of that executable that it had allowed to slip out of main memory (not copying it to swap space, assuming that the executable's file would always be there while the executable was running, which is a valid assumption on non-NFS systems...) it sends an NFS request to the server asking for that chunk of the executable back. The server generates an error since the file isn't there any more, but the client doesn't accept that as an answer and makes the request again (and again and again and ...).

I don't have any 4/470's around but I know a SparcStation can hammer on a 3/280 so hard that the 3/280 is completely crippled. I don't know if anything you could have as a client can hammer hard enough to cripple a 4/470 that bad or not... Basically your server is spending all its time in the kernel servicing the network traffic from the disgruntled client and doesn't have any time left to do anything else...
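
The retry behaviour described above can be illustrated with a toy model. The sketch below is not NFS code: the names `server_read` and `client_page_in`, the `StaleFile` error and the `max_attempts` cap are all invented for illustration. It only shows why a client that treats a permanent error as retryable keeps generating requests, while a client that accepts the error sends just one.

```python
# Toy model only: none of these names come from NFS or SunOS.

class StaleFile(Exception):
    """Permanent error: the file the client wants no longer exists."""


def server_read(files, name):
    """Pretend server: return the file contents or raise a permanent error."""
    if name not in files:
        raise StaleFile(name)
    return files[name]


def client_page_in(files, name, give_up_on_error, max_attempts=1000):
    """Return (data, requests_sent).

    max_attempts stands in for "forever" so that the demo terminates.
    """
    requests_sent = 0
    while requests_sent < max_attempts:
        requests_sent += 1
        try:
            return server_read(files, name), requests_sent
        except StaleFile:
            if give_up_on_error:
                return None, requests_sent
            # Misbehaving client: ignore the error and ask again at once.
    return None, requests_sent


if __name__ == "__main__":
    exported = {}  # the executable was replaced, so the old file is gone
    print(client_page_in(exported, "a.out", give_up_on_error=True))
    print(client_page_in(exported, "a.out", give_up_on_error=False))
```

In the second call every one of the capped attempts reaches the server, which is the kind of load that would show up on a network analyzer.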

I'm curious to know what the real problem is when you find it...

ken

=====================================================================

From: uunet!ASC.SLB.COM!holle

Are you, by chance, running a screenblank or screensaver program? I had the same problem when I had screenblank running in /etc/rc.local.

-Kathy Holle

(EDITOR'S NOTE: yes, we were running screenblank out of rc.local but removed it when we received this message.... I wish the solution could have been that easy!)

=======================================================================

From: uunet!ucsd.edu!mrwallen (Mark R. Wallen)

We had a similar problem with a 4/370 used for instruction. Under heavy loads caused by LISP, the machine would wedge and only a power cycle would get it to come back to life. (We have an ordinary terminal as the console, and BREAK would not get into the PROM monitor as it should (equivalent to L1-A).)

After swapping virtually everything, we finally tried another CPU of a higher revision and the problem went away. I believe that the original CPU was something like rev 17 and the one that worked was rev 24. (Don't hold me to the actual numbers, though I can probably get them if you wish). The first CPU swap we tried was about the same rev level as our failing one and it too failed. It seemed to require a revision quite a bit higher to fix the problem.

Mark Wallen

========================================================================

From: uunet!ica.philips.nl!geertj (Geert Jan de Groot)

Don't have the slightest idea yet why your 4/470's misbehave. Can you tell me:

a. the average load;
b. do you do anything weird with adding swapspace? this can kill a machine
c. I'm really interested in the output by 'ps' etc just before a machine goes down.
d. Feed a sun 3/60 from the same power plug as your 4/470's. 3/60 have a bad power supply and will go first if there is power trouble. (EDITOR'S NOTE: is this like bringing a canary into the coal mine ?!)
e. Junk on your console port?
f. Start using kadb, then start vmunix (b id()kadb), and try to get to kadb when the machine dies
g. I don't like sunscreens for console. disconnect keyboard and screen, and hook up a vt100 on ttya. see if that reacts when the machine dies. you need to reset (k2) the CPU to recognize the terminal as console.
i. Have you added boards? Double check jumpers on the backplane!
j. Are you using NFS? is the machine serving clients?
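
For item (c), one way to have 'ps' output from shortly before a hang is to log it at a fixed interval and force each snapshot to disk. The sketch below is only an illustration under assumptions: the function names, the log file name and the 30-second interval are arbitrary choices, and the `ps ax` arguments will need adjusting to the local system.

```python
import os
import subprocess
import time


def snapshot(command, log_path, timestamp):
    """Append one timestamped run of `command` to the log and sync it to disk."""
    result = subprocess.run(command, capture_output=True, text=True)
    with open(log_path, "a") as log:
        log.write("==== %s ====\n" % timestamp)
        log.write(result.stdout)
        log.flush()
        # Ask the OS to write the data out now, so that the last snapshot
        # has a better chance of surviving if the machine hangs.
        os.fsync(log.fileno())


def watch(command, log_path, interval_seconds, count):
    """Take `count` snapshots, sleeping `interval_seconds` between them."""
    for i in range(count):
        snapshot(command, log_path, time.strftime("%Y-%m-%d %H:%M:%S"))
        if i < count - 1:
            time.sleep(interval_seconds)


if __name__ == "__main__":
    # Assumed command and file name; adjust both for the machine in question.
    watch(["ps", "ax"], "ps-snapshots.log", 30, 3)
```

After a hang and reboot, the tail of the log shows what was running at the last interval before the machine stopped responding.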

Please keep me posted! Maybe I can think of something with more info.

Good luck,

Geert Jan

==========================================================================

From: "Scot Gardner" <uunet!sun1.ise.ufl.edu!scot>

I'm not a Sun genius; however, I have had the "mysterious" hang problem on my Suns for quite a while. I've been told it's an NFS bug where the machine gets a process in disk wait and hangs after a minute or so. I forget the exact problem, but I hear there is a patch for SunOS 4.1. If you use NFS quite a bit, ask Sun about this.

Scot Gardner

==========================================================================

From: uunet!wiau.medical-biophysics.manchester.ac.uk!rick

When your system crashes and you reset, you get the boot prompt ">".

Then type g0; this will put the contents of memory into the swap space.

Once the system is up, run "savecore"; this will give you vmunix.0 and vmcore.0 files.

You can then use utilities like ps (with the -k option) and adb (see the adb user's guide; there is a section on kernel crashes). This will tell you what was going on when you reset the system, and it's a load less effort than ps every 30 seconds.

If you find a software problem, Sun (or whoever) can take these files and look at them more thoroughly to determine the problem.
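
Once 'ps' output from the dump has been saved as text, counting processes by state can make a pile-up (for example, many processes in disk wait, status D) easier to spot. The following is a minimal sketch, assuming the output has a header line containing a STAT column and whitespace-separated fields before it; the function name and the sample listing are invented, not taken from a real dump.

```python
from collections import Counter


def tally_states(ps_text):
    """Count processes by the first letter of their STAT field."""
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return Counter()
    header = lines[0].split()
    stat_col = header.index("STAT")  # raises ValueError if there is no STAT column
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > stat_col:
            counts[fields[stat_col][0]] += 1
    return counts


# Invented sample, laid out roughly like "ps -ax" output.
SAMPLE = """\
  PID TT STAT  TIME COMMAND
    0 ?  D     0:01 swapper
    1 ?  IW    0:00 /sbin/init -
   95 ?  D     0:12 (nfsd)
  120 co R     0:00 ps -ax
"""

if __name__ == "__main__":
    for state, n in sorted(tally_states(SAMPLE).items()):
        print(state, n)
```

Only the first letter of the status is counted, so a status such as IW is grouped under I.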

Hope this helps.

RICK DIPPER, Wolfson Image Analysis Unit, rick@uk.ac.man.mb.wiau
Department of Medical Bio-Physics, University of Manchester 061-275-5158

======================================================================

From: uunet!edsr!jcn (Jim Niemann)

Are you sure you are not running the SUN DBE software? Here is patch info that seems to describe your problem very nicely.

Patch-ID# 100119-01
Keywords: asynch i/o system hangs SunDBE Sybase SQL
Synopsis: SunDBE patches. system hangs with more than 100 concurrent asynch calls

Date: 17-Oct-1990

SunOS RELEASE: 4.1 4.1PSR-A

Topic: SunDBE asynchronous i/o fixes

BugID's fixed with this patch: 1044422 1044424

Architectures for which this patch is available: sun4, sun4c, sun4/490

Obsolete By:

Problem Description:

1. 1044422

When more than 100 asynchronous i/o requests are outstanding, the system will hang. The problem will most likely occur when SunDBE is running with several Sybase SQL servers, although it can also occur with one SQL Server and one or more other applications that are also using asynch i/o. It can also happen when no Sybase SQL Servers are running, but several applications are doing large numbers of simultaneous asynch i/o's. The symptoms are that the system hangs and the following message is printed to the console:

SUNDBE: asynch i/o in use - this is not an error

The problem is caused by the number of outstanding asynchronous i/o's being greater than a fixed limit (100).

2. 1044424

The Sybase SQL Server hangs with status D (when displayed with ps(1)) and the SQL Server cannot be killed. This error occurs very rarely, on 4/260 and 4/280 machines, and only when the system is paging. The problem can be avoided by not using Unix files for the database and having only the SQL Server running on the machine (this avoids paging which caused the problem).

INSTALL:

SunDBE 1.0 must be installed before installing this patch. Do not install this patch if you are not using SunDBE 1.0. This patch is for the following machines running a SunDBE kernel:

patch
sub-directory   machine

sun4            Sun-4/260, Sun-4/280, SPARCserver 330, SPARCserver 370
sun4c           SPARCserver 1, SPARCserver 1+
sun4_490        SPARCserver 390, SPARCserver 470, SPARCserver 490

As root do the following:

1. copy patch files to appropriate directories

# cd <to appropriate patch sub-directory -- see above>
# cp dbe_asynch.h /usr/sys/sys
# cp dbe_asynch.o /usr/sys/sun4{c}/OBJ

(where sun4{c} = sun4 or sun4c)

2. A custom SunDBE kernel must be made and installed in order for the changes to take effect.

make/remake and re/install a SunDBE custom kernel as explained in Chapter 3 of the Sun Database Excelerator Release Manual

=======================================================================

From: uunet!umiacs.UMD.EDU!steve (Steve D. Miller)

Just because the system doesn't dump core on its own doesn't mean that you can't make it dump core. Uncomment the savecore stuff in /etc/rc.local (or rc.whatever, I can never keep straight which one has savecore in it, and I don't have a Sun handy right now to check). Then check out the PROM manuals and figure out how to stuff something nasty (0 or -1/ffffffff) into the PC. I'd be very surprised if there isn't some way to do this on a SPARC machine; I know that you can use the "r" PROM command on Sun-3s for this purpose, 'cause I've done so many a time. I just haven't yet needed to do this on a SPARCstation. -Steve

===========================================================================

From: uunet!megatek!felixw (Felix Wisgo CAE account)

Danielle/Rick--

When you replaced memory and cpu boards, did you also replace the SIMMs on the boards??? It seems to me, I had a problem similar to yours a few years ago and it boiled down to a few bad memory chips. The architecture was different (Sun-3s), but it may be worthwhile to try swapping out memory chips. Whatever the problem, I'd be interested in hearing your solution.

========================================================================

From: Doug Peterson <uunet!USAN.consult.com!doug>

Danielle -

I have seen a similar problem before, when I had intermixed SPARCs and 68020s. Two of the SPARCs were YP slave servers, and would take over the YP queries (now DNS), and eventually hang, because they couldn't get updated information from the non-SPARC master.

Also, I've seen some discussion about the P-MEG problem severely impairing performance when installed RAM went above 16MB. However, I thought that this was fixed in 4.1.

I doubt that your problem is electrical/electronic. The power supplies used in Suns are excellent power conditioners.

=============================================================================

From: uunet!einstein.eds.com!wazir (Deborah Wazir)

A few months ago our server crashed due to a power failure, and wouldn't boot. It would get to the file system checks and then exhibit the symptoms described in Danielle's posting to Sun-Managers. We swapped out everything, but the problem kept coming back.

Finally, we figured out what the problem was. When the server crashed, the other machines on the network started sending packets to the server to see if it was up yet. They did this because each machine on the net NFS-mounts about 12 filesystems from the server, and the server NFS-mounts the /home partition from each machine on the net. We have about 30 machines altogether. So the longer the server was down, the more frequently the other machines sent out their broadcasts. This is referred to as a "broadcast storm". When the server tried to boot, it was fine until it got to the "ifconfig" line. As soon as it initialized its ethernet interface, and gave it an IP address, it was able to start answering the other machines, and that's what was bogging it down. It was like all the machines were trying to re-establish NFS connections at the same time, and the server was so busy trying to fend them off that it never got to finish booting.

We noticed that if we disconnected the ethernet cable from the server, it would finish booting. The lights on the back would speed up immediately.

What we ended up doing was halting all the machines on the network; we were then able to boot our server easily. After that, we started up the other machines one at a time.
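
The recovery order described above (server first, then the clients one at a time) amounts to a staggered start schedule. The sketch below just computes such a schedule; the function name, the host names and the 60-second gap are hypothetical and not taken from the site in question.

```python
def staggered_schedule(hosts, gap_seconds, server_ready_at=0):
    """Return (host, start_time) pairs, one host every gap_seconds.

    Times are in seconds; server_ready_at is when the server finished booting.
    """
    return [
        (host, server_ready_at + i * gap_seconds)
        for i, host in enumerate(hosts)
    ]


if __name__ == "__main__":
    # Hypothetical client names; start one per minute once the server is up.
    clients = ["client%02d" % n for n in range(1, 6)]
    for host, start in staggered_schedule(clients, 60, server_ready_at=120):
        print("%s: power on at t+%ds" % (host, start))
```

Spacing the starts out keeps the clients from all trying to re-establish their NFS mounts at the same moment.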

Let me know if this helps you. If it turns out to be another problem, I'd be interested in hearing about that, too.

Good luck --
Deborah S. Wazir

From: kwthomas@nsslsun.gcn.uoknor.edu (Kevin W. Thomas)

I was the one that described the problem, and thought it was an OS bug. Unfortunately, about a day or so later, it started all over again. I was running X (MIT version, patchlevel 18), Motif 1.1, and an application that was developed elsewhere. Anywhere from a few seconds to a few minutes after starting the application, the system would hang. Running a "top" at one second intervals shows nothing unusual. I even had a telnet session from another workstation, and set my priority above everything running, and that froze too.

On one of my attempted crash dumps, I got a couple of characters followed by an infinite loop of the message "Spurious interrupt at processor level 14". No amount of L1-A's or disconnecting the keyboard cable would interrupt those messages. I had to turn the cpu power off and on to clear it. At that stage, I wouldn't let anyone run X on that system. I also put in a service call. Since the cpu board was replaced, I haven't had any problems.

Kevin W. Thomas

-- that's the end --


