SUMMARY: Sun 690MP hangs! help!

From: Jian Ye (ye@software.org)
Date: Thu Apr 28 1994 - 21:24:11 CDT


Hi,

Sorry for the late summary. This is just a preliminary summary, since it's still
early to claim the problem has been fixed. The short story is:
    I almost broke down and decided to upgrade the OS, then I found the patch
    100645-01 4.1.2;4.1.3: swapon with very large swap files hangs
    and I said to myself, that could be it.....

I got 15 replies to my original question. I would like to thank the
following people for their help. All responses are included, and the
original post is attached at the end. The list is in FIFO order.

     <rls02%philip%ALBNYDH2.bitnet@UACSC2.ALBANY.EDU> (Ronald L. Stamp)
     lees@cps.msu.edu
     sdillman@tech.concorde.com (Sean Dillman)
     david@srv.PacBell.COM (David St. Pierre)
     Adam Shostack <adam@bwh.harvard.edu>
     hoogs@SynOptics.COM (Tim Hoogasian)
     kwthomas@nsslsun.nssl.uoknor.edu (Kevin W. Thomas)
     s9ubxt@fnma.COM (Ben Taylor)
     antonson@umiacs.UMD.EDU (Todd S. Antonson)
     Robert Ogren <rmo@teltechlabs.com>
     "James D. Watson" <JW1675A@american.edu>
     roag@roses.rockwell.com (Gino Roa)
     jol@crimson.emsr.att.com (538280000-Ledva)
     weitzel@burke.com (David Weitzel)
     stern@sunrise.East.Sun.COM (Hal Stern - NE Area Systems Engineer)

Most of you suggested upgrading to 4.1.3, which is a good suggestion. However,
I am a little hesitant to do that because other things might start to break
and I really don't want more problems. As I mentioned, another 690MP
running 4.1.2 has never had this kind of problem, so I believe I can fix this
puppy without an upgrade. What I need to do is determine what exactly
caused the hang. I suspected the network, hardware, device drivers, the Unix
kernel, and even intruders, and tried a dozen things: running a script
every 5 minutes to record what happened before the hang, taking the SCSI disk
off-line, changing Ethernet cables, applying a dozen patches. Unfortunately,
none of them worked. The machine kept dying at random hours, mostly on weekends
with no load or a very light load. When I was lucky enough to actually witness
its death, it was not pretty: xload showed a dark curve shooting up to the sky
exponentially.
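
A periodic snapshot of this kind only helps if the last record reaches the
disk before the machine freezes. A minimal sketch in C, assuming it is run
from cron every five minutes. The function names, the log format, and the use
of getloadavg() (a BSD/glibc extension) are illustrative assumptions, not the
script that was actually used:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Format one record: UTC timestamp followed by the 1, 5 and 15 minute load.
 * Returns the record length, or -1 if the timestamp could not be built. */
int format_snapshot(char *buf, size_t size, time_t when, const double load[3])
{
        char stamp[32];
        struct tm *tm = gmtime(&when);

        if (tm == NULL ||
            strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tm) == 0)
                return -1;
        return snprintf(buf, size, "%s load %.2f %.2f %.2f\n",
                        stamp, load[0], load[1], load[2]);
}

/* Append one record to `path` and force it to disk before returning,
 * so the newest record is not lost in a buffer when the machine hangs. */
int log_snapshot(const char *path)
{
        char line[128];
        double load[3];
        int fd, len, ok;

        if (getloadavg(load, 3) != 3)   /* assumption: extension is available */
                return -1;
        len = format_snapshot(line, sizeof line, time(NULL), load);
        if (len < 0 || (size_t)len >= sizeof line)
                return -1;
        fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
                return -1;
        ok = write(fd, line, (size_t)len) == len && fsync(fd) == 0;
        close(fd);
        return ok ? 0 : -1;
}
```

Calling log_snapshot() with a path on a local disk (not an NFS mount, which
could itself be part of the hang) keeps the record independent of the network.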

Then one day (two weeks ago), while looking at the half dozen core dumps I
had collected, I found an interesting fact when I ran diagnostics using sps
with the -Alfgw options. The swapper died waiting on the same signal as the
last running process, i.e.

Ty User Status Fl Nice Prv Shr Res %M Time Child %C Proc# Command
   root X489F4 U 0+ 0 0 0 0.0 0 Unix Swapper
      | X489F4 7416+5972 652 1 0.0 2200 /interleaf/ileaf5/sun4/bin/ileafbin/ileaf -display 129.28.5.39:0.0 -mousebuttons sme

This showed that both the swapper and the Interleaf process were waiting on
the same signal, X489F4. All the core dumps showed similar behavior, with
various, seemingly random processes.
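
Checking this by eye across several dumps gets tedious. A rough sketch in C
that counts how many lines of a saved process listing mention a given wait
value. The function names are made up for illustration, and it assumes the
value appears as a whitespace-separated token, as in the two lines above; it
knows nothing else about the sps output format:

```c
#include <string.h>

/* Return 1 if `token` occurs in `line` as a whole whitespace-delimited word. */
static int has_token(const char *line, const char *token)
{
        size_t n = strlen(token);
        const char *p = line;

        if (n == 0)
                return 0;
        while ((p = strstr(p, token)) != NULL) {
                int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
                int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t';
                if (starts && ends)
                        return 1;
                p++;
        }
        return 0;
}

/* Count the lines of `listing` (newline-separated text) containing `token`.
 * Lines longer than the local buffer are truncated before the search. */
int count_waiters(const char *listing, const char *token)
{
        char line[512];
        int count = 0;

        while (*listing) {
                size_t len = strcspn(listing, "\n");
                size_t copy = len < sizeof line - 1 ? len : sizeof line - 1;

                memcpy(line, listing, copy);
                line[copy] = '\0';
                count += has_token(line, token);
                listing += len;
                if (*listing == '\n')
                        listing++;
        }
        return count;
}
```

A count of two or more for the swapper's value in every dump would match the
pattern described above.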

So I concluded that the swapper died and caused the system to hang. I checked
the patch list and guess what? SUN has a patch for it (as I thought :) ).

100645-01 4.1.2;4.1.3: swapon with very large swap files hangs

and my machine happens to have a 100 meg swap file (a difference I forgot to
mention). So I applied the patch with skepticism, since I had failed so many
times. And wouldn't you know... the machine didn't die last weekend, for the
first time in two months.
Well, it's still too early to claim the problem is fixed. It might have been
just a lucky weekend. But I would like to give a preliminary summary. If
there are any new developments I will post another summary.

The moral of this is that one has to pay attention to every little difference
in the system configurations, since any of them can be your true enemy.

The following are the detailed responses:

From UACSC2.ALBANY.EDU!ALBNYDH2.BITNET!rls02%philip (Ronald L. Stamp)

  Well,

  I'm no wizard but suffered a similar problem: had a 690MP with 4 Cypress
  modules and 4.1.2 on IPI disk. Trying to move to 4.1.3 on SCSI
  disk, found that a loadable module (Central Data SCSI Terminal Server)
  behaved badly, machine hung, etc.

  Sun finally mentioned "SunOS 4.x is unsupported with multiple
  processors". It turned out that we also had a 670 with 2 TI SuperSparc
  modules which began acting the same way when we tried to put the
  loadable module on that machine.

  Result: took 1 SuperSparc from 670 and replaced the 4 Cypress in
  the 690. Neither machine hangs, and the 690 is now MUCH faster!?$$(&%

  I do wish that they had mentioned this while taking our money for
  the multiple processors, but at least now it has been stable over
  two months.

  Hope this is some help

> Thanks Ron, I was half an inch away from taking a saw and cutting out the
> other three processors

From cps.msu.edu!lees Wed Apr 13 09:40:02 1994

SunOS on the 6xxMP has a flaw in the process scheduling and
memory management such that any process that tries to grab
all of memory will hang the machine for a while (ten or
fifteen minutes). This is not fixed until the Solaris 2.3
release of the operating system. The following simple program
exercises this problem:

 /* chunk.c, lees@pixel.cps.msu.edu, 14 May 1992
  *
  * Hang a 690 MP by trying to grab a ridiculous amount of memory.
  */
 #include <stdio.h>
 #include <stdlib.h>     /* malloc(), exit() */
 #include <string.h>     /* memset() */

 /*
  * Experiment with SIZE
  */
 #define SIZE 204857600

 int main(void) {
         char *p;
         int i = 0;

         printf("Using %d as SIZE.\n", SIZE);
         while (1) {
                 printf("%d ", i++);
                 fflush(stdout);
                 /* Blocks are deliberately never freed, so usage keeps growing. */
                 if (!(p = malloc((unsigned)SIZE))) {
                         printf("\nOut of memory after %d calls to malloc.\n", i);
                         exit(1);
                         }
                 else {
                         /* Write to every byte so each page is really allocated. */
                         memset(p, 1, (unsigned)SIZE);
                         }
                 }
         }
----------
 John Lees, Pattern Recognition & Image Processing Laboratory,
 Systems Analyst & Lab Manager, Department of Computer Science,
 Michigan State University, A714 Wells Hall, East Lansing,
 MI 48824-1027, lees@cps.msu.edu, http://web.cps.msu.edu/~lees/
 Member, League for Programming Freedom. lpf@uunet.uu.net for info.

From: sdillman@tech.concorde.com (Sean Dillman)

I hate to sound like Sun, but I'd say the best patch would be going
to 4.1.3U1.

-Sean P. Dillman
 Senior Technician
 The Concorde Group, Ltd.

From: david@srv.PacBell.COM (David St. Pierre)

i had a 670/4.1.3_U1 which exhibited similar problems. i would suggest

100726-12 (or greater) and
101408-01 (or greater)

Patch-ID# 100726-12
Keywords: system, watchdog,panic, faults, mbus-to-sb, asynchrono, multiple, hang
Synopsis: SunOS 4.1.3: sun4m jumbo patch for kernel performance and memory bugs
Date: Nov/23/93

SunOS release: 4.1.3, 4.1.3C

---- much text deleted ----

From: Adam Shostack <adam@bwh.harvard.edu>

        Why not install 4.1.3.1, or whatever the latest shipping
version of SunOS is? It should include most patches.

Adam

From: hoogs@SynOptics.COM (Tim Hoogasian)

You might consider loading 4.1.3_u1(b).

-Tim

From: kwthomas@nsslsun.nssl.uoknor.edu (Kevin W. Thomas)

The first thing I would do is to upgrade your system to run 4.1.3. The
second thing I would do is to install the latest version of patch 100726
(last time I checked, it was at version 13). This fixes a large number of
sun4m problems including system hangs. If your 690 is a model 51, then
you'll also need patch 101408-01.

(As an alternative, you could upgrade to 4.1.3_U1. In this case you would need
the 4.1.3_U1 version of 100726. I don't know the number off hand. 101408-01
might still be ok, as it is a shell script.)

        Kevin W. Thomas

From: s9ubxt@fnma.COM (Ben Taylor)

This is during your backups, isn't it? What are you using?

> No, I am afraid not. It happens totally at random, but mostly on weekends, so
> it can keep me busy all the time. :)

This sounds like an automounter hang, but I don't have enough info.
If not, it pretty much indicates to me that your network has gone to lunch.
You can set the kernel variable MAXSLP to 0x7fffffff which will keep
things from getting swapped out after the default of 20 seconds.

We had a very similar problem, but it involved an NC400 ethernet board.
The problem turned out to be the board's software not handling
giant packets (thoughtfully sent out by our cisco routers under load.)
I notice you run neither the lockd nor the inetd patch (inetd I think
is 100178-08, lockd is 100075-11).

> Not the automounter. I suspected that too, and even installed the patch.

Laugh. Yes, this is a joke. Boneheads at Sun are pretty useless to me
at this point. I'd rather download the patches and figure out what
I need. I used to hate installing patches till I wrote a really cool
script to do the installs, and create a backout script just in case.

> Agreed, Sun has the lousiest support, and the funny thing is that their
> stuff keeps breaking. Sun will follow the path of IBM.

Ben Taylor
Sr. Systems Administrator
bent@fnma.com

From: antonson@umiacs.UMD.EDU (Todd S. Antonson)

> Thanks buddy, I knew you would come to the rescue. Your response is
> the most detailed as always, and it's not difficult to see why. :)
> Now I know why you left SPC (just kidding). I will try your
> method of upgrading when I get a chance :) no kidding. This time I can be
> sure what Sun is going to tell me when the machine breaks. No
> offence, SUN!!!

Woah!

Try to install as few patches as possible. You never really know when
some of them will not agree with each other. I know Sun will tell you
that they work together, but I don't have a lot of faith in what they
say.

> Did you hear that SUN!!!

Well. Those drivers have been installed for quite some time. I doubt they
would suddenly start acting up.

Here is something you can try. Do you have any newer revs of the OS on CD?
The latest rev is 4.1.3 rev B (also known as 4.1.3_U1). I am running that
on a couple of LX's over here. I am still using the old vanilla 4.1.3 /usr
though and there don't appear to be any problems (you may have to replace
/usr/lib/ld.so for things to work properly though -- I had to).

Grab just the kvm and the kvm/sys stuff from the CD. It should be under
exec or something like that. Poke around the CD until you get comfortable
with the contents and where stuff is located. The files on CD are tar files
so you can use `tar tf' to get the contents of the files. It is important
to dump the sys file under the kvm directory tree (so you get kvm/ and kvm/sys/).
You could build a whole new /usr partition as well if you wanted. Just
grab the files you need to create a new one. I know I did not install
a lot of things from the OS. Now, if you become an expert in the contents
of the CD, you can build your own /usr without running that stupid suninstall
program and boom! You have upgraded the whole thing.

Next, build a kernel in that new kvm tree and install it as /vmunix on HYDRA
and reboot it. It should come up without problems although I am a little
less confident since you are running 4.1.2 and not 4.1.3. You may need to
grab an extra disk partition and install a new /usr as well since HYDRA is running
4.1.2. Just add it in temporarily and mount it as /usr (Or just upgrade HYDRA
and see what happens -- Can't be worse than frequent intermittent outages.).

If you still experience problems, then I would guess there is a hardware
problem (A new OS should be proof enough!). You may need to run a bunch of
diags to try and figure out what the problem is. Typically, the hardware
support (aren't support contracts great?) crew will just start swapping parts
until the problem goes away.

I would recommend you try to upgrade HYDRA asap. SunOS 4.1.2 is pretty old.
I haven't heard of any software that worked under 4.1.2 and stopped working
under 4.1.3. At most, you would need to rebuild something or get an update
from a vendor. If you have a disk or two available, you could install
a new OS on that disk and still have the old OS available if you run into
problems. MAKE SURE YOU CAN BOOT HYDRA DISKLESS OFF ANOTHER SERVER IN CASE
YOU NEED TO RESTORE FOR ANY REASON.

Hope this helps.

From: Robert Ogren <rmo@teltechlabs.com>
I really hate to say this but I will. I was running into similar problems
with my 670 under 4.1.2. When I moved to 4.1.3 all seems to work.
I think Sun had some weird problems with 4.1.2 in MP mode. I've been
running with 4.1.3 for 1.5 years with no problems and recently went to the
Ross HyperSparcs with no problems. I know upgrades are a pain but
one has to weigh the frustrations.

Later
Rob

From: "James D. Watson" <JW1675A@american.edu>

Hi Jian --- can't really help with the 690MP hangs (sorry!) unless
you have compartmented mode workstations from Sun on your net. Sun's
CMW machines have an error in them (not sure where): upon boot-up,
a CMW (properly) sends out a broadcast indicating what classification
it runs at. Our SunOS4.1.3 SPARC2s and 690MPs were crashing HARD when
they received this broadcast. Sun fixed the problem in the CMW OS.

What I really wanted to point out: you write that when your machine
crashes it's still pingable. The default ICMP version of ping (i.e. the
standard version of ping) can actually be answered by the network interface
card, so just because your machine is pingable (as you now know) doesn't
mean the CPU is healthy. In the "SysAdmin" periodical last year, some
fellow out of (a school in California, I don't remember which) published
code called "newping" which actually attempts a TCP connection: if the
TCP connection is successfully made, then you know your CPU is up and
running AND you know the network path is still up. It also reports
back if the network is up but your CPU is down, and behaves in general
like ping(1) in other regards.
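
The idea can be sketched in C as follows. This is not the published newping
code; the names tcp_probe and probe_result are made up for illustration. An
ICMP echo reply does not require any user process to be scheduled, and the
kernel also completes a TCP handshake on a listening socket before the server
process runs, so this sketch additionally waits for one byte from the server.
It therefore only makes sense against a service that speaks first, and the
choice of that service is an assumption left to the caller.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

enum probe_result {
        PROBE_ALIVE = 0,        /* connected and the service sent data */
        PROBE_SILENT = 1,       /* connected, but no data within the timeout */
        PROBE_NO_CONNECT = 2,   /* connection refused, failed or timed out */
        PROBE_BAD_ARGS = 3      /* address is not a dotted-quad IPv4 address */
};

/* Wait up to `secs` seconds for fd to become readable, or writable when
 * want_write is non-zero. Returns > 0 when ready, 0 on timeout. */
static int wait_fd(int fd, int want_write, int secs)
{
        fd_set set;
        struct timeval tv;

        FD_ZERO(&set);
        FD_SET(fd, &set);
        tv.tv_sec = secs;
        tv.tv_usec = 0;
        return select(fd + 1, want_write ? NULL : &set,
                      want_write ? &set : NULL, NULL, &tv);
}

enum probe_result tcp_probe(const char *ip, unsigned short port, int secs)
{
        struct sockaddr_in addr;
        enum probe_result res = PROBE_NO_CONNECT;
        int fd, err = 0;
        socklen_t len = sizeof err;
        char byte;

        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
                return PROBE_BAD_ARGS;

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return PROBE_NO_CONNECT;
        /* Non-blocking, so a dead host cannot stall the probe itself. */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0 ||
            (errno == EINPROGRESS && wait_fd(fd, 1, secs) > 0 &&
             getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 &&
             err == 0)) {
                /* Handshake done; now require a byte from the server process. */
                if (wait_fd(fd, 0, secs) > 0 && read(fd, &byte, 1) == 1)
                        res = PROBE_ALIVE;
                else
                        res = PROBE_SILENT;
        }
        close(fd);
        return res;
}
```

For example, tcp_probe("192.0.2.10", 13, 5) would try the daytime service on
a hypothetical host with a five second timeout. A PROBE_SILENT result from a
host that still answers ping would point to the kind of hang described here.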

He published a follow-on article too with a few (minor) modifications.
I modified it a bit myself to connect to a different TCP port, but otherwise
it has served us quite well.

If you want the reference, I can get that (it's at work; I'm at home) so
e-mail me back. I can't just upload the software because, since I worked
on it at work, they consider it sensitive...

Good luck,
Jim

From: roag@roses.rockwell.com (Gino Roa)

        My suggestion is upgrade to SunOS 4.1.3 as soon as possible. We
        had the exact same problems on our Sun490's. The 4.1.3 upgrade
        was a cure-all solution for us.

        Gino Roa
        Network Manager
        Rockwell International

From: jol@crimson.emsr.att.com (538280000-Ledva)

Howdy,

This may be a long shot but I had a 670 that kept hanging. I upgraded
the cpu from Rev7 to Rev8 and that took care of the problem.
Good luck!

Joe Ledva
crimson!jol@aluxpo.att.com

From: weitzel@burke.com (David Weitzel)

Jian, about your 690 hanging. Are you by any chance running
backups or dumps near the time your machine hangs?

-Dave Weitzel
weitzel@burke.com

From: stern@sunrise.East.Sun.COM (Hal Stern - NE Area Systems Engineer)

this looks very much like you're exhausting the kernel memory. it's
a bug in 4.1.2, and is fixed in 4.1.3_U1. try getting patch
100330 (or whatever superseded it). this increases the size of
kernel memory and usually fixes the problem. the magma driver
may be taking a large chunk of kernel memory for printer data,
and that could be making the problem worse -- the more it takes,
the less you have so you run out more quickly.

--hal

Original post:

>
> Hi Wizards!
>
> I am about to give up on this. I have a Sun 690 with 4.1.2 that keeps hanging
> once or twice a week. We reboot our machines once a week, and that server
> can't even make it halfway. 95% of the time it hangs over the weekend or at
> night, at very strange hours. When it hangs it is still pingable, and xload
> sometimes shows its load shooting up to the sky all of a sudden. Several core
> dumps have been taken; they showed everything swapped out. The answers I got
> from Sun support are lousy: "Oh, did you install patch #xxxxx? We think the
> problem is solved with those patches." Then they close the call in 24 or 48
> hours. The next time the system hangs, they give me another list of patches.
> Since they have several thousand of those little patches, they don't think
> they will run out of them. Too bad, I am turning to this list for help. Bye Sun!
>
> The system has following patches installed thanks to Sun support!
>
> 100173-10 SunOS 4.1.1;4.1.2;4.1.3 : NFS Jumbo Patch
> 100249-08 SunOS 4.1.1;4.1.2,4.1.3: automounter jumbo patch
> 100359-06 SunOS 4.1.1;4.1.2;4.1.3: streams jumbo patch
> 100537-01 SunOS 4.1.2: async io peaks can hang system
> 100539-01 umount of busy hsfs filesystem causes panic data fault
> 100575-03 SunOS 4.1.2: MP machines do not perform as well as 4XX equivalent
> 100584-03 system freezes using loopback interface,BSD4.2 keepalive
> 100584-05 system freezes using loopback interface,BSD4.2 keepalive
> 100623-03 4.1.2;4.1.3: UFS jumbo patch
> 100804-02 SunOS 4.1.1,4.1.2,4.1.3: TCP socket and reset problems
> 100804-03 SunOS 4.1.1,4.1.2,4.1.3: TCP socket and reset problems
>
> The problem did not change a bit after these patches. There is another server
> with no patches installed that has never had any hang or reboot. After careful
> comparison, I found that there are some hardware and software differences.
>
> Hardware: the problem machine has 64MB RAM, half as much as the other one,
> though I don't think that should cause the system to hang.
>
> Software: Two loadable kernel drivers
>
> Id Type Loadaddr Size B-major C-major Sysnum Mod Name
> 2 Drv ff053000 12000 60. MAGMA Sp Adapter
> 1 Pdrv ff02e000 25000 59. uShare EtherTalk
>
> One is a MAGMA Sp parallel printer port and the other is a uShare EtherTalk
> printer server for the Macs.
>
> I now think the problem is indeed one of the kernel drivers, but I can't tell
> which one, and I want to avoid a finger-pointing situation.
>
> So my question at last is:
>
> Can anyone give me a pointer on how to check for problems with modloaded
> drivers from a core dump? Since the hang is so intermittent and leaves no error
> message, the only way to generate a core dump is L1-A and sync.
>
> Any input is greatly appreciated. I apologize if you have seen this before; as
> you can see, I am still on the learning curve.
>
> Best Regards!
>
> -- Jian
>


