Subject: SUMMARY: Machine hangs every night (Longish .... and late )

From: vispi@lgc.com
Date: Thu Nov 07 1991 - 22:02:45 CST


Hi Everybody
Pardon the late summary. Our problem was conclusively fixed yesterday. My original
posting follows:

==>
==>
==>
==>Hi sun managers
==>
==>We run the Legato backup product Networker every night. Our backup
==>server is a Sparc 4/370 with 32 Megs of memory and 137 Megs of swap, running 4.1.1.
==>Every night around 3:00 am the machine JUST STOPS.
==>Yes, it just stops, there is nothing in the messages file, nothing in
==>the Legato backup logs. We have a cron job that kicks off every night
==>at 3:15 am.
==>
==>15 3 * * * find / -name .nfs\* -mtime +7 -exec rm -f {} \; -o -fstype nfs -prune
==>
==>This is a standard cron entry for all sun machines.
==>
==>In the morning I get the following results.
==>
==>rsh: no response.
==>telnet: Trying <IP address>
==> Connected to <machine name>
==> Escape character is '^]'.
==>
==>
==> NOTE: I don't get the login prompt.
==>
==>Console has frozen, and any windows left open on the machine from a workstation
==>are all frozen. There is nothing I can do except Break and reboot.
==>This has happened consistently for the past 3 days.
==>Any help will be greatly appreciated. Thank you in advance.
==>
==> -Vispi Dumasia
==> vispi@lgc.com

-------

Many thanks to:
kwthomas@nsslsun.nssl.uoknor.edu (Kevin W. Thomas)
cdr@acc.stolaf.edu (Craig D. Rice)
Chris Maio <chris@boxhill.com>
John R. Kilheffer <amp19263@garfield.amp.com>
earl@division.cs.columbia.edu (Earl Smith)
jwseave@srv.pacbell.com (Jim Seavey)
jeanneg@iscnvx.lmsc.lockheed.com (J Y Gee)
Bruce Arden <arden@tcom.stc.co.uk>

The common advice was that it's a swap space problem. While it is true that NetWorker
consumes large quantities of swap, for a parallelism of 4, 100-plus Megs of swap is enough.
I beefed up the swap and reduced the parallelism to 2, but the problem did not go away.

It was suggested that I was running out of mbufs. I increased those, to no avail.

There were no non-standard devices on the machine, so we were clean on that count.

Bruce Arden suggested it might be a disk problem: sometimes the disk fscks clean, but any
access to a certain section can send the kernel into an endless loop. He suggested I dump,
newfs (I reformatted) and restore the disk/filesystem. I think this was the cause, since
I could replicate the hang by running a fast find on that filesystem. Incidentally, that's the
same filesystem on which the NetWorker indexes resided.

Something else I discovered was that the "Hidden Options" menu in nsradmin
has a sub-option "Release State". Somehow it was set to Test instead of Production,
as it should be. I changed it back to Production. If anyone is interested, email me and
I'll send details.

        -Vispi
        vispi@lgc.com

Individual responses follow.
----

Make sure you have "savecore" enabled in /etc/rc.local. When you come in in the
morning and it is frozen, do an L1-A and type "g0". This will generate
a "panic: zero". It will also make a crash dump of the system. You
can do a "ps lawxk vmunix.# vmcore.#" to see what the system looked like
before you forced the dump. The "crash" utility can be used to analyze
crash dumps, though I'm not very familiar with it.

        Kevin W. Thomas
        National Severe Storms Laboratory
        Norman, Oklahoma
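
A quick way to confirm the first step in Kevin's advice is to look for an
uncommented savecore line in the rc file. The following is a minimal sketch,
not a vendor-supplied check: the function name savecore_enabled is
hypothetical, and it only assumes that a disabled entry is commented out
with a leading "#".

```shell
#!/bin/sh
# savecore_enabled [FILE]
# Succeeds (exit 0) if FILE contains at least one non-comment line that
# mentions savecore. FILE defaults to /etc/rc.local (an assumption taken
# from the advice above).
savecore_enabled() {
    rc_file="${1:-/etc/rc.local}"
    # Drop lines whose first non-blank character is '#', then look for
    # savecore in whatever remains. A missing file counts as "not enabled".
    grep -v '^[[:space:]]*#' "$rc_file" 2>/dev/null | grep 'savecore' >/dev/null
}

# Example:
#   if savecore_enabled; then echo "savecore is enabled"; fi
```

This only tells you the line is active, not that the crash directory exists
or has room for a dump.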

=====

Yes yes yes! We are having this exact same problem... We have three
SS2 fileservers, 32MB RAM, 1GB disk. The only server we see this problem
with is the one running Networker software. This server also has
an additional 1GB disk and a PrestoServe board...

We have applied patch 100330-02.beta, which took care of a different kind
of hanging problem, but just recently we started seeing behavior
exactly as you . . .

We'll keep you informed,
Craig
--
Craig D. Rice UNIX Systems Specialist/Network Analyst
cdr@acc.stolaf.edu Academic Computing Center, St. Olaf College
+1 507 663-3631 1520 St. Olaf Avenue
+1 507 663-3549 FAX Northfield, MN 55057-1098 USA

=====

Vispi,

It sounds like you may be running out of mbufs. There are a few patches
from Sun that address system hang problems.

One way to hang SunOS is to run NetWorker with its "parallelism"
attribute set to a large number (more than 6 or so). If yours is larger
than 4, try lowering it (see nsradmin(4)).

Chris

====

Two things....

Check the amount of swap space available and keep a close eye on it. We've
found that Legato can consume huge amounts of swap especially if you are
backing up multiple clients and have compression enabled. Further, when you
run out of swap, instead of failing Legato seems to just hang around until
more becomes available (and then immediately eats that up as well). Needless
to say, since there is no swap available, you can't get logged in to either
kill the Legato process or temporarily add more swap space... catch-22.

Second, we also found that our server went nuts (and out to lunch in most cases)
if we ran too many simultaneous "client" streams. The number of simultaneous
processes which can feed the Legato tape driver process can be varied from 1
to 10. We've found that around 3 or 4 everything works fine and no one goes
out to lunch. Below that, the backup takes longer but exhibits less server
and network load. Above that, it trashes both.

Hope this helps.

John Kilheffer
Supervisor, Operations / Workstation Group
AMP Incorporated
amp19263@garfield.amp.com
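
Keeping "a close eye" on swap, as John suggests, could be scripted along
these lines. This is a sketch under stated assumptions: the function names
are hypothetical, and the swap summary line is assumed to have the form
shown in the comment (which, as far as I know, is what "pstat -s" prints on
SunOS 4.x; adjust the pattern if your output differs).

```shell
#!/bin/sh
# swap_avail_kb LINE
# Prints the "available" figure in KB from a swap summary line of the
# ASSUMED form:
#   4840k allocated + 1456k reserved = 6296k used, 25548k available
swap_avail_kb() {
    printf '%s\n' "$1" |
        sed -n 's/.*[ ,]\([0-9][0-9]*\)k available.*/\1/p'
}

# swap_low LINE THRESHOLD_KB
# Succeeds if the available swap in LINE is below THRESHOLD_KB.
# Fails if the line cannot be parsed.
swap_low() {
    avail=$(swap_avail_kb "$1")
    [ -n "$avail" ] && [ "$avail" -lt "$2" ]
}

# Example (run from cron every few minutes during the backup window):
#   swap_low "$(pstat -s)" 20000 && echo "swap low on `hostname`"
```

Since a machine that is out of swap may not be able to start new processes,
the warning threshold should be set well above zero.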

=====

Ah, but are there any non-standard devices on the machine that hangs?
I'll relate a problem I had, and the solution that worked. I had one of my
IPCs that started hanging every morning in approximately the same way yours
does. It died at 3:17 am when I had that same standard cron job running at the
standard 3:15 am Sun ships with the system. That IPC has an optical disk
drive on it. It seems there was a problem with the file system in the optical
drive. It wouldn't fsck. Nobody was coming to me with problems accessing
their files. The optical drive was not in /etc/fstab, so it didn't try to
fsck it at boot time. However, people could put in a standard optical disk,
which probably stayed in there for weeks (except when the faculty member who
owns the machine came in with a disk from his optical drive at home, and tried
to transfer files). People would mount the disk and do some work for a while,
then go home, have dinner, work from home for a while, etc. At 3:17 am,
when nobody was working, the system would crash. The line you cite from
the cron job checks all mounted file systems for files it should delete.
Unfortunately, if you have a messed up file system mounted, you are dead.

earl smith
earl@cs.columbia.edu
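
Because the cron entry walks every mounted file system, one way to narrow
down which mount is at fault is to run a harmless variant of it against one
file system at a time. The sketch below is an illustration, not the stock
cron line: it prints instead of deleting, and it swaps "-fstype nfs -prune"
for "-xdev" so the walk stays on the file system it started on. The function
name list_stale_nfs is hypothetical.

```shell
#!/bin/sh
# list_stale_nfs DIR [DAYS]
# Dry-run variant of the nightly cleanup: prints the .nfs* files under DIR
# that were last modified more than DAYS days ago (default 7) and removes
# nothing. -xdev keeps find from crossing into other mounted file systems.
list_stale_nfs() {
    find "$1" -xdev -name '.nfs*' -mtime +"${2:-7}" -print
}

# Example: try each local mount point in turn and note which one hangs.
#   list_stale_nfs /
#   list_stale_nfs /usr
#   list_stale_nfs /home
```

If the walk of one particular mount point never returns, that file system
is the one to unmount and check before the next 3:15 am run.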

====

i've been using networker for some time over several domains and have not had the problem that you describe; however, when i began using the product i did some experimenting in which i was able to produce symptoms that you describe.

i found that the parallelism factor can cause a great deal of problems if set too high; e.g., greater than 4.

i tried running things at all the possible settings and found that numbers larger than 4 (i think this was due to the volume of backups i do - in excess of 65 workstations and 4 LARGE servers) caused so much traffic over the net that the ethernet card on my networker server could not cope with the volume of traffic and basically just came to a screeching halt.

i have about six different groups that cover all of these machines. depending on the volume of data that needs to be stored, the time taken to run each of the groups can vary quite a bit. i've had to work hard at tuning my run times for each of the groups when doing full backups so that i do not have too much data attempting to write to the same address (my networker server) on the network at the same time.

while the swap space you have is good, my experience is that the software really needs a great deal of memory to provide any kind of throughput at all. i have 56 MB on a sun 4/370 and, for the most part, this is ok. my immediate plan is to move networker to a dedicated sparc2 with as much memory on it as i can afford in order to increase my throughput.

you might want to communicate with legato; i've gotten excellent help from their support staff via the phone; they are somewhat slow to respond to mail.

Jim Seavey (510)823-3048 {att,bellcore,sun,ames,decwrl}!pacbell!jwseave
                            or jwseave@pacbell.com

=====

This is a bug in SunOS 4.0.3. You should remove it. It only happens in
the Sun 4 family. It's fixed in SunOS 4.1.1.

-Jeanne Gee
Lockheed
jeanneg@iscnvx.lmsc.lockheed.com

=====

Hope this information isn't too late; I only just read your mail.

We had a similar problem on our 4/280 running 4.1.1 (we don't have networker
so that wasn't the problem).

The machine would hang during the fast find every night. We could simulate
it by running /usr/lib/find/updatedb manually. We found that it stopped
in one particular directory. When we went to the directory and did an ls
it would also hang. It seemed the kernel got into an endless loop. The
worrying thing was that the file system fscked with no problems.

I've forgotten how I fixed the problem now. I think I dumped, newfs'd and
restored the file system.

Bruce Arden arden@tcom.stc.co.uk
NT Europe Ltd, Welwyn Garden City, England. +44 707 377 277 (x223)
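
Finding the "one particular directory" Bruce mentions can be tedious when
the only symptom is a hung machine. One approach is to record each
directory name on disk before reading it, so that after the reboot the last
line of the log points at the suspect. This is a rough sketch with a
hypothetical function name, not the procedure Bruce used; it skips symbolic
links but, unlike find -xdev, it will descend into other file systems
mounted below the starting point.

```shell
#!/bin/sh
# probe_dir DIR LOGFILE
# Appends DIR to LOGFILE and syncs BEFORE reading DIR, then recurses into
# its subdirectories. If reading a directory hangs the machine, the last
# line of LOGFILE names that directory.
# Keep LOGFILE on a different file system from the one being probed.
probe_dir() {
    printf '%s\n' "$1" >> "$2"
    sync                                   # get the log entry onto disk first
    # Expanding the globs is what actually reads the directory.
    # (Dot-names beginning with ".." are not covered by these patterns.)
    for entry in "$1"/* "$1"/.[!.]*; do
        if [ -d "$entry" ] && [ ! -L "$entry" ]; then
            probe_dir "$entry" "$2"
        fi
    done
    return 0
}

# Example:
#   probe_dir /export/suspect /var/tmp/probe.log
#   ... after the hang and reboot ...
#   tail -1 /var/tmp/probe.log
```

The sync on every directory makes this slow, which is acceptable for a
one-off diagnostic run on a single file system.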


