Re: High Load Average (SUMMARY)

From: Kevin W. Thomas (kwthomas@nsslsun.gcn.uoknor.edu)
Date: Wed May 08 1991 - 20:52:03 CDT


Earlier today, I asked:

>System: 4/380
>OS: 4.1.1
>
>Patches:
>
>100173-03 NFS Jumbo Patch.
>100174-01 Fix tmpfs bugs.
>100188-01 TIOCCONS bug.
>100192-01 Fix color problem (white on white instead of white on black).
>
>Problem:
>One of our users reported the CONSOLE froze up early this morning while using X
>(MIT version). He tried L1-A, and nothing happened. He had to power it off
>and on to get the system to reboot.
>
>Later this morning, I started up X, and xterm'd a few windows, when the same
>thing happened to me. I tried *lots* of L1-A's, and after a while, I got a
>response. I tried continuing things to see what would happen. However, the
>system gave me abort messages when I tried to continue. So, I decided to force
>a dump. I was able to bring up the system in single user mode, and save the
>dump to a file. I ran a "ps" command on the dump and found 28 actively running
>processes, including:
>
> swapper ypserv in.routed syslogd
> nfsd update cron

After a few more occurrences, I remembered that there had been some mention of
a high load average problem on a previous sun-managers posting. Upon checking
around on my system, I found the following:

>Date: Wed, 21 Nov 90 17:11:21 EST
>From: Kennedy Lemke <Kennedy_J_Lemke@princeton.edu>
>Subject: Load peaking problem
>
>We've been having a strange problem; I hope someone else has
>seen this and knows how to fix it: we have a Sun 4/490 with
>one local IPI disk attached (a 1.2 GB CDC 9720). We are running
>SunOS 4.1. On this system, we have on the average about 50
>users running mostly interactive programs, with a few long-running
>number-crunching programs as well. We have a total of around
>3000 users, whose files live on a server machine.
>
>Soon after we made the system available, we started seeing this
>problem: occasionally, the load average of the machine shoots up
>very high (to anywhere between 10 and 100--we see this clearly
>with xnetload). The load stays high for perhaps 10 to 60 seconds,
>then returns to normal almost as quickly. During the time that
>the load is rising, all processes on the machine seem to be "hung".
>For example, if I press "return" at a prompt, nothing echoes on
>my screen, and I don't get another prompt.
>
>Once when this was happening, we halted the machine and got a
>core dump (with "g 0"). We examined the processes from the dump,
>and sure enough there were about 70 runnable processes (with an
>"R" in the state column from ps).
>
>After awhile, a user noticed that this seemed to be happening
>whenever he did "ls -l" on /dev. We confirmed that this was the
>problem. trace showed that the machine was hanging when stat(2)
>was called with /dev/id000b as the first argument; we do paging
>to the local IPI disk, of course on this partition.
>
>So today I brought the machine down, removed id000b, and did
>"MAKEDEV id000" (creating a new id000b node), but I get the
>same results.
>
>Have any of you experienced a similar problem? Anybody suspect
>this is a hardware problem? We have not noticed any problems with
>paging activity and the like--this all seems normal. This only
>occurs when stat(2) is called on /dev/id000b (and only when the
>machine is in multiuser mode).

The summary was:

>Date: Thu, 29 Nov 90 00:18:22 EST
>From: Kennedy Lemke <Kennedy_J_Lemke@princeton.edu>
>Subject: Re: Load peaking problem (SUMMARY)
>
>About a week ago I posted a query about the load average on my
>Sun 4/490 going out of control whenever stat(2) was called on
>/dev/id000b. I received 5 responses to the query with various
>good advice (installing the NFS jumbo patch, which I had done,
>installing the PMEG patch, which I hadn't done, increasing the
>number of maxusers, etc.).
>
>The easiest and most obvious solution came from trinkle@cs.purdue.edu
>who suggested simply removing the device node altogether (which
>I didn't know I could do). I did so, and now I haven't seen the
>problem since. I don't know the exact cause of the problem, nor
>the "perfect" fix, but this has done the trick for us.
>
>Perhaps this problem won't appear in 4.1.1 :-) [and maybe it will...kwt]

It then occurred to me that I made some extra tty and pty devices yesterday
as we seemed to be running low. I didn't check to see if the kernel was
configured to use them. I accessed them via stat(2), which at least didn't
hang in 4.1.1. However, when I ran X, and the device files were there, I would
always hang the system. When I removed the device files, and ran X, I would
have no problem.
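
For anyone who wants to audit /dev for stray nodes like these, here is a
minimal sketch. It is not what I ran at the time. The function name
find_device_nodes is hypothetical, the patterns "ttys?" and "ptys?" are the
ones mentioned in this message, and it assumes a host where Python is
available. Note that it calls lstat on each matching node, which is the same
kind of call that hung the 4/490 in the quoted 4.1 report, so try it on a
quiet machine first.

```python
import fnmatch
import os
import stat


def find_device_nodes(directory, patterns):
    """List entries in `directory` whose names match any glob in `patterns`.

    Returns a sorted list of (name, kind, major, minor) tuples, where kind is
    "char", "block", or "other". Hypothetical helper for illustration only.
    """
    results = []
    for name in sorted(os.listdir(directory)):
        if not any(fnmatch.fnmatchcase(name, p) for p in patterns):
            continue
        # lstat so that a symlink is reported as itself, not as its target.
        st = os.lstat(os.path.join(directory, name))
        if stat.S_ISCHR(st.st_mode):
            kind = "char"
        elif stat.S_ISBLK(st.st_mode):
            kind = "block"
        else:
            kind = "other"
        results.append((name, kind, os.major(st.st_rdev), os.minor(st.st_rdev)))
    return results


if __name__ == "__main__":
    # Assumed usage: print the tty/pty nodes so they can be compared by hand
    # against what the running kernel was configured to support.
    for entry in find_device_nodes("/dev", ["ttys?", "ptys?"]):
        print(entry)
```

The output only tells you which nodes exist and their major/minor numbers.
Whether the kernel actually has drivers configured for them still has to be
checked against the kernel configuration.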

Whether it was X itself or an application that was trying to do something with
the ttys? or ptys? devices, I don't know.

If there is an OS patch to avoid this problem, I'd appreciate hearing about it.

        Kevin W. Thomas
        National Severe Storms Laboratory
        Norman, Oklahoma


