summary: nfs clients getting hung intermittently (LONG)

From: r.d. parachoniak (rap@physics.ubc.ca)
Date: Tue May 28 1991 - 13:29:07 CDT


Here is a summary of the responses that I received regarding the
above-mentioned problem. Unfortunately, none of the suggestions has worked
for me so far.

Since a couple of people thought it might be the bridge that was causing
problems, I disconnected the bridge from the concentrator so that the
network consisted of just the clients and servers connected through the
concentrator. The problem persisted, so I reduced the rsize and wsize to
2048 on the home partitions that get nfs mounted. This also doesn't appear
to have made any difference, as I still get many "nfs server not responding"
messages on the clients. (It just occurred to me that perhaps I should
change the read and write sizes on the root and usr partitions as well?)
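
(In case anyone wants to check my work: the change just amounts to adding
the rsize/wsize options to the nfs lines in each client's /etc/fstab and
remounting. An entry ends up looking roughly like the one below -- the host
and path are the ones from the nfsstat -m output in my original posting
further down; the exact option list on your systems may differ:

    galileo:/home/galileo  /home/galileo  nfs  rw,hard,rsize=2048,wsize=2048  0  0

I mention it in case someone can spot a problem with the options themselves.)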

We also tried connecting a client and the server directly (bypassing the
concentrator), but in the short time available to us for testing we were
unable to duplicate the problem. (My "du" test worked regardless of whether
the concentrator was connected or not.)

I tried to use vmstat to monitor the paging/io rates as suggested by
markw@utig.ig.utexas.edu. I haven't been able to run it while the system is
hung, but the output on one of the servers *immediately* after the hang
ends is as follows:

galileo{rap}: vmstat 1
procs memory page disk faults cpu
r b w avm fre re at pi po fr de sr d0 s1 d2 s3 in sy cs us sy id
0 0 0 0 244 0 7 1 0 0 0 0 1 0 0 0 15 21 14 0 1 99
0 0 0 0 244 0 4 88 28 96 0 40 5 18 0 0 381 187 283 0 22 78
0 0 0 0 244 0 3 76 24 76 0 32 15 16 0 0 361 159 269 0 16 84
0 1 0 0 248 0 4 80 40 72 0 27 13 24 0 0 412 136 293 0 19 81
0 1 0 0 248 0 3 84 48 76 0 34 8 20 0 0 440 118 308 0 19 81
0 1 0 0 248 0 3 68 40 64 0 27 20 0 0 0 434 104 340 0 9 91
0 1 0 0 248 0 2 52 32 48 0 21 2 0 0 0 356 92 287 0 1 99
0 1 0 0 248 0 2 40 32 44 0 16 15 0 0 0 317 83 246 0 7 93
0 1 0 0 248 0 8 56 32 60 0 20 13 22 0 0 317 76 254 0 11 89
0 1 0 0 248 0 9 44 24 64 0 22 10 30 0 0 327 72 208 0 3 97
0 1 0 0 248 0 7 32 16 48 0 17 0 0 0 0 274 67 168 0 1 99
0 1 0 0 248 0 5 24 12 36 0 13 0 0 0 0 219 63 136 0 1 99
0 1 0 0 248 0 4 16 16 36 0 10 11 4 0 0 206 60 121 1 8 91
0 1 0 0 248 0 3 36 12 44 0 14 0 28 0 0 225 57 154 0 20 80
0 1 0 0 248 0 2 32 16 36 0 12 0 1 0 0 229 55 152 0 1 99
0 1 0 0 248 0 1 24 12 28 0 9 0 0 0 0 184 53 124 0 1 99
0 1 0 0 248 0 0 16 8 20 0 7 0 0 0 0 147 52 101 0 1 99

Does this suggest anything abnormal?

Any other suggestions or hints would be appreciated.

Many thanks to the following people for their suggestions:
stern@sunne.East.Sun.COM
gmc@premises1.quotron.com
daryl@oceanus.mitre.org
hwa@renesse.sscnet.ucla.edu
markw@utig.ig.utexas.edu

#### my original posting ####

Problem: Very poor nfs performance; clients get hung for anywhere from
          a few seconds to a few minutes when running a large application
          program (Arbortext Publisher; it may happen with other programs
          too, but this is almost the only one being used on these machines).
          By hung I mean there is no response to keyboard input. The mouse
          pointer moves but does not respond to buttons. While a client is
          hung, its server is typically idle, without anyone even logged in.

          In trying to come up with a repeatable scenario, what I did was
          open 3 windows on the client and issue a "du" in each one.
          The du processes produced output in each window, but while they
          were running they could not be interrupted, i.e. a ctrl-c was
          completely ignored! While the du processes were running, the
          Publisher job in another window was completely frozen.

The setup:
                    connection to rest of campus
                                 |
                                 |
                          ----------------
                          |    bridge    |
                          ----------------
                                 |
                        --------------------
                        |   concentrator   |
                        --------------------
                         |  |  |  |  |  |  |
                         |  C1 C2 C3 |  C4 C5
                         |           |
                         S1          S2

- S1 and S2 are SS2's with 16Mb memory and attached disks (Wren IV's)
  They are running SunOS 4.1.1 with the following patches installed:
        100173-03 - NFS Jumbo patch
        100232-01 - Watchdog reset, etc.
        100125-02 - in.telnetd replacement (security)
        100224-01 - /bin/mail replacement (security)
  They do have a custom kernel installed, which just limits the number
  of device drivers and includes the above patches.

- C1 to C5 are diskless SLC clients (three on S1, two on S2)

- All connections are via twisted pair through transceivers at the suns.

- netstat -s shows that there are no socket overflows.
- netstat -i shows < 0.015% error rate.
- all machines have only one ethernet interface.
- in.routed is not running
- a typical nfsstat -rc on one of the clients yields:

Client rpc:
calls badcalls retrans badxid timeout wait newcred timers
69335 71 4066 4 4134 0 0 3367

This would seem to indicate that packets are being lost between the client
and the server (badxid is small). I don't know why timers is so large, since
the client and server are on the same logical network (i.e. dynamic
retransmission should not be enabled).
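
(To put rough numbers on that: retrans/calls is 4066/69335, i.e. about a
5.9% retransmission rate, which I gather is well above the few-percent level
usually considered acceptable. And since badxid (4) is tiny compared with
retrans (4066), the retransmitted requests are apparently never being
answered at all rather than being answered late -- duplicate replies from a
merely slow server would show up as a large badxid.)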

- typical nfsstat -m yields (in part)

/home/galileo from galileo:/home/galileo (Addr 128.xxx.yyy.zzz)
Flags: hard read size=8192, write size=8192, count = 5
Lookups: srtt=9 (22ms), dev=6 (30ms), cur=4 (80ms)
Reads: srtt=10 (25ms), dev=6 (30ms), cur=4 (80ms)
Writes: srtt=37 (92ms), dev=6 (30ms), cur=7 (140ms)
All: srtt=10 (25ms), dev=6 (30ms), cur=4 (80ms)

While trying to figure out what might be going on, we connected a bridge
between one of the servers and the concentrator and the problem seemed
to go away (we are doing more testing to confirm this; in my du test
I was able to ctrl-c out of the du process).
Would this indicate that packets are getting mangled by the server?

We also tried connecting a client and a server directly through a delni
box, but that didn't appear to do any good (i.e. it would seem to indicate
that the concentrator is not the problem).

Anyone have any ideas as to what might be going on here? Sorry if my
explanation of the problem is somewhat long-winded and unclear, but I'd be
glad to supply any further details that might be relevant.

##### here are the responses I rec'd. Thanks again to all of you. #####

From: <stern@sunne.East.Sun.COM>
To: <rap@physics.ubc.ca>
Subject: Re: nfs clients getting hung intermittently

your explanation of your problem and configuration was among the
best i've ever seen on sun-managers, and it leads me to a guess
as to your problem:

i'll bet your bridge is saturated or almost saturated and is
dropping parts of long read/write trains between client and
server. to test this out, reduce the rsize/wsize on the clients
to 1024 or 2048 and see if the retransmissions go away.
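
a quick way to see whether the change helps, if memory serves (nfsstat -z
needs root and just zeroes the client-side counters):

    client# nfsstat -z          (zero the rpc/nfs counters)
    client# du /home/galileo    (re-run whatever triggers the hangs)
    client# nfsstat -rc         (see whether retrans and timeout have dropped)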

when the bridge is not filtering decnet/appletalk, your small
network is overloaded: think of what the bridge is doing on
the server side -- it must really be filtering like crazy.
spending that much effort on packet sorting, it probably can't
keep up with the burst rates generated by client or server.

dynamic retransmission won't be enabled since client/server
are on the same IP network, but the timers still go off to
gauge the "expected" values. the average/current/expected times
are still maintained, even if dynamic retransmission/resizing
isn't done. that's why you see the timers field increasing.

if bridge dropouts are really your problem, that would explain
why you hang up for minutes at a time: the retransmission cycle
may go through several minor and major timeouts before you
finally get enough "breathing room" to get a request through
to the server and a response back to the client. if you use
the default NFS RPC timeout of 0.7 seconds, with retrans=5,
then your first major timeout cycle is about 11 seconds, and
the second one runs about 30 seconds.
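
(spelling out that arithmetic, roughly, as i understand the back-off: the
timeout doubles on each retry, so 0.7 + 1.4 + 2.8 + 5.6 is about 10.5
seconds before the first "not responding" message; if the next cycle starts
over with the base timeout doubled, 1.4 + 2.8 + 5.6 + 11.2 adds about 21
seconds more, i.e. roughly 30 seconds cumulative for the second one.)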

shameless plug: o'reilly & associates has just published
my book on NFS and NIS, which covers this kind of problem
in quite a bit of detail. there are 3 chapters on network
tuning and debugging. i'll probably send info to sun-managers
about it later this week, if i remember

--hal stern
  sun microsystems
  northeast area consulting group

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
From: <gmc@premises1.quotron.com>
To: "r.d. parachoniak" <rap@physics.ubc.ca>
Subject: nfs clients getting hung intermittently

You didn't say where your server was located, but if it is on the
other side of the bridge, I would suspect that it is the culprit. I
have seen flaky or underpowered bridges cause the behavior you are
experiencing. Try installing a different bridge with more horsepower
(i.e., packet forwarding rate), if possible, and see if that helps.

Greg
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
From: <daryl@oceanus.mitre.org>
To: <rap@physics.ubc.ca>
Subject: Re: nfs clients getting hung intermittently

Ron,

I'm not much help in the NFS problem, but we have had a lot of experience
with Arbortext's "The Publisher". I've seen periodic slowdowns and momentary
"hangs" when running the SunView version of 'publisher' under OpenWindows.
The X11 version 'xpublisher' runs better under OpenWindows but the X11
version of 'pubdraw' is clumsy so I don't use it. We've had best success
with 'publisher' under SunView instead of OpenWindows.

We've found that 'pubdraw' with lots of text gets very slow during
magnification changes. Sometimes a user with a very large document forgets
that in our system, the publisher "autosave" feature is enabled by default
and it may take a minute or two every 15 minutes to checkpoint the document.

Our department at Mitre has used "The Publisher" for several years (since
revision 2.0). Despite initial grumblings about bugs and inconsistencies
in the interface, we have become dependent on publisher for almost all our
document production (letters, memos, technical papers, technical reports,
books, viewgraphs, notices, diagrams, etc.). The secretaries have all
had instruction classes from Arbortext and we are now smoothly "publishering"
along. Version 3.4a works very nicely.

I'm very interested in hearing success or horror stories about "The Publisher".
I'm especially interested in problems that cause the slowdowns and hangs.

If you find a real NFS problem and/or solution, I'd be very interested.

        Daryl Crandall
        The Mitre Corporation
        daryl@mitre.org
        (703) 883-7278

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

From: <hwa@renesse.sscnet.ucla.edu>
To: <rap@physics.ubc.ca>
Subject: PUBLISHER on SLCs

I had a similar problem on our SLCs; I fixed it after doing the following:
        1. Reconfigure the SLC's kernel (the size is now 1100 blocks).
        2. Reconfigure the boot machine's kernel to get rid of unneeded
           code (it's an IPC).
        3. Increase the number of nfsd daemons to 16.
        4. The nfsd change in item 3 is on the server.
I hope this will help.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
From: <daryl@oceanus.mitre.org>
To: <rap@physics.ubc.ca>
Subject: Re: nfs clients getting hung intermittently

Ron,

A matter of tuning comes to mind. Have you modified the GENERIC kernel
of your server to accommodate the extra demands of the nfs daemons?

How many 'nfsd' daemons get started by /etc/rc.local?

What is your configuration?

        # of servers
        type of server (Sun4, Sun3, SPARCstation1, SPARCstation2)
        type of clients (architecture, diskless, dataless)
        amount of memory in server
        amount of disk on server
        type of disks on server (SCSI, SMD, IPI)
        total # of nodes on your ethernet

We had NFS problems until we tuned the system by increasing the MAXUSERS
variable in the kernel configuration file /sys/sun4/conf/GENERIC (or
whatever). Our big servers (Sun4/390) have 64 or 128 MAXUSERS. This gives
extra table space for the extra processes that run on a busy server.
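
In case it saves some digging, the build cycle is roughly as follows (from
memory of the Sun manuals -- PHYSICS is just a made-up config name, and on
a SPARCstation server the directory is sun4c rather than sun4):

    server# cd /sys/sun4/conf
    server# cp GENERIC PHYSICS
    server# vi PHYSICS                  (raise the "maxusers" line, e.g. to 64)
    server# config PHYSICS
    server# cd ../PHYSICS; make
    server# cp /vmunix /vmunix.old; cp vmunix /vmunix
    server# reboot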

We also modified a variable in the kernel using the 'adb' debugger to
give us more table space for something. This made a dramatic difference.
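
I don't recall offhand which variable it was, but for reference the general
shape of an adb patch is something like the lines below ('some_variable' is
only a placeholder, and 0t64 is adb's notation for decimal 64); the ?W form
patches the value in the /vmunix file on disk, the /W form patches the
running kernel through /dev/kmem, and $q quits:

    server# adb -w /vmunix /dev/kmem
    some_variable?W 0t64
    some_variable/W 0t64
    $q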

We also started more NFS daemons ('nfsd') to accommodate the large number
of diskless clients we have. We run 32 daemons on a server with 24 diskless
clients.
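
If it isn't obvious where that number lives: the stock SunOS 4.x
/etc/rc.local starts the server daemons with a line along the lines of

    nfsd 8 & echo -n ' nfsd'

(inside the block that checks for /etc/exports, if I remember right), so
raising the count just means changing that 8 to 16 or 32 and rebooting, or
killing the nfsds and restarting them by hand with 'nfsd 32 &'.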

If these ideas haven't already been tried, I can send you details of what
we did, and I can send you a copy of the Sun document that identifies them.

We went from an average load of 16 to an average load of 3, which was
a BIG improvement. I'm a real guru around here now. Your problems may fit
right into this situation.

        - daryl -

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
From: <markw@utig.ig.utexas.edu>
To: <rap@physics.ubc.ca>
Subject: sun-mgrs: nfs hangs?

You seem to believe that your problem is network related.

I have observed unexplained hangups when io/paging rates get very
high on several of our systems running 4.1 or 4.1.1, but not 4.03.
I wonder what the output of "vmstat 1" sez on the server just before
such hangups?

mw
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


