SUMMARY: out of mbufs, ethernet collisions, 3/280 server panic, stumped

From: James Ganong (jeg@ced.berkeley.edu)
Date: Fri Oct 18 1991 - 00:22:43 CDT


Conclusion: upgrade to sunos 4.1.1, and see what happens.

This was suggested by stern@sunne.East.Sun.COM (Hal Stern - Consultant)
and rusty@groan.Berkeley.EDU (Rusty Wright).

All of the suggestions I got would have been great to know when the
network was getting hammered, so I could have tried to find out why it
was happening. (Un)fortunately, my network has been behaving better in
the last week. Because all the suggestions were good, and I have not
been able to test them very well, this summary is long; maybe it will
provide useful hints to somebody facing the same problem. Here are
the suggestions people made, followed by notes from Hal Stern
with several ways to diagnose and treat the problem, and then my
original posting.

poffen@sj.ate.slb.com (Russ Poffenberger) and "Andrew Luebker"
<aahvdl@eye.psych.umn.edu> warned me about machines using the wrong
broadcast address causing a broadcast storm. I don't think any of
the unix machines on my network could have done this because they
all use yp, and they have to have the right broadcast address to function
at all.

erueg@cfgauss.uni-math.gwdg.de (Eckhard Rueggeberg) suggested that I
try programs like etherfind -v to see what processes and sockets are
involved. Once I had rebooted all the workstations and the network
was quiet it was too late to see what had been hammering it, but I
found something that may be useful to other people with sgi's on their
network. We have one mail spool directory, which is nfs mounted on all
the machines. The sgi users like to use a program called mailbox,
which is a very fancy sort of biff. It checks for mail very
frequently, and each time it does it results in a bunch of nfs
packets. Joyce Richards at SGI told me how to make it check every 10
seconds instead. In /usr/etc/inetd.conf on the sgi's, add -t 10 at the
end of the line that reads:

        sgi_fam/1 stream rpc/tcp wait root /usr/etc/fam famd

scs@lokkur.dexter.mi.us (Steve Simmons) and frutig@rdc.puc-rio.br
(Marcello Frutig) both said they had seen this problem before because
of bad ethernet hardware: usually a bad transceiver, once a bad BNC
connector, and once a piece of CATV cable put into a thinnet.
Perhaps I should suspect the network hardware instead of
the software, but because of my lack of knowledge below the "gee it is
plugged in and the little light is blinking" line, I am going to try
to fix the software first and see if that resolves it.

matt@wbst845e.xerox.com (Matt Goheen) told me another possible reason
for the network getting pounded with sunos 4.0.3: ...when a program
running on a NFS client had its executable overwritten (i.e. by a
"make") -- the program would go into some sort of intensive loop and
pound the server with ethernet packets....This was fixed in 4.1 (now
the program gets a "bus error" when the executable changes out from
under it)....

After posting my question, I found

Patch-ID# 100126-05
Synopsis: SunOS4.1 SunOS4.1_PSR_A SunOS4.1.1 SunOS4.0.3 MBUF PATCH
Topic: Doubling MBUF size

I applied this patch six days ago, and have had one crash since,
which looks like another problem with the ethernet. I am not
worrying about it now, because I am about to go to 4.1.1 and see if
it repeats. But just in case somebody is curious:

ie0: cmd not accepted
panic: iechkcca
% adb -k vmunix.7 vmcore.7
$c
physmem 3fe
_panic(0xf08ef4a) + 44
_iechkcca(0xf0b06f0) + 6c
_iecustart(0xf0b06f0) + 1c
_iedog(0xf0b06f0,0x0) + 82
_softclock(0x2000) + 84
_hardclock() + 288
level5(?)
_sleep(0xf08fe0c,0x0) + 90
_sched() + b8
_main() + 20a

Here are Hal's Notes:

From: stern@sunne.East.Sun.COM (Hal Stern - Consultant)

mbufs are "envelopes" that are used to wrap up an ethernet packet
while it is passed up- and down-stream in the TCP/IP (or other)
protocol stack. in 4.0.3, there are a fixed number of them, and
it's possible to run out.

you run out when your input load exceeds the machine's ability to
drain it. this is why rebooting everything makes the problem vanish:
you remove the source of the input load, make everything get quiet,
and there's no more traffic to handle. but as soon as you start
having bursts of traffic, you'll be using up mbufs again.

i assume you're running NFS on this machine, otherwise there wouldn't
be any reason to have such a high volume of packets thrown at it. if
the server's response time is long, and the clients are retransmitting
requests, they'll just add to the problem. furthermore, in 4.0.3,
there's no duplicate request filtering, so retransmitted requests
will pile up on the server (and consume more mbufs).
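
The point about input load exceeding the drain rate can be illustrated
with a toy model. This is only a sketch of a generic fixed-size buffer
pool, not the actual SunOS mbuf allocator; the function name, the
per-tick rates and the pool size are all made up for illustration.

```python
def simulate_pool(pool_size, arrivals_per_tick, drained_per_tick, ticks):
    """Toy model of a fixed buffer pool (NOT the real mbuf code).

    Returns (buffers still in use, packets dropped) after `ticks` steps.
    """
    in_use = 0
    dropped = 0
    for _ in range(ticks):
        # Each arriving packet claims one buffer, as long as any are free.
        free = pool_size - in_use
        accepted = min(arrivals_per_tick, free)
        dropped += arrivals_per_tick - accepted
        in_use += accepted
        # The protocol stack then releases whatever it managed to process.
        in_use -= min(drained_per_tick, in_use)
    return in_use, dropped


if __name__ == "__main__":
    # Hypothetical numbers: the pool keeps up at 5 packets/tick,
    # but starts dropping once arrivals exceed the drain rate of 10.
    print(simulate_pool(100, 5, 10, 100))
    print(simulate_pool(100, 12, 10, 100))
```

In this model a quiet network never drops anything, while a sustained
overload fills the pool and then drops packets on every tick, which
matches the "fine after a reboot, bad again under bursts" behaviour
described above.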

suggestions:
(a) upgrade to 4.1.1. it has a better ethernet driver, mbufs
        are dynamically allocated (and there's more headroom),
        duplicate NFS requests are flushed, and it's generally
        an easier system to fix.
(b) barring that, look at your ethernet input errors (using
        netstat -i). if your input error rate > 0.025% (yes, that
        small) then you should increase the input buffer space;
        instructions for doing so are attached
(c) how many nfsd daemons are you running? you may want to run more;
        if your requests are piling up because they can't be serviced,
        add more nfsd daemons. check with netstat -s to see if
        you are dropping UDP requests (look for socket overflows).
        if so, add nfsd daemons.

--hal stern
  sun microsystems
  northeast area tactical engineering

Here are the instructions Hal mentions above. He noted that 'they
only apply to 4.0.3, really, since these values are "exposed" in 4.1
in /sys/sunif/ie_conf.c'

The default number of ie ethernet receive buffers provided
in SunOS 4.0.x may be inadequate for some Sun-3 NFS servers
and cause some degradation in server NFS performance.

We suggest that customers who have Sun-3 servers running
SunOS 4.0.x do the following simple check of the ie ethernet
input error rate and apply the patch below if applicable:

Determine the percentage of ie ethernet input errors
by issuing the command:

        netstat -i

and dividing the number of input errors ("Ierrs") by the
number of received packets ("Ipkts"). If this input error rate
is higher than .025% (.00025) you should consider experimenting
with the number of ie receive buffers to see if increasing
the number of receive buffers will decrease this input error
rate.
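
As a rough sketch of that check, taking the rate to be input errors
divided by received packets: the function names and the sample counters
below are made up for illustration; read the real Ipkts and Ierrs
values off your own netstat -i output.

```python
THRESHOLD = 0.00025  # 0.025%, the figure quoted in the notes


def input_error_rate(ipkts, ierrs):
    """Fraction of received packets with input errors (Ierrs / Ipkts)."""
    if ipkts <= 0:
        raise ValueError("Ipkts must be positive")
    return ierrs / ipkts


def needs_more_receive_buffers(ipkts, ierrs):
    """True if the input error rate is above the 0.025% threshold."""
    return input_error_rate(ipkts, ierrs) > THRESHOLD


if __name__ == "__main__":
    # Hypothetical counters, not taken from any real machine.
    ipkts, ierrs = 2_000_000, 900
    print("input error rate: {:.4%}".format(input_error_rate(ipkts, ierrs)))
    print("try more ie receive buffers:",
          needs_more_receive_buffers(ipkts, ierrs))
```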

In SunOS 4.0.x, the ethernet driver parameters are not
generally configurable so you'll have to use 'adb' to modify
the on-disk version of /vmunix and reboot. In SunOS 4.1,
these parameters are in /sys/sunif/ie_conf.c and /sys/sunif/le_conf.c

Make a backup copy of /vmunix before proceeding by:

        cp /vmunix /vmunix.old

IE Ethernet Driver Parameters:

Name       Default   Comment
ie_rbds    10        receive buffer descriptors (tiny)
ie_rfds    9         receive frame descriptors (tiny)
ie_rbufs   25        receive buffers (~1500 bytes each)

Try increasing ie_rbds to 20, ie_rfds to 19, and ie_rbufs to
40 and see if this decreases the ie input error rate. Note:
ie_rfds must always be less than ie_rbds, and ie_rbufs must
always be greater than ie_rbds.

        adb -w -k /vmunix /dev/mem
        ie_rbds?W 14 <-- 14 hex is 20 decimal
        ie_rfds?W 13 <-- 13 hex is 19 decimal
        ie_rbufs?W 28 <-- 28 hex is 40 decimal
        <control-D>
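
If you want to try other values, a small helper can do the decimal to
hex conversion and sanity-check the ordering first. This is only a
sketch: the function name is hypothetical, it reads the ie_rbufs
constraint as strictly greater than ie_rbds, and it just prints the
lines to type into adb; it does not touch /vmunix.

```python
def adb_write_commands(ie_rbds, ie_rfds, ie_rbufs):
    """Return adb '?W' lines for the three ie parameters, values in hex."""
    if not ie_rfds < ie_rbds:
        raise ValueError("ie_rfds must be less than ie_rbds")
    if not ie_rbufs > ie_rbds:
        # Assumption: strict "greater than" is the safe reading.
        raise ValueError("ie_rbufs must be greater than ie_rbds")
    values = [("ie_rbds", ie_rbds), ("ie_rfds", ie_rfds), ("ie_rbufs", ie_rbufs)]
    # The notes above enter the values in hex (14 hex is 20 decimal).
    return ["{}?W {:x}".format(name, value) for name, value in values]


if __name__ == "__main__":
    for line in adb_write_commands(20, 19, 40):
        print(line)
```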

Reboot and run this modified /vmunix for a few days to see
if this change decreased the rate of ie ethernet input
errors.

If the new ie ethernet input error rate is not significantly
different from the previous value, the input errors are being
caused by some other problem, probably electrical in nature.

note that in 4.1 systems, there are "high" and "low" values;
the "high" values are used if multiple ethernet interfaces are
present.

here is my original posting:

From: jeg@ced.berkeley.edu (James Ganong)
To: sun-managers@eecs.nwu.edu
Subject: out of mbufs, ethernet collisions, 3/280 server panic, stumped

i am stumped by one of those problems that seems to disappear whenever
i get a hold of it, so i would appreciate any suggestions of what to
do to flush it out. at the end of this message is a traceback of the
last panic.

i have a sun 3/280 running sunos 4.0.3, and a mixed network with about
150 computers of varied types, sun 3's, a sparcserver, sparcstations,
iris'es, ibm pc's, vaxes, macs...all on one ethernet. yesterday
it did a panic with a message about out of mbufs.

the ethernet collision light was flashing almost continuously when
the 3/280 was plugged into the ethernet, and was mostly quiet when
it was unplugged.

sun service came out and replaced the cpu and memory boards, and we
went back to using the generic kernel. it kept crashing.

we turned off a bunch of workstations, and turned off the delni to
the iris'es. the collisions quieted down. when we turned these things
back on the collisions started again, and the 3/280 crashed, but
it was never reproducible.

we ran traffic, and some of the workstations were constantly sending
a lot of packets to the 3/280, so we rebooted all the workstations.
since we rebooted all the workstations the problem has put its
cloaking device on. we have been running for 6 hours with no crashes.
There are few ethernet collisions (about .005 Collis/Ipkts). I tried
to stimulate a crash by running xfish and having 30 background
processes running on the irises cp to a disk mounted from the 3/280.
it made the ethernet collision light shine, but no crash.

I guess the workstations were hosing the 3/280 with requests.
I want to try to figure out why the 3/280 crashed in the first place.

I can probably get the network guys on my campus to come out with a
sniffer. Do you think this would be useful to do? Maybe I should
upgrade the 3/280 to sunos 4.1.1?

thanks!
--james ganong

message printed when it hung:

ie0: no carrier
ie0: out of mbufs: output packet dropped

message printed when it crashed:

ie0: no carrier
nfs_server: bad sendreply
rpc.etherd:
trap address 0x8, pid 199, pc = f0386d2, sr = 2300, stkfmt b, context 5
Bus Error Reg 80<INVALID>
data fault address f809400 faultc 0 faultb 0 dfault 1 rw 1 size 0 fcode 5
KERNEL MODE

...numeric traceback deleted...

lands.ced.berkeley.edu# adb -k vmunix.6 vmcore.6
$c
physmem 3fe
_panic(0xf0b72ae) + 44
_trap(0x8) + 1f0
trap(?)
_mclget(0xf806380) + a
_copy_to_mbufs(0xf1c5162,0x5dc,0x0) + e0
_ieread(0xf0f9128,0xf1c0380) + 1d2
_ierecv(0xf0f9128) + 34
_ieintr(0xf0f9128) + 96
_iepoll() + de
level3() + 1c
data address not found


