SUMMARY: NFS Corrupting Files

From: Dave Yearke (yearke@hercules.calspan.com)
Date: Thu Oct 13 1994 - 22:17:56 CDT


I received a few suggestions, the most common one being to turn on UDP
checksums. I will probably try this, as it is a good way to detect the
problem, but I am bothered as to why the problem is occurring in the first
place. I have checked my cables, both physically and with a Time Domain
Reflectometer, and they are fine. One thing I've noted is that since we
took the suspect SBus IRIG board out, we haven't seen the problem. We're
going to put it back in and see if the problem appears again. I am also
going to upgrade my clients to 4.1.3_U1 (Solaris 2.x isn't an option at
this time), which is something I've wanted to do for a while, to eliminate
the possibility of some old NFS bugs. (Unfortunately, the first one I
upgraded is now panicking every time we run our application. Problems,
always problems ...)

Thanks to everyone for your help.

================================================================================

Original Posting:

> We have a subnet consisting of a SPARCstation 2 running SunOS 4.1.3_U1,
> 6 Sun 4/110s running SunOS 4.1.1, and 4 SPARCstation 2s running SunOS 4.1.1.
> The 4.1.3_U1 SS2 acts as a gateway to the subnet via an SBus Ethernet card,
> and is connected to our backbone using the built-in Ethernet interface.
>
> My problem is this: From time to time, the clients on the subnet are
> receiving corrupted files from NFS-mounted disks, with no error messages
> or other indications that this is happening. For example, I can run the
> "sum" command on the "tcsh" binary, and it will report a different result
> on every machine, even though it's the same copy on the same NFS-mounted
> disk! This behavior occurs with NFS disks that are mounted from the
> gateway, as well as disks that are mounted from systems on the backbone.
> It's as if the gateway is garbling the packets before they are sent out
> the SBus Ethernet interface to the subnet.
>
> The only way to solve the problem is to power-cycle the SS2 gateway. Even
> a hard reboot doesn't cure it. I should mention that we have one other
> piece of hardware in the SS2, an SBus IRIG board with a custom driver
> that we wrote.
>
> Is it possible that the SBus IRIG board, or the driver, is interfering with
> the SBus Ethernet card, and causing it to scramble data? Or, is the IRIG
> board irrelevant, and is something else the cause? In either case, why
> wouldn't we get any error messages? I have tried turning on UDP checksums
> in the kernel, and that hasn't changed anything.
>
> Any solutions or suggestions are greatly appreciated.
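
The cross-machine "sum" comparison described in the posting could be scripted along these lines. This is only a sketch: it uses SHA-256 from Python's hashlib as a stand-in for "sum", and the function names, host names and digests are made up for illustration. The reference digest is assumed to be computed on the server's local disk, since a majority vote is meaningless when every client reports something different.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_hosts(reference: str, digests: dict[str, str]) -> list[str]:
    """Return the hosts whose digest differs from the reference digest."""
    return sorted(host for host, value in digests.items() if value != reference)


# Made-up digests, as if collected by running file_digest() on each client.
reference = "aa11"  # hypothetical digest computed on the server's local disk
reported = {"client1": "aa11", "client2": "bb22", "client3": "cc33"}
print(mismatched_hosts(reference, reported))  # ['client2', 'client3']
```

Running the comparison repeatedly, before and after a power cycle of the gateway, would show whether the set of mismatching clients changes over time.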

================================================================================

Replies:

>From: Pell Emanuelsson <pell@lysator.liu.se>

I have seen this where the physical network was flaky, i.e. loose or badly
crimped contacts, etc. Try to find the cabling problem. If you can't, you
can as a last resort turn on UDP checksumming, which cures the symptoms
(corrupted files) but not the cause itself. AnswerBook (or SunSolve?)
explains how to do this.

>From: perryh@pluto.rain.com (Perry Hutchison)

You would need to enable UDP checksums on the NFS servers and all clients
in order for them to detect this sort of corruption. I believe you would
not need to enable them on the gateway in order to have protection for
mounts passing through it -- the packets should just get forwarded
without being passed up and back down the protocol stack -- and they
might not help with the gateway's own disks if the gateway is in fact
the source of the problem.
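
To illustrate why checksums have to be enabled on both ends: in UDP over IPv4, a checksum field of zero means the sender computed no checksum, so the receiver has nothing to verify. The sketch below shows the 16-bit one's-complement sum described in RFC 1071 and how a single flipped bit changes it. It is a simplification, not kernel code: a real UDP checksum also covers a pseudo-header and the UDP header, and the payload here is made up.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


payload = b"example NFS read reply data"  # made-up payload
damaged = bytearray(payload)
damaged[5] ^= 0x04  # flip one bit, as a faulty interface might

good = internet_checksum(payload)
bad = internet_checksum(bytes(damaged))
print(good != bad)  # True: the receiver would drop the damaged datagram
```

A receiver that verifies the checksum discards the damaged datagram, and the NFS client retransmits the request instead of silently accepting bad data.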

>From: Andreas Holz <Andreas.Holz@oi32.kwu.siemens.de>

Hello,

        We had the same problem in our computer science CIP pool, which
        uses a large cluster of Sun workstations and a CDC-Epix as FTP
        archive and news server. Retrieving files from the EPIX via NFS
        produced corrupt files.

        I'm not very familiar with the details, but as I remember it, the
        problem was with the UDP checksums. By default, Sun workstations
        do not generate UDP checksums, so if you receive a corrupted UDP
        packet, no error is reported.

        You have to enable checksum generation when configuring
        the kernel.

        Please treat this information as a hint and verify it by asking
        Sun or by contacting the administrators of the CIP pool
        at Erlangen University (Germany)
        (e.g. problems@cip.informatik.uni-erlangen.de).

>From: stern@sunrise.east.sun.com (Hal Stern - NE Area Systems Engineer)

there are about a half-dozen bugs in 4.1.1 that cause confusion on
the *client* side. the data is valid on the server, your view is
confused on the client -- get the NFS jumbo patch for your os
version.

>From: Dan Stromberg - OAC-DCS <strombrg@bingy.acs.uci.edu>

You may want to turn on UDP checksumming, among other things. This
should increase error checking on SunOS 4.1.x machines. It's already
on by default on SunOS 5.x machines.

A good tool for tracking down problems is "cping" off the net. It's
readily available for 4.1.x; I have a version I ported to 5.x. It
pings repeatedly, outputting a "." for a dropped packet and a
"!" for a good packet. You can then ping from/to key machines to
locate a point of failure quickly, and intermittent problems are
much easier to find with cping than with ping.
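
The "." and "!" display described above can be sketched as follows. This is not cping itself: the function names are hypothetical, and the system_ping() probe assumes a ping command that accepts "-c 1" (common on Linux and BSD; other systems, SunOS 4.1.x included, take different arguments). The demonstration uses canned results rather than a live network.

```python
import subprocess
from typing import Callable


def probe_marks(probe: Callable[[], bool], count: int) -> str:
    """Run probe() count times; '!' marks a reply, '.' a dropped packet."""
    return "".join("!" if probe() else "." for _ in range(count))


def system_ping(host: str) -> bool:
    """Hypothetical probe: send one echo request with the system ping.

    Assumes a ping that accepts "-c 1"; adjust the arguments for your OS.
    """
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Canned results standing in for eight probes across a flaky link.
canned = iter([True, True, False, True, False, False, True, True])
print(probe_marks(lambda: next(canned), 8))  # !!.!..!!
```

For a live run, something like probe_marks(lambda: system_ping("gateway"), 50) would do, where "gateway" is a placeholder host name. Runs of "." that appear only on paths crossing one machine point at that machine or its cabling.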

>From: root@ewi.ch (### SUPERVISOR ###) (Christoph Rothlin)

Maybe it's not the same, but:

We had (have) similar problems with NFS. We lost the ends of drawing files
from the CAD system MEDUSA. Most often the last bytes were 0 (zero).

We couldn't read the drawings any more. The strange thing is: it only happens
to MEDUSA CAD drawings. We're running ORACLE, GIS, AutoCAD and other SW on the
same NFS server (SS 4/690 MP2; SunOS 4.1.3_U1). We had many meetings with
Sun Switzerland and ComputerVision (developer of MEDUSA) and didn't find any
hints.

ONLY ONE! Use Solaris (SVR4) instead of SunOS! We moved all data from
the 4/690MP2 to the SC 2000 (Solaris 2.3) and since then .... we have never
lost a single bit!

All our servers are on a backbone. The backbone is connected to several
subnets via 2 routers (CISCO). On the other end we have Sun SPARCstations.
For technical reasons we have to use a multiport repeater on each subnet.
So the full path looks like:

        server -> router -> multiport-repeater -> sparc-station

Sorry. I know this is not an answer and of course not a solution. But maybe
you can put this information to some use.

================================================================================

-- 
                      Dave Yearke, yearke@calspan.com


