SUMMARY: in-memory file corruption

From: Doug Neuhauser (doug@perry.berkeley.edu)
Date: Wed May 29 1991 - 12:15:32 CDT


Sorry that I haven't posted the summary earlier, but I wanted to wait to
ensure (to the best of my ability) that the problem had indeed been solved.
------------------------------------------------------------------------
PROBLEM: Apparent corruption of in-memory copy of files.

Configuration: 4/490 with 4 1-GB IPI disks
                64 MB memory
                4.1.1 generic kernel, NO patches applied
                Using tmpfs for /tmp

SYMPTOMS:
At least twice that I have detected, the in-memory buffered copy of a
file has become corrupted. Both times it was detected because an
executable aborted.

1. /usr/bin/csh gave segmentation fault. Although the dtm of the file
indicated that it had not been modified, I restored a copy from backup. A
cmp of the 2 files gave:
        cmp -l /usr/bin/csh ./csh
        114662 360 0
        114670 360 0
        114678 20 0
        114686 360 0
I moved the "bad" version in /usr/bin to another name and replaced it with
the new "good" version. Several hours later when I compared the files, they
were identical. Some of the diskless SLCs being served from the 4/490
server exhibited the same behavior with csh, others did not. Note that the
bytes that differ were 8 bytes apart.
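
The comparison above can be sketched in Python for anyone who wants to
check the spacing of differing bytes automatically. This is only a rough
illustration: the function names cmp_l and strides are made up for this
example, and it reads both files fully into memory. Like cmp -l, it
reports 1-based decimal offsets; cmp -l prints the byte values in octal.

```python
from pathlib import Path


def cmp_l(path_a, path_b):
    """Return (offset, byte_a, byte_b) for each differing byte.

    Offsets are 1-based, as in `cmp -l`. Only the common prefix of the
    two files is compared. (Hypothetical helper, not a real tool.)
    """
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def strides(diffs):
    """Return the distances between consecutive differing offsets."""
    offsets = [d[0] for d in diffs]
    return [later - earlier for earlier, later in zip(offsets, offsets[1:])]


# The csh differences reported above (byte values written in octal).
reported = [
    (114662, 0o360, 0),
    (114670, 0o360, 0),
    (114678, 0o20, 0),
    (114686, 0o360, 0),
]
print(strides(reported))  # prints [8, 8, 8]
```

A constant stride such as 8 points at something word- or line-oriented
(memory, cache, bus) rather than random damage to the file on disk.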

2. /usr/lang/SC0.0/as died with an illegal instruction. Again, the dtm of
the file indicated no modification. I restored a copy from backup and
compared the files. Again, several bytes were different (8 bytes apart):
        cmp -l ./as /usr/lang/SC0.0/as
          8166 46 366
          8174 7 367
          8182 200 360
This time I tried clearing the file system buffer cache by tarring a large
(40 MB) file to /dev/null. After the tar, the files compared as identical.
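
Both symptoms share one signature: the contents read back from the file
changed while the dtm did not. A minimal sketch of a check for that
signature follows. The names snapshot and silently_changed are
hypothetical, and the use of SHA-256 is my own assumption, not something
used at the time. Because the file is read through the normal buffered
path, the digest reflects whatever copy the kernel hands back, cached or
not.

```python
import hashlib
import os


def snapshot(path):
    """Return (mtime_ns, size, sha256 hex digest) for path.

    Hypothetical helper: the file is read through the ordinary
    buffered I/O path, so a corrupted cached copy changes the digest.
    """
    st = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MB chunks so large binaries need little memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return st.st_mtime_ns, st.st_size, digest.hexdigest()


def silently_changed(before, after):
    """True when contents differ although mtime and size are unchanged."""
    return before[:2] == after[:2] and before[2] != after[2]
```

Taking a snapshot of a few key binaries at intervals and comparing each
with the previous one would flag this kind of corruption before an
executable aborts.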
------------------------------------------------------------------------
SOLUTION:
For once it appears to actually have been hardware. Here is my story:

1. I had run sundiag kmem and vmem tests as well as the limited CPU test
(FPU?). No problems were detected.

2. Sun strongly suggested applying the NFS jumbo patch (100173-03), since
it is reputed to fix some UFS as well as NFS problems. So, I installed:
100173-03: Date: 01/April/91 NFS Jumbo Patch
100174-01: Date: 03-Dec-90 SunOS 4.1.1: fixes for tmpfs bugs.
100259-01: Date: 02/Apr/91 SunOS 4.1.1: ufs_inactive patch
These did NOT solve the problem.

3. I had Sun come out with a new CPU and 2 new memory boards, with the
expectation of swapping them out. Sun software support had no other
suggestions. When we booted the CPU in diag mode to run the extended PROM
diagnostics, the system continually looped, printing the following:
        Boot PROM Selftest.
        EPROM Checksum Test.
        Context Register Test.
        Region Map Write-Write-Read-Read Test.
but never reached:
        Region Map Address Test.
We replaced the CPU board, and executed all of the PROM diagnostics
successfully. The problem has not occurred in the last several weeks, so I
feel fairly confident that we have solved the problem.
------------------------------------------------------------------------
Suggestions were:
        Sun software bug (likely suspect)
        3rd-party software with privileges
        Malicious root user
        bad disk controller (possible suspect - problems only on READS)
        bad CPU board (the REAL problem)
        bad memory boards (likely suspect)
        bad ethernet port
        bad ethernet transceiver
                (probably not either of the above, since the server had
                problems as well as the client).
        tmpfs and NFS/UFS problems
        soft ECC errors
        SCSI cabling problems on SCSI disks
                (not our problem -- we only have IPI disks)
        corrupted shared libraries
                (a distinct possibility -- I had a SS1 that failed FPU tests
                when shared libc got corrupted.)

Thanks to:
From: "Ric Anderson" <ric@cs.arizona.edu>
From: curt@ecn.purdue.edu (Curt Freeland)
From: feldt@phyast.nhn.uoknor.edu (Andy Feldt)
From: bjk@pecos.rc.arizona.edu (Brian J. Kennedy)
From: John Posey <posey@utdallas.edu>
From: bparent@calvin.UCSD.EDU (Brian Parent)
From: edm@MDI.COM (Ed Morin)
From: bob@omni.com (Bob Weissman)
From: tessi!joey@nosun.West.Sun.COM (Joe Pruett)
From: mp@allegra.att.com (Mark Plotnick)
From: Gerald Justice <justice@dao.nrc.ca>
From: David Stewart <das@edee.edinburgh.ac.uk>
From: sundev!ronin!kevin@Sun.COM (Kevin Sheehan {Consulting Poster Child})
----------------------------------------------------------------
Doug Neuhauser Seismographic Station
doug@perry.berkeley.edu ESB 475, UC Berkeley
Phone: 415-642-0931 Berkeley, CA 94720


