SUMMARY: Sun 4/670 MP crashes

From: William K.P. Chan (wkpchan@csd.hku.hk)
Date: Sun Apr 04 1993 - 16:07:46 CDT


We apologize for the late summary; we have been trying to determine which
of the suggested solutions is the correct one. Unfortunately, we have not
reached a conclusion yet. Our temporary solution is as follows.

We have applied three patches to the 4.1.3 kernel, and we monitor disk
usage to keep the file systems from filling up. There have been no
crashes since the patches were applied, but we need some time to see
whether the system remains stable.
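
A minimal sketch of such disk-usage monitoring, in present-day Python. It
is not the procedure actually used at the site. The 90% threshold, the
mount-point list, and the function names are assumptions made for
illustration.

```python
import shutil


def percent_used(total, used):
    """Return the percentage of a file system in use (0.0 if total is 0)."""
    return 100.0 * used / total if total else 0.0


def nearly_full(usage_by_mount, threshold=90.0):
    """Return, sorted, the mount points at or above threshold percent used.

    usage_by_mount maps a mount point to (total_bytes, used_bytes).
    The 90% default is an assumed value, not one from the posting.
    """
    return sorted(mount for mount, (total, used) in usage_by_mount.items()
                  if percent_used(total, used) >= threshold)


def collect_usage(mount_points):
    """Query each mount point with shutil.disk_usage."""
    usage = {}
    for mount in mount_points:
        du = shutil.disk_usage(mount)  # named tuple: total, used, free
        usage[mount] = (du.total, du.used)
    return usage


if __name__ == "__main__":
    # Hypothetical mount list; replace with the real file systems.
    for mount in nearly_full(collect_usage(["/"])):
        print("WARNING: %s is nearly full" % mount)
```

Run periodically (for example from cron), a check like this gives a
warning before "file system full" messages start to appear on the console.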

My original posting:

>Dear Sun managers,
>
>Our Sun 4/670 MP has crashed several times recently. The crashes can
>occur at any time and leave the server hung. We suspect that they
>may be related to a full file system because "file system
>full" messages are logged on the console before the crashes.
>
>The crashes have occurred more frequently (10 times!) since March when
>the SBus Buffered Ethernet/Fast SCSI Card was connected to the machine
>and the OS version was upgraded to 4.1.3 from 4.1.2. We have reported
>this problem to the Sun VAR personnel, and their customer support
>engineers investigated it and replaced one of the CPU
>boards a few weeks ago. But the same problem persists.
>
>Technical information on the problematic system is as follows:
>
>Sun SPARCserver 670MP M120
> 2 processors
> 128 MB RAM
> 5 fixed disks attached
> four on the onboard SCSI port (each with a capacity of 1 to 1.4 GB)
> one on the new BEFS card (2.1 GB)
> built-in CD-ROM
> 2.3 GB tape backup
>
>Operating system: SunOS 4.1.3 with no patches
>Network services: NFS, NIS, DNS, print and mail server
>Telnet clients under normal use: 40 - 70
>System load: 10 - 20
>Network client operating systems:
> SunOS 4.1.1, 4.1.2
> AIX 3.1.5, 3.2
> HP-UX 7.0, 9.0
> IRIX 4.0.2, 4.0.5
>
>Other systems on the network:
> 35 Sun SPARCstations IPC
> 20 Sun SPARCclassic (added a few days ago)
> ~35 others including HP, IBM, SGI
> 100 PC compatibles and Macintoshes
>
>The following were console messages logged during system crashes.
>------------------------------------------------------------------------
>Mar 1 13:29:28 sunmp vmunix: panic on cpu 0: free: freeing free block
>Mar 10 10:03:19 sunmp vmunix: panic on cpu 1: error in swapping in u-area
>Mar 17 16:24:09 sunmp vmunix: panic on cpu 1: Data fault
>Mar 19 13:25:45 sunmp vmunix: panic on cpu 1: alloccgblk: cyl groups corrupted
>Mar 22 17:23:09 sunmp vmunix: panic on cpu 0: mapsearch: map corrupted
>Mar 22 21:58:02 sunmp vmunix: panic on cpu 1: mapsearch: map corrupted
>Mar 23 20:12:21 sunmp vmunix: panic on cpu 0: mapsearch: map corrupted
>Mar 23 20:23:06 sunmp vmunix: panic on cpu 1: ifree: freeing free inode
>Mar 24 14:59:55 sunmp vmunix: panic on cpu 0: alloccgblk: can't find blk in cyl
>
>=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>William K.P. Chan Email : wkpchan@csd.hku.hk
>Department of Computer Science Tel : (+852) 859 2187
>The University of Hong Kong Fax : (+852) 559 8447
>=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
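
The panic messages quoted above can be tallied by CPU and by message
text. The Python sketch below only counts what is in the posted log; the
names LOG, PANIC, and tally are illustrative.

```python
import re
from collections import Counter

# Console lines copied from the original posting.
LOG = """\
Mar 1 13:29:28 sunmp vmunix: panic on cpu 0: free: freeing free block
Mar 10 10:03:19 sunmp vmunix: panic on cpu 1: error in swapping in u-area
Mar 17 16:24:09 sunmp vmunix: panic on cpu 1: Data fault
Mar 19 13:25:45 sunmp vmunix: panic on cpu 1: alloccgblk: cyl groups corrupted
Mar 22 17:23:09 sunmp vmunix: panic on cpu 0: mapsearch: map corrupted
Mar 22 21:58:02 sunmp vmunix: panic on cpu 1: mapsearch: map corrupted
Mar 23 20:12:21 sunmp vmunix: panic on cpu 0: mapsearch: map corrupted
Mar 23 20:23:06 sunmp vmunix: panic on cpu 1: ifree: freeing free inode
Mar 24 14:59:55 sunmp vmunix: panic on cpu 0: alloccgblk: can't find blk in cyl
"""

# Captures the CPU number and the panic message that follows it.
PANIC = re.compile(r"panic on cpu (\d+): (.*)$")


def tally(log_text):
    """Count panics per CPU and per message text."""
    by_cpu, by_msg = Counter(), Counter()
    for line in log_text.splitlines():
        match = PANIC.search(line)
        if match:
            by_cpu[int(match.group(1))] += 1
            by_msg[match.group(2).strip()] += 1
    return by_cpu, by_msg


if __name__ == "__main__":
    by_cpu, by_msg = tally(LOG)
    for cpu, count in sorted(by_cpu.items()):
        print("cpu %d: %d panics" % (cpu, count))
    for msg, count in by_msg.most_common():
        print("%d x %s" % (count, msg))
```

On the nine lines shown, four panics were reported on cpu 0 and five on
cpu 1, and "mapsearch: map corrupted" is the only message that repeats
(three times).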

The suggestions and proposed solutions include:

1. The Fast SCSI/Buffered Ethernet card might be faulty. Replace the card.

2. It is caused by an inconsistent filesystem. Other panics suggest
    that something in the filesystem itself is corrupted. Bring
    the system up in single-user mode and fsck all your filesystems.

3. The parent directory of the swap file directories was exported
    without root permission.

4. The problem is related to quotas.

5. Applying the following patches:
        kernel jumbo 100726-02
        NFS jumbo 100173-09
        UFS jumbo 100623-03 (for the "free: freeing free block" panic)

6. Double check cables/terminators looking for bent/missing pins.

7. Check with the disk vendor for firmware upgrades for your disks.

8. Move another one or two disks onto the second SCSI interface to
    reduce the total cable length.

9. Check your partitions carefully. It looks like there is an overlap in
    your disk partitions.

10. UNIX is not happy when it is using file systems that are very full.
    Buy a new disk.
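
Suggestion 9 can be checked mechanically once the start sector and size of
each partition have been copied by hand from the disk label. The Python
sketch below is one way to do that. The function name, the sample layout,
and the convention of skipping partition "c" (assumed here to span the
whole disk on purpose) are assumptions, not details from the replies.

```python
def overlaps(partitions, whole_disk=("c",)):
    """Return pairs of partition names whose sector ranges intersect.

    partitions maps a name to (start_sector, sector_count).
    Names listed in whole_disk are skipped, on the assumption that
    they deliberately cover the entire disk. Empty partitions
    (sector_count of 0) are ignored.
    """
    ranges = sorted((start, start + count, name)
                    for name, (start, count) in partitions.items()
                    if name not in whole_disk and count > 0)
    found = []
    for i, (_, end1, name1) in enumerate(ranges):
        for start2, _, name2 in ranges[i + 1:]:
            if start2 >= end1:
                break  # sorted by start: nothing later can overlap name1
            found.append((name1, name2))
    return found


if __name__ == "__main__":
    # Hypothetical layout: "g" starts before "b" ends.
    layout = {
        "a": (0, 1000),
        "b": (1000, 2000),
        "c": (0, 10000),
        "g": (2500, 7500),
    }
    for first, second in overlaps(layout):
        print("partitions %s and %s overlap" % (first, second))
```

An empty result means no two partitions, other than the skipped
whole-disk one, share any sectors.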

Thanks to those who responded.

From: David Wiseman <magi@csd.uwo.ca>
From: richard@langeoog.gmd.de (Richard Czech)
From: arossite@us.oracle.com (Bruce Rossiter)
From: Dave Mitchell <D.Mitchell@dcs.sheffield.ac.uk>
From: Christian Lawrence <cal@soac.bellcore.com>
From: andrico@parcplace.com (Liesl Andrico)
From: hedrick@klinzhai.rutgers.edu (Charles Hedrick)
From: ups!upstage!glenn@fourx.Aus.Sun.COM (Glenn Satchell - Uniq Professional Services)
From: cedept@tgsj.TriGem.co.kr (CE DEPT)
From: c3314jcl@mercury.nwac.sea06.navy.mil (Johnson Lew)
From: Hongchao Dong <hdong@crhc.uiuc.edu>
From: sven@alkestis.mpim-bonn.mpg.de (Sven Maurmann)

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
William K.P. Chan Email : wkpchan@csd.hku.hk
Department of Computer Science Tel : (+852) 859 2187
The University of Hong Kong Fax : (+852) 559 8447
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


