SUMMARY: Are my IPIs dying?

From: B.Rea@csc.canterbury.ac.nz
Date: Tue Feb 11 1992 - 22:16:14 CST


I got the following errors on a 490 with two 1 GB, 3 MB/sec IPI disks and
64 MB of RAM, running SunOS 4.1.1. Both id000b and id001b are swap
partitions. At the time, the machine was being beaten to death by a Maple
user making heavy use of the swap space. The errors have not recurred, so
hopefully we are in the clear. We have also just upgraded to a 690MP.

Feb 7 08:30:40 cantua vmunix: idc0: ctlr message: 'panic: freebqe: nonzero ref count '
Feb 7 08:30:40 cantua vmunix: idc0: ctlr message: 'Did panic dump to drive 0 '
Feb 7 08:31:50 cantua vmunix: ipi 1: missing interrupt. refnum 38e
Feb 7 08:31:50 cantua vmunix: id001b: block 81168 (114378 abs): write: missing interrupt - attempting recovery
Feb 7 08:31:50 cantua vmunix: ipi 0: missing interrupt. refnum 38d
Feb 7 08:31:50 cantua vmunix: id000b: block 92000 (125210 abs): read: missing interrupt - recovery in progress

[several more missing interrupts]

Feb 7 08:31:50 cantua vmunix: is0: resetting slave
Feb 7 08:31:50 cantua vmunix: idc0: ctlr message: 'FW revision date = 8/4/89 , level = 254 '
Feb 7 08:31:50 cantua vmunix: idc0: Recovery complete.

I had replies from:

tlr@toy.rad.msu.edu Terry Rosenbaum
Chris.Drake@Corp.Sun.COM Chris Drake
ira@cis.upenn.edu Ira Winston

-----
Two solutions to similar problems were replacing the CPU board
(Terry Rosenbaum) and upgrading to the latest IPI controllers (Ira Winston).
A detailed reply from Chris Drake follows.

---------
From: IN%"Chris.Drake@Corp.Sun.COM" 8-FEB-1992 07:55:01.10

The controller for the IPI disks is reasonably complex - I believe it has a
68030 on it with a fairly large chunk of PROM software... The first message
indicates a software problem with the controller, definitely - it did a panic
dump and restarted... The missing interrupts may well be due to the restart,
so unless they continued past your sample it is likely that they are not serious
(especially since the controller indicated that recovery was complete).

If this happened once and once only, then it's likely that it was a glitch but
not a serious one. If it happens again or has been occurring, then I'd say
the controller is having serious problems and should either be swapped out or
the firmware should be checked out and possibly upgraded (I don't know offhand
what the current revision level is). In any case, you might contact the local
hardware support office and see what the firmware is *supposed* to be; it may
be worth upgrading anyway (but maybe at a lower priority... :-)

I don't think there is anything wrong with the particular partitions mentioned;
they were just the ones in use at the time the controller burped.

        Chris Drake
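
Chris Drake's rule of thumb above (a single controller panic is probably a
glitch, repeated panics point to a controller or firmware problem) can be
checked against a saved copy of the system log. The following is only a
minimal sketch in Python: the function names and the wording of the verdicts
are made up for illustration, and the pattern assumes log lines shaped like
the ones quoted in this summary.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt in this summary:
#   Feb 7 08:30:40 cantua vmunix: idc0: ctlr message: 'panic: freebqe: ...
PANIC_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+[\d:]{8}\s+\S+\s+vmunix:\s+"
    r"(?P<ctlr>idc\d+): ctlr message: 'panic:"
)


def count_controller_panics(lines):
    """Count controller panic messages per (controller, 'Mon D') pair."""
    panics = Counter()
    for line in lines:
        m = PANIC_RE.match(line)
        if m:
            day = f"{m.group('month')} {int(m.group('day'))}"
            panics[(m.group("ctlr"), day)] += 1
    return panics


def assess(panics):
    """Hypothetical verdicts that paraphrase the advice in the reply."""
    total = sum(panics.values())
    if total == 0:
        return "no controller panics logged"
    if total == 1:
        return "single panic: probably a one-off glitch"
    return "repeated panics: check firmware or swap the controller"
```

To use it, pass the lines of the log file (for example, a copy of the
messages file opened with open()) to count_controller_panics() and hand the
result to assess(). Only the first line of each panic record is matched, so
it does not matter whether long messages were wrapped in transit.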

                                                                      ___
Bill Rea                                                             (o o)
--------------------------------------------------------------------w--U--w---
| Bill Rea, Computer Services Centre | E-Mail b.rea@csc.canterbury.ac.nz     |
| University of Canterbury           | or cctr114@csc.canterbury.ac.nz       |
| Christchurch, New Zealand          | Phone +64 3 642-331 Fax +64 3 642-999 |
------------------------------------------------------------------------------


