Well, it's not really a summary, but I was *finally* able to find
the fix.  The real problem, which is not at all obvious from
my original message to sun-managers (appended), is that
our (fast differential SCSI) Micropolis 1924D disks were timing 
out on an SS10 like this:
        esp2: Disconnected command timeout for Target 0 Lun 0
This happens both with Sun's DSBE/S and with Performance Technologies'
PT-SBUS430 controllers.  It can be triggered by any moderate disk
activity that causes lots of SCSI disconnects to be interleaved (say,
a find(1) running in parallel on each of two disks).
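
For anyone who wants to check their own disks, the trigger described above can be approximated with a small script. This is only a sketch under stated assumptions, not the procedure we actually used (we simply ran find(1) by hand): it assumes a Python interpreter is available on the host, and the mount points /disk4 and /disk6 are hypothetical placeholders for one filesystem per physical disk.

```python
import os
import threading


def walk_tree(root, counts):
    """Stat every entry under root, roughly what a bare find(1) does."""
    seen = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; keep walking
            seen += 1
    counts[root] = seen


def walk_in_parallel(roots):
    """Walk each root in its own thread so the disks are busy at the same time."""
    counts = {}
    threads = [threading.Thread(target=walk_tree, args=(root, counts))
               for root in roots]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counts


if __name__ == "__main__":
    # Hypothetical mount points, one per physical disk; adjust for your system.
    print(walk_in_parallel(["/disk4", "/disk6"]))
```

Watch the console or the system log for the esp timeout message while this runs. A clean run proves nothing by itself, since the timeouts were intermittent for us.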
I won't bore you with how many dead ends we explored.  The problem
turned out to be a bug in Micropolis' firmware.  We haven't seen
a timeout since Micropolis updated the f/w two weeks ago.  Micropolis
says that new disks are now being shipped with the fixed f/w. 
So if your 1924D's are over a month old and if you are seeing these
timeouts, then you need to talk with Micropolis.

> From fletcher Wed May 26 23:30:03 1993
> From: Fletcher Mattox <fletcher@cs.utexas.edu>
> To: sun-managers@eecs.nwu.edu
> Subject: SCSI overruns?
> 
> Our new SS10 running SunOS 4.1.3 is getting SCSI overruns.
> There are four 2.4GB Micropolis 1924 disks on this SCSI bus.
> The bus is passive terminated.  (We will soon try active termination).
> This machine has prestoserve installed, and there appears to be a
> correlation with prestoserve and the overruns.  I.e. the overruns
> haven't recurred since we turned off presto.
> 
> I don't think it's the disk since I see errors on both sd4 and sd6.
> 
> Is this a cable/termination problem?  Is prestoserve known to aggravate
> this problem?
> 
> Thanks
> Fletcher
> 
> 
> sd4:    SCSI transport failed: reason 'data_ovr': retrying command
> sd4:    SCSI transport failed: reason 'incomplete': retrying command
> sd4:    disk not responding to selection
> sd4:    disk not responding to selection
> presto: error on dev (7, 34)
> esp1:   data transfer overrun
>         State=DATA Last State=DATA_DONE
>         Latched stat=0x11<XZERO,IO> intr=0x10<BUS> fifo 0x80
>         last msg out: <unknown msg 0xff>; last msg in: COMMAND COMPLETE
>         DMA csr=0x40040010<INTEN>
>         addr=fff0017c last=fff00168 last_count=14
>         Cmd dump for Target 3 Lun 0:
>         cdb=[ 0x3 0x0 0x0 0x0 0x14 0x0 ]
>         pkt_state 0xb<XFER,SEL,ARB> pkt_flags 0x0 pkt_statistics 0x0
>         cmd_flags=0x25 cmd_timeout 35
>         Mapped Dma Space:
>                 Base = 0x168 Count = 0x14
>         Transfer History:
>                 Base = 0x168 Count = 0x14
>         current phase 0x26=DATAIN       stat=0x11       0x14
>         current phase 0x20=SELECT       stat=0x10       0x3     0x0
>         current phase 0x1=CMD_START     stat=0x10       0x3     0x20
>         current phase 0xb=CMD_CMPLT     stat=0x17       0xc00
>         current phase 0x27=STATUS       stat=0x17       0x2
>         current phase 0xb=CMD_CMPLT     stat=0x13
>         current phase 0x20=SELECT       stat=0x0        0x3     0x0
>         current phase 0x1=CMD_START     stat=0x0        0xa     0x20
>         current phase 0x20=SELECT       stat=0x0        0x3     0x0
>         current phase 0x1=CMD_START     stat=0x0        0xa     0x20
>         current phase 0x20=SELECT       stat=0x0        0x3     0x0
>         current phase 0x1=CMD_START     stat=0x0        0x3     0x20
>         current phase 0x60=SELECT_SNDMSG        stat=0x0        0x3     0x0
>         current phase 0x23=SYNCHOUT     stat=0x0        0x19    0xf
>         current phase 0x1c=RESET        stat=0x0        0x10
>         current phase 0x1c=RESET        stat=0x11       0x7
> sd4:    SCSI transport failed: reason 'data_ovr': giving up
> presto: disabling...
> sd4:    disk not responding to selection
> sd6:    SCSI transport failed: reason 'reset': retrying command
> 
> 
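
A note on reading the command dump in the quoted log: the "cdb=[ ... ]" line is a 6-byte SCSI command descriptor block, where byte 0 is the opcode and, for REQUEST SENSE and INQUIRY, byte 4 is the allocation length. The sketch below decodes such a line. The function name decode_cdb6 and the small opcode table are illustrative choices, not part of any Sun tool, and only a handful of opcodes are covered.

```python
import re

# A few 6-byte SCSI command opcodes; deliberately incomplete.
OPCODES = {
    0x00: "TEST UNIT READY",
    0x03: "REQUEST SENSE",
    0x08: "READ(6)",
    0x0A: "WRITE(6)",
    0x12: "INQUIRY",
}


def decode_cdb6(line):
    """Decode a 'cdb=[ ... ]' log line that holds a 6-byte CDB."""
    match = re.search(r"cdb=\[([^\]]*)\]", line)
    if match is None:
        raise ValueError("no cdb=[...] field in line")
    cdb = [int(token, 16) for token in match.group(1).split()]
    if len(cdb) != 6:
        raise ValueError("expected a 6-byte CDB")
    opcode = cdb[0]
    info = {"opcode": opcode, "name": OPCODES.get(opcode, "unknown")}
    if opcode in (0x03, 0x12):
        # Byte 4 is the allocation length (maximum bytes the host accepts).
        info["allocation_length"] = cdb[4]
    elif opcode in (0x08, 0x0A):
        # 21-bit logical block address; the top 3 bits of byte 1 are the LUN.
        info["lba"] = ((cdb[1] & 0x1F) << 16) | (cdb[2] << 8) | cdb[3]
        # A transfer length of 0 means 256 blocks for READ(6)/WRITE(6).
        info["blocks"] = cdb[4] or 256
    return info


if __name__ == "__main__":
    print(decode_cdb6("cdb=[ 0x3 0x0 0x0 0x0 0x14 0x0 ]"))
```

Applied to the dump above, this reports REQUEST SENSE with an allocation length of 20 bytes, which agrees with the "Count = 0x14" in the mapped DMA space. So the command that overran appears to be a sense request for Target 3 rather than an ordinary read or write.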