SUMMARY: SCSI transport failed

From: Greg Weingart (gwein@ramtron.com)
Date: Thu Nov 06 1997 - 12:19:28 CST


> Hello Sun Managers,
>
> Here is my original post:
> I have a Sparc 1 4/60 running SunOS 4.1.4 that has taken a severe
> performance hit. By this I mean everything is slow: disk I/O and
> network response time, and a simple df takes 9 seconds as opposed to
> 2 seconds on a similar machine. We have not made any recent hardware
> or software changes or additions, and it had been stable for several
> months prior to this incident.
>
> I have used standard utilities (e.g. top, iostat, vmstat) to try to
> identify the cause, but with no luck.
>
> I did find the following errors in /var/adm/messages:
>
> Nov 3 02:34:49 zeus vmunix: esp0: Unrecoverable DMA error on dma
> Nov 3 02:34:49 zeus vmunix: sd3: SCSI transport failed: reason 'tran_err': retrying command
> Nov 3 03:08:48 zeus vmunix: esp0: Unrecoverable DMA error on dma
> Nov 3 03:08:48 zeus vmunix: sd0: SCSI transport failed: reason 'tran_err': retrying command
>
> This machine has 1 SCSI processor (esp0) and 3 external SCSI hard
> disks on 1 chain, which is less than 1 meter long and uses active
> termination. I have tried replacing the cables and the terminator,
> and booting from sd0 alone with the other 2 drives removed, all with
> no effect.
>
> /dev/sd0, external
> Seagate Hawk (ST32155N) 2.1 Gigabytes
>
> /dev/sd1, external
> Seagate 32430N. Megabytes
>
> /dev/sd3, external
> Seagate Barracuda (ST32171N) 2.1 Gigabytes
>
> My questions are:
>
> 1. What do the messages from the log mean?
> 2. Are (or could) the error messages be related to the performance hit?
>
> The machine is on hardware maintenance, but our vendor thinks this is
> not a hardware issue?
        
________________________________________________________________________

        The correct answer about the performance hit came from
        birger@Vest.Sdata.No:

>Start perfmeter or vmstat and look for high interrupt rates.
>I have seen this bog down systems. On one occasion a faulty
>UPS was somehow causing a ground loop on the serial port
>used to communicate with the UPS. Unplugging the serial cable
>got rid of the problem, and a new, redesigned cable from the vendor
>was provided.

>Birger
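
Birger's suggestion can be partly automated. The following is a minimal
sketch in Python and is not from the original thread. It assumes vmstat
output has been captured as text and that the output has a header row
containing an "in" (interrupts) column, with purely numeric data rows
below it. The function names and the default threshold of 1000 are
hypothetical; a sensible threshold comes from comparing against a
healthy machine of the same type.

```python
def interrupt_rates(vmstat_text):
    """Return the values found in the 'in' column of captured vmstat output.

    Assumption: a header row names the columns (including 'in') and each
    data row has one numeric field per header column.
    """
    rates = []
    col = None
    for line in vmstat_text.splitlines():
        fields = line.split()
        # A header row contains the token 'in' and does not start with a number.
        if "in" in fields and not fields[0].isdigit():
            col = fields.index("in")
            continue
        # A data row is all digits and long enough to reach the 'in' column.
        if col is not None and len(fields) > col and all(f.isdigit() for f in fields):
            rates.append(int(fields[col]))
    return rates


def suspicious(rates, threshold=1000):
    """Return samples above the (hypothetical) threshold.

    The first sample is skipped because vmstat's first line typically
    reports an average since boot rather than the current rate.
    """
    return [r for r in rates[1:] if r > threshold]
```

Feeding it the text of a run such as "vmstat 5" and comparing the result
with the same capture from a similar, healthy machine should show
whether the interrupt rate is out of line.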

        and I received several good answers about the cause and meaning
        of the SCSI errors:

        from Richard Cooper [rcooper@capecod.net]

>If this system used to perform OK, and no configuration changes have
>been made, I would suspect a problem with the on-board SCSI controller,
>or its DMA path to memory. Transport errors relate to the "esp" driver,
>which handles the on-board controller, the hardware SCSI connection,
>and the paths into memory. I could not find a man page for "esp" on my
>4.1.3 system. Is there one on 4.1.4? There are several types of SCSI
>errors which are "soft". The transfer can be retried, or the
>synchronous transfer rate can be reduced.
>Don't let your hardware vendor out of the loop yet! I don't think this
>is a software problem.

>Richard Cooper
 
        from Analyn.Buduso@analog.com

>As far as I know, "SCSI transport failed" means that the data never
>reached its destination, and sometimes this is caused by:

>1. Cables longer than six meters. In your case, it's less than 1 meter.
>2. Power surge problems. So, acquire a surge suppressor or UPS.
>3. The machine's internal disk is usually SCSI target 3. Make sure that
>external and secondary disk drives are targeted to 1, 2, or 0 and do
>not conflict with each other. Also, make sure that the tape drives are
>targeted 4 or 5.
>4. Memory configuration could also be the problem, especially for
>machines with sun4c architecture. Ensure high-capacity memory chips
>such as 4MB SIMMs are in lower banks, while lower-capacity memory
>chips such as 1MB SIMMs are in the upper banks.

>hope this helps......

        from Kevin.Sheehan@uniq.com.au

>> Nov 3 02:34:49 zeus vmunix: esp0: Unrecoverable DMA error on dma
>>
>> 1. What do the messages from the log mean?

>They mean the esp DMA chip received an error while trying to do DMA. It
>didn't seem to involve a parity error, or you would have gotten
>messages to that effect.

>> 2. Are ( or could ) the error messages be related to the performance
>> hit?

>Could be.
>>
>> The machine is on hardware maintenance, but our vendor thinks this is
>> not a hardware issue???

>Could be hardware, could be software. Depends exactly on what
>the DMA error is, and unfortunately you don't get a whole lot
>of info in that regard in 4.x.

> l & h,
> kev

        from Karl E. Vogel [vogelke@c17mis.region2.wpafb.af.mil]

>G> Are ( or could ) the error messages be related to the performance hit?

> Sure.

>G> The machine is on hardware maintenance, but our vendor thinks this is
>G> not a hardware issue???

> What on earth does your vendor think "SCSI transport failed" means?

> There's a remote possibility that all you have to do is reseat a SCSI
> board or cable, or replace a power supply, but I'd bet that you have a
> cable, board, or drive that's gone bad and needs to be replaced.

        from David Schiffrin [daves@adnc.com]

>I'd say that the messages you got in the log are related to your
>problem, and that you've got a cable, termination, disk, or esp problem.

>good luck

>-dave

My thanks to all who replied, and especially to Birger
who also diagnosed my performance problem. It seems
that the serial cable serving a printer through a Milan
Fastport on serial port B was the culprit. Once I removed
it, the machine worked fine.

I will investigate the SCSI errors further and repost my findings.
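
As a starting point for that kind of investigation, here is a minimal
sketch in Python (not part of the original thread) that tallies the two
kinds of kernel messages quoted above, per device, so that it is easy
to see whether the errors follow one disk or the whole esp0 bus. The
regular expression is based only on the four log lines in this post,
and the function name is hypothetical; other message variants would
need their own patterns.

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted in this post, e.g.
#   "... vmunix: sd3: SCSI transport failed: reason 'tran_err': ..."
#   "... vmunix: esp0: Unrecoverable DMA error on dma"
PATTERN = re.compile(
    r"vmunix: (\w+): (SCSI transport failed|Unrecoverable DMA error)"
)


def tally_scsi_errors(lines):
    """Count matching messages, keyed by (device, message kind)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the post; adjust it if the log lives elsewhere.
    with open("/var/adm/messages") as log:
        for (device, kind), n in sorted(tally_scsi_errors(log).items()):
            print(device, kind, n)
```

If the transport failures are spread across several sd devices while
every DMA error names esp0, that points toward the controller, cabling,
or termination rather than any single drive.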

Thanks again to all who took the time to reply and share your
knowledge; that's the only way this list continues to survive.

Greg Weingart
Network Systems Administrator
Ramtron International Corp.
Co Springs, CO
gwein@ramtron.com



