SUMMARY: ODS and a loose SCSI connection

From: Karen Barrett (karen@ecomon1.FMR.Com)
Date: Fri Sep 06 1996 - 11:58:50 CDT


Thanks to all who took the time to reply. Unfortunately there doesn't seem to
be any definite answer. It's too hard to tell based on the lack of information
we were able to gather when the problem occurred. Try explaining that to management
though.

Several people suggested going to ODS 4.0 to solve the problem. I agree with this,
but there are known problems running this version with Sybase, which we use a
lot of around here. We'll keep watching for solutions to the 4.0 bugs.

One person (Glenn.Satchell@uniq.com.au) suggested that we redirect our remote
console messages to somewhere where we can see them locally. It's a good idea
we are looking to implement.

Thanks again!

Karen Barrett
UNIX Systems Administrator
Fidelity Investments
karen.barrett@fmr.com

I originally wrote:

> I have a question about ODS...
>
> We are currently running ODS 2.0.1 on a sparc 20 (sol 2.4) with three
> scsi controllers. Each of our file systems on controller 0 is mirrored
> on controller 1, which includes /, /usr, swap, /home and /opt. A few days ago,
> this system hung. Unfortunately I cannot give you the exact error messages
> because the machine is in a different geographical area and the person who
> checked the console was a non-UNIX person. Basically it was having trouble
> writing to a disk and then panicked. The people there then discovered that
> a scsi cable on controller 1 appeared to be loose. Once they reconnected it,
> we rebooted and everything came up with no problem.
>
> My question is, if the system cannot write to the mirrored device, shouldn't
> it just continue operating without the mirror? If not, this kind of defeats
> the purpose of having the mirror doesn't it?
>
> Any ideas why the system might have crashed? Nothing was recorded in
> /var/adm/messages before the crash and we have no other indications of what
> might have happened. Any and all suggestions are greatly appreciated!!

Responses:

> From: "Daniel J Blander - Sr. Systems Engineer for ACS"
> <Daniel.Blander@ACSacs.Com>

>
>
> Loose SCSI cables cause very *low* level problems - inability of the SCSI
> controller to function is one....and if the two drives are (erroneously) on
> the same controller, the system will hang and potentially crash. Even the
> feedback on such a loose cable can cause a tailspin on many a
> system....Loose cables and power outages are the base cause of
> unrecoverability of mirror drive failures....it just doesn't like it....
>
>
--------------------------------------------------------------------------
> From Reggie_Stuart_at_aspenpo@smtpinet.aspensys.com Thu Sep 5 16:05:09 1996
>
> Hopefully others will mail you.
>
> Basically, when UNIX comes up and sees all its devices, it chats with
> them the whole time. When it cannot see one of them, the system will
> complain loudly, ie. by crashing. That's the nature of the SCSI
> beast. In the long run, it is better this way, because you would not
> want to go days without the mirroring taking place.
>
--------------------------------------------------------------------------
> From harry@clark-kent.decisionone.COM Thu Sep 5 16:08:19 1996
>
> No idea as to why it crashed; however, if you are running Solaris 2.4, your /home partition should in fact be /export/home. /home is a 4.x filesystem convention and is not supported by Solaris.
>
---------------------------------------------------------------------------
From twhite@bear.com Thu Sep 5 16:22:27 1996

just a guess..
but if the master - state db replica is on that scsi chain
it could cause a hang/panic because it uses that state db to
know who, what, when and where...right ?
Also .. not sure if ODS can truly handle losing an entire bus
it is prepared to lose a submirror but not a whole mirror.
---------------------------------------------------------------------------
From steve_turgeon@ppc-191.putnaminv.com Thu Sep 5 16:22:47 1996

do you have the 102580-10 patch installed......... Never mind, that's for ODS
4.0 which will solve the problem....

---------------------------------------------------------------------------
From Glenn.Satchell@uniq.com.au Thu Sep 5 20:33:29 1996

> From sun-managers-request@uniq.com.au Fri Sep 6 06:07:36 1996
>
> I have a question about ODS...
>
> We are currently running ODS 2.0.1 on a sparc 20 (sol 2.4) with three
> scsi controllers. Each of our file systems on controller 0 is mirrored
> on controller 1, which includes /, /usr, swap, /home and /opt. A few days ago,
> this system hung. Unfortunately I cannot give you the exact error messages
> because the machine is in a different geographical area and the person who
> checked the console was a non-UNIX person. Basically it was having trouble
> writing to a disk and then panicked. The people there then discovered that
> a scsi cable on controller 1 appeared to be loose. Once they reconnected it,
> we rebooted and everything came up with no problem.

Bad scsi transfers should get logged in /var/adm/messages, assuming of
course that it could write to a disk (see below). So perhaps the disk
was already offlined and the other disk had a problem of some sort?
Does metastat show all mirrors as fully synchronised?

> My question is, if the system cannot write to the mirrored device, shouldn't
> it just continue operating without the mirror? If not, this kind of defeats
> the purpose of having the mirror doesn't it?

Your assumption is correct, and is the way it "should" work.

> Any ideas why the system might have crashed? Nothing was recorded in
> /var/adm/messages before the crash and we have no other indications of what
> might have happened. Any and all suggestions are greatly appreciated!!

You might want to change /etc/syslog.conf on this system to forward the
console syslog messages to another system near you so you can keep an
eye on what's going on.
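
As a minimal sketch of that /etc/syslog.conf change (an illustration under
stated assumptions, not something taken from any of the replies): the line
below forwards to a remote machine the classes of message that a Solaris
syslog.conf typically sends to /dev/console. The host name "loghost" is a
placeholder for whichever system is near you, the selector list is an
assumption that should be checked against your own file, and the selector
and action fields must be separated by TABs rather than spaces.

```
# /etc/syslog.conf fragment (sketch; "loghost" is a placeholder host name)
# Forward console-class syslog messages to the syslogd on a remote host.
# The whitespace between the selector and the action must be TAB characters.
*.err;kern.notice;auth.notice	@loghost
```

syslogd has to re-read its configuration (for example, on receiving a HUP
signal) before the change takes effect. Note that this only captures what
passes through syslog; anything a panicking kernel prints straight to the
console may never reach the remote host.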

With no log messages, and only a vague description of the console it is
really difficult to diagnose the exact cause. I have a site nearby that
has been running 2.3 with Disksuite 2.0.1 for at least two years
without seeing anything like this problem...

One thing I would suggest is to upgrade to the latest version of
Disksuite (4.0) plus patch 102580.


