Hosts disconnect from fabric causing I/O errors

Michael Greenberg ztorage at yahoo.com
Thu Nov 20 17:05:16 EST 2008


Hello Gurus,

I have a real nasty one here. I set up a new fabric using Brocade 5000 switches (FOS 5.2.1). ALL of the hosts (Solaris 9, Solaris 10, Nevada, ESX) get disconnected and reconnected arbitrarily. Sometimes that causes so many failovers and SCSI errors that the filesystems crash with I/O errors. It happens with hosts running VxVM and ZFS alike (although ZFS seems to be more stable).
The target is a Thumper running COMSTAR.

The fabric was functioning just fine for a few weeks, until a week ago I had a different server die on me like that. ZFS usually recovers alright, but with the VxVM volumes, it's a pain (umount, cfgadm, vxconfigd, vxdg, vxvol, fsck, mount... ARGH! -- I'm one I/O error away from quitting my job in tears!)

Sun support has suggested a FOS upgrade, which will be done first thing Sunday morning. But in the meantime, I would appreciate ANY IDEAS you guys could think of.

Here's an extract from messages on a Solaris 9 box:

Nov 17 10:33:50 s9host qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop OFFLINE
Nov 17 10:33:50 s9host qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(0): Loop ONLINE
Nov 17 10:33:51 s9host fctl: [ID 517869 kern.warning] WARNING: 159947=>fp(1)::N_x Port with D_ID=30c00, PWWN=210000e08b861132 reappeared in fabric
Nov 17 10:33:53 s9host scsi: [ID 107833 kern.warning] WARNING: /pci@1e,600000/SUNW,qlc@2,1/fp@0,0/ssd@w2100001b320e8e82,1 (ssd38):
Nov 17 10:33:53 s9host  SCSI transport failed: reason 'tran_err': retrying command
Nov 17 10:37:30 s9host fctl: [ID 517869 kern.warning] WARNING: 159974=>fp(1)::GPN_ID for D_ID=10100 failed
Nov 17 10:37:30 s9host fctl: [ID 517869 kern.warning] WARNING: 159975=>fp(1)::N_x Port with D_ID=10100, PWWN=2100001b320e8e82 disappeared from fabric
Nov 17 10:37:33 s9host fctl: [ID 517869 kern.warning] WARNING: 159990=>fp(1)::N_x Port with D_ID=10100, PWWN=2100001b320e8e82 reappeared in fabric
Nov 17 10:37:33 s9host fctl: [ID 517869 kern.warning] WARNING: 159999=>fp(1)::GPN_ID for D_ID=10500 failed
Nov 17 10:37:33 s9host fctl: [ID 517869 kern.warning] WARNING: 160000=>fp(1)::N_x Port with D_ID=10500, PWWN=2100001b320e9884 disappeared from fabric
Nov 17 10:37:53 s9host scsi: [ID 243001 kern.info] /pci@1e,600000/SUNW,qlc@2,1/fp@0,0 (fcp1):
Nov 17 10:37:53 s9host  offlining lun=1 (trace=0), target=10500 (trace=2800004)
Nov 17 10:37:53 s9host scsi: [ID 243001 kern.info] /pci@1e,600000/SUNW,qlc@2,1/fp@0,0 (fcp1):
Nov 17 10:37:53 s9host  offlining lun=0 (trace=0), target=10500 (trace=2800004)
Nov 17 10:39:16 s9host fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10100 failed state=Timeout, reason=Hardware Error
Nov 17 10:39:16 s9host fctl: [ID 517869 kern.warning] WARNING: 160015=>fp(1)::PLOGI to 10100 failed. state=c reason=1.
Nov 17 10:39:16 s9host scsi: [ID 243001 kern.warning] WARNING: /pci@1e,600000/SUNW,qlc@2,1/fp@0,0 (fcp1):
Nov 17 10:39:16 s9host  PLOGI to D_ID=0x10100 failed: State:Timeout, Reason:Hardware Error. Giving up
Nov 17 10:39:36 s9host scsi: [ID 107833 kern.notice]    Unexpected SCSI status received: 0x4
Nov 17 10:39:36 s9host scsi: [ID 243001 kern.info] /pci@1e,600000/SUNW,qlc@2,1/fp@0,0 (fcp1):
Nov 17 10:39:36 s9host  offlining lun=1 (trace=0), target=10100 (trace=2800101)
Nov 17 10:39:36 s9host scsi: [ID 243001 kern.info] /pci@1e,600000/SUNW,qlc@2,1/fp@0,0 (fcp1):
Nov 17 10:39:36 s9host  offlining lun=0 (trace=0), target=10100 (trace=2800101)
Nov 17 10:41:00 s9host fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10100 failed state=Timeout, reason=Hardware Error
Nov 17 10:41:00 s9host fctl: [ID 517869 kern.warning] WARNING: 160024=>fp(1)::PLOGI to 10100 failed. state=c reason=1.
Nov 17 10:41:00 s9host fctl: [ID 517869 kern.warning] WARNING: 160050=>fp(1)::N_x Port with D_ID=10500, PWWN=2100001b320e9884 reappeared in fabric
Nov 17 10:42:43 s9host fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10100 failed state=Timeout, reason=Hardware Error
Nov 17 10:42:43 s9host fctl: [ID 517869 kern.warning] WARNING: 160076=>fp(1)::PLOGI to 10100 failed. state=c reason=1.
Nov 17 10:42:43 s9host scsi: [ID 243001 kern.warning] WARNING: /pci@1e,600000/SUNW,qlc@2,1/fp@0,0 (fcp1):
Nov 17 10:42:43 s9host  PLOGI to D_ID=0x10100 failed: State:Timeout, Reason:Hardware Error. Giving up
Nov 17 10:44:27 s9host fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10100 failed state=Timeout, reason=Hardware Error
Nov 17 10:44:27 s9host fctl: [ID 517869 kern.warning] WARNING: 160085=>fp(1)::PLOGI to 10100 failed. state=c reason=1.
Nov 17 10:46:11 s9host fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10500 failed state=Timeout, reason=Hardware Error
Nov 17 10:46:11 s9host fctl: [ID 517869 kern.warning] WARNING: 160098=>fp(1)::PLOGI to 10500 failed. state=c reason=1.
Nov 17 10:46:11 s9host scsi: [ID 243001 kern.warning] WARNING: /pci@1e,600000/SUNW,qlc@2,1/fp@0,0 (fcp1):
Nov 17 10:46:11 s9host  PLOGI to D_ID=0x10500 failed: State:Timeout, Reason:Hardware Error. Giving up
Nov 17 10:47:54 s9host fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10500 failed state=Timeout, reason=Hardware Error
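
In case it helps anyone correlate the events, here is a rough Python sketch that tallies, per D_ID, how often a port disappeared, reappeared, or failed PLOGI. The regular expressions are guesses based only on the message formats in the extract above, and the names summarize, PORT_EVENT and PLOGI_FAIL are made up for illustration; other driver versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived only from the fctl/fp messages in the extract above;
# they are assumptions, not an official description of the log format.
PORT_EVENT = re.compile(
    r"N_x Port with D_ID=([0-9a-fA-F]+), PWWN=([0-9a-fA-F]+) "
    r"(disappeared from|reappeared in) fabric"
)
# Matches the fp NOTICE line only ("fp(1): PLOGI ..."), not the fctl
# duplicate ("fp(1)::PLOGI ..."), so each failure is counted once.
PLOGI_FAIL = re.compile(r"fp\(\d+\): PLOGI to ([0-9a-fA-F]+) failed")


def summarize(lines):
    """Count fabric events per (D_ID, event) over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = PORT_EVENT.search(line)
        if m:
            event = "disappeared" if m.group(3).startswith("disappeared") else "reappeared"
            counts[(m.group(1), event)] += 1
            continue
        m = PLOGI_FAIL.search(line)
        if m:
            counts[(m.group(1), "plogi_failed")] += 1
    return counts


sample = [
    "Nov 17 10:37:30 s9host fctl: [ID 517869 kern.warning] WARNING: 159975=>fp(1)::N_x Port with D_ID=10100, PWWN=2100001b320e8e82 disappeared from fabric",
    "Nov 17 10:39:16 s9host fp: [ID 517869 kern.info] NOTICE: fp(1): PLOGI to 10100 failed state=Timeout, reason=Hardware Error",
]
for (d_id, event), n in sorted(summarize(sample).items()):
    print(d_id, event, n)
```

To run it against a whole messages file, pass the open file object to summarize() in place of the sample list. Grouping by D_ID should show whether the drops cluster on particular target ports or are spread across the fabric.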

Thanks a bucket!
- Mike.

