SUMMARY: PBM generated system error on Ultra-10

From: Michael Fernando <michael_fdo_at_yahoo.com>
Date: Tue Feb 18 2003 - 23:27:57 EST
This summary is quite late because it took a long time to make sure my
solution was okay.  

----- original query -------------
> We have an Ultra-10 that has an A1000 attached (via a PCI differential
> SCSI card).  Over the last few days, the machine has rebooted several times
> with the attached error below.  Is this a problem with the system memory, CPU
> or the PCI card?  Or something entirely different?

> unix: panic[cpu0]/thread=2a100057d60:  
> unix:  
> simba1: PBM detected parity error. 
> simba1: PBM generated system error. 
> simba0: partiy error error caused by upa address=f1001ff8 UPA bytemask=0 
> simba0: partiy error secondary error simba0: PBM detected parity error. 
> simba0: PBM generated system error. 
> pci0: PCI SERRpci0: partiy error error caused by
>    upa address=1fff1001ff8 UPA bytemask=1 
> pci0: partiy error secondary error pci-0: generated partiy error. 
> unix: 
> unix: syncing file systems...
-------------------------------------------

Thanks to Mike at Mike's list for suggesting that it was the CPU 
ecache bug.  I will try to give as much detail as possible about
this system since there doesn't seem to be a lot of info in the 
archives about this particular problem.  In the end, I think it
was a defective CPU.

<Details>
Using a known good system for spare parts, I swapped out the original
memory.  The memory swap went okay, but a day later the system completely
died: no video or drive activity.  The service contractor had to
swap in a new system board to get it working again.  Note that he
tried a new CPU (lower speed, though) before swapping the mobo, but there
was still no signal or activity.  I am not sure why the CPU change at that
point didn't fix the system.

With the new mobo (original CPU and memory, minus the SCSI PCI card) 
it ran fine for about a week with a "while (1)" loop running a few 
jobs. When I swapped it into production (now with the original SCSI 
card), it lasted for ~3 days before starting the reboot cycles.

Removed it again from production, removed the SCSI PCI card but kept the
system up, doing pretty much nothing.  Reboot cycles stopped, making the
SCSI card or the PCI riser card strong suspects.  However, about a week
or 10 days later, the reboot cycles started again.

Finally changed the CPU.  It hasn't rebooted with the above errors in
3-4 weeks.  It now has a new CPU and a new system board.  I will
not put this back into production again.  If you encounter these PBM /
pci0 parity errors, I would suggest removing the machine immediately
from any production work and starting by replacing the CPU.
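
For anyone who wants to check whether a machine has already logged this
signature, here is a minimal Python sketch.  It assumes the panic text ends
up in a log you can read line by line (on Solaris that is usually
/var/adm/messages, but treat the path as an assumption).  The function name
scan_messages and the instance pattern are purely illustrative and are not
part of any Sun tool; the signature strings are copied from the panic
output quoted above.

```python
import re

# Signature strings copied from the panic output in the original query.
SIGNATURES = (
    "PBM detected parity error",
    "PBM generated system error",
    "PCI SERR",
)

# Assumed pattern for the driver instance that prefixes a message,
# e.g. "simba1:", "pci0:" or "pci-0:".
INSTANCE_RE = re.compile(r"\b(simba\d+|pci-?\d+):")


def scan_messages(lines):
    """Return {signature: set of driver instances} for matching log lines."""
    hits = {}
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                m = INSTANCE_RE.search(line)
                instance = m.group(1) if m else "unknown"
                hits.setdefault(sig, set()).add(instance)
    return hits


if __name__ == "__main__":
    # The path is an assumption; adjust it for your system.
    with open("/var/adm/messages", errors="replace") as f:
        for sig, instances in sorted(scan_messages(f).items()):
            print(sig, "->", ", ".join(sorted(instances)))
```

A match only says the same messages were logged.  It does not say which
part is at fault; in my case it took the CPU swap to find that out.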

</Details>

-mike
Received on Tue Feb 18 23:31:49 2003
