SUMMARY(LONG): Locating bad memory chip in 3/160 memory board

From: John Valdes (valdes@geosun.uchicago.edu)
Date: Tue Jun 09 1992 - 18:03:38 CDT


Hello all,

About a month ago, I had asked:

>A bit in one of the memory boards in one of our 3/160's has apparently
>gone bad. Booting this machine with the 'diag' switch set gives the
>error
>
> Err 11: Parity Error at 0x0063C004
> Exp 0x5A972C5A, Obs 0x5A172C5A, Xor 0x00800000
>
>during the memory test. Anyone know how to find the offending chip
>given the address reported in the error? The board in question is a
>Sun 4MB memory board, Sun part# 501-1132 (Rev 52), is located in the
>second slot of the card cage, next to the CPU board (which has 4MB of
>memory itself), and is populated with an 8x18 array of Mitsubishi
>MN41256-12 chips. Anyone know the specs for this chip (ie., static or
>dynamic, size, speed (120ns I assume))?

With the help of this newsgroup, I was able to track down the offending
chip and replace it. The board is now working again with all 4MB in our
3/160. Total repair cost: $4 (I replaced two chips, the first by
mistake!) and probably 10 cents in solder.

I've attached below a summary of how to track down the faulty chip which
I prepared from the information I received.

Many thanks to those who responded:

uug@cpsc.ucalgary.ca (William Graham)
nanook@eskimo.celestial.com (Robert Dinse)
tarsa@elijah.mv.com (Greg ...)
boyle@wrl.dec.com (Patrick Boyle)

And any others I've left out!

John Valdes Department of the Geophysical Sciences
valdes@geosun.uchicago.edu University of Chicago

---------------------------------------------------------------------------

                      Repairing a Sun 4MB Memory Board

                                John Valdes
                           University of Chicago
                         valdes@geosun.uchicago.edu
                                   6/4/92

Introduction
------------

  This report is based on my experiences and the information I received from
the newsgroup comp.sys.sun.hardware when a Sun Microsystems Inc. 4MB memory
board, part# 501-1132, in one of our Sun 3/160's failed. The information
below may also apply to Sun's 2MB memory board, part# 501-1131, as this
board may have the same chip arrangement (I haven't verified this, however).
These boards were used in Sun's 3/1x0 series of computers (and perhaps in
others). The information given below is correct to the best of my
knowledge, but of course, it is presented without warranty and neither I nor
the U. of Chicago can assume any responsibility for anything that may happen
as a result of it (hey, it worked for me...!). I would be glad to receive
any corrections or clarifications.

Diagnosing the Problem
----------------------

  If a chip in one of the memory boards in your Sun3 fails, you will most
likely discover it while the computer is running unix. Your system will
probably panic with an "unknown memory error" similar to:

  vmunix: Memory Error Register d4<INTR,INTENA,CHECK,ERR16>
          d4<INTR,INTENA,CE_ENA,WBACKERR>
  vmunix: DVMA = 0, context 4, virtual address = de06008
  vmunix: pme = d300031e, physical address = 63c008
  vmunix: panic: unknown memory error
  vmunix: syncing file systems...

The exact message, of course, will depend on the location of the error, what
the machine was doing at the time, and the version of SunOS running on your
machine.

  To find the exact address of the error, you will need to boot your Sun3 in
DIAG mode in order to run a complete memory test. If not already there,
bring your machine down to the PROM monitor (the prompt will be a '>') and
move the switch on the back of the CPU board from 'NORM' to 'DIAG'. Then,
with a terminal attached to serial port A on the CPU board (set the terminal
characteristics to 9600 baud, 8 data bits, 1 stop bit, no parity;
alternatively, you can connect a terminal to serial port B at 1200 baud),
either type 'k2' at the monitor prompt or press the 'RESET' button on the
CPU board to restart the machine. The machine will then run through a
series of self tests, as it normally does, but with the switch set to DIAG,
the machine will also echo the progress of the self tests to the terminal.
For 3/1x0 machines your terminal display should look something like:

  Boot PROM Selftest

    PROM Checksum Test
    DVMA Reg Test
    Context Reg Test
    ...
    Parity Test
    Memory Size = 0x00000xxx Megabytes
    Memory Test (testing xxxxxxxx MBytes)

The last test to run is the memory test. If your machine doesn't make it
this far, then something else is wrong with your system. Consult Sun's
"PROM User's Manual" for a description of the test which failed.

  Once at the memory test, the firmware will test all of the memory
installed in the machine, regardless of the current setting of the EEPROM
memory test parameter. If your system does indeed have a bad memory chip,
the memory test will fail with an error similar to

  Err 11: Parity Error at 0x0063C004
  Exp 0x5A972C5A, Obs 0x5A172C5A, Xor 0x00800000

and the system will continue to loop at this point (and, hence, any other
bad addresses following this one will not be found). The exact message may
depend on your PROM level. In any case, the message will tell you the start
address of the 4-byte word containing the error, the value written to the
word at that address (Exp), and the value read from the address (Obs).
Write down the message, as you will need the address, the xor value and the
error number to determine the defective chip. The xor value indicates which
data bit is in error (if no xor value is reported, simply compute it from
the Exp and Obs values). If the xor value is zero (0x00000000), then the
error is in a parity bit. In this case the error number will indicate which
bit of the four parity bits in the word is bad (the error number should be
0xd8, 0xd4, 0xd2 or 0xd1).
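
  As a rough illustration, the xor value and the position of the failing
bit(s) can be computed from the Exp and Obs values as in the Python sketch
below. The function name failing_bits is only illustrative; it does not come
from the PROM or from Sun's manuals.

```python
def failing_bits(exp, obs):
    """Return (xor, bits): the xor of the two 32-bit words and the
    list of bit positions (0 = least significant) that differ."""
    xor = (exp ^ obs) & 0xFFFFFFFF
    bits = [b for b in range(32) if xor & (1 << b)]
    return xor, bits

# Values taken from the example error message above.
xor, bits = failing_bits(0x5A972C5A, 0x5A172C5A)
print(hex(xor), bits)   # prints: 0x800000 [23]
```

An empty list of bits corresponds to the xor = 0 case, i.e. a parity-bit
error.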

  If the memory self-test completes successfully without reporting any
errors, run the self-test two or three more times. If all tests succeed,
credit the initial error to cosmic rays, reboot the machine (don't forget
to set the DIAG switch back to NORM), and get back to work!

Locating the Offending Chip
---------------------------

  Given the address and the xor value reported by the memory self-test, it
is fairly straightforward to locate the bad memory chip. From the address
first determine which memory board (if you have more than one) contains the
bad chip. The board with the bad chip is the one with the largest base
address that does not exceed the address of the error. Memory is mapped
sequentially between boards, so if your system has two memory boards, for
example, with 4MB installed on the CPU board, 2MB of memory on the first
memory board, and 4MB on the second memory board, then the 2MB board has a
base address of 0x400000, and the 4MB board has a base address of 0x600000
(the CPU board always has a base address of 0x0, of course). Typically, the
memory boards are installed from left to right in the card cage (when
looking at the system from the back) in order of increasing base address, so
that the first memory board will be the first memory board located to the
right of the CPU board (there may be an FPA or graphics board between the
CPU board and the first memory board), the second memory board will be the
second memory board to the right of the CPU board, and so on. Ultimately,
however, the base address of the board is determined by a set of switches or
jumpers on the board itself, so it may be possible to have the order of the
memory boards shuffled in the card cage. Assuming that the boards are
installed in order, then for each one, subtract the base address of the
board from the address reported by the memory error, and if the result lies
within the capacity of the board, then that's the board with the bad chip.
If your system has 4MB of memory on the CPU board, and the memory error is
located at an address less than 0x400000 (or at an address less than
0x200000 if there's only 2MB of memory on the CPU board), then the bad chip
is on the CPU board itself. In this case, the information below will be of
little use in helping you find it.
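
  The board search described above can be sketched in Python as follows.
This assumes the boards are listed in order of increasing base address with
no holes in the address space; the name locate_board and the list-of-sizes
argument are illustrative choices only.

```python
MB = 0x100000

def locate_board(addr, capacities_mb):
    """Return (index, base, offset) for the board that holds addr.

    capacities_mb lists the board sizes in MB in order of increasing
    base address, with the CPU board at index 0.  Memory is assumed
    to be mapped sequentially with no holes.
    """
    base = 0
    for index, size_mb in enumerate(capacities_mb):
        size = size_mb * MB
        if base <= addr < base + size:
            return index, base, addr - base
        base += size
    raise ValueError("address 0x%X is beyond installed memory" % addr)

# Example configuration from the text: 4MB CPU board, 2MB board, 4MB board.
index, base, offset = locate_board(0x63C004, [4, 2, 4])
print(index, hex(base), hex(offset))   # prints: 2 0x600000 0x3c004
```

Index 0 means the error is on the CPU board itself. The base address found
this way should still be checked against the switch settings on the board.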

  Once you've located the board you believe to contain the bad chip, remove
it from the card cage (be sure that the power is OFF before removing it!!!)
and verify that the base address of the board is set to what you think it
should be by checking the switch settings. For the Sun 4MB board (and the
2MB board), two DIP switches, U3118 and U3119, located as shown below, set
the base address of the board.

 
        V    |
        M  +-|
        E  | |
           | |
        C  | |      +----- short for 2MB Board
        o  | |      |
        n  | |      |  +-- short for 4MB Board
        n  | |      |  |
        e  | |      V  V
        c  | |      o  o   +------+ +------+ +------+ +------+
        t  | |      I      | DIP  | | DIP  | | DIP  | | DIP  |  . . .
        o  | |      o  o   +------+ +------+ +------+ +------+
        r  +-|     jumper
             |
             |    +----+   +----+
             |    |    |   |    |
             |    |    |   |    |
             |    |    |   |    |
             |    +----+   +----+
             |    U3118    U3119
             |
                                      
        Location of switches U3118 and U3119 (Based on diagram from
             "Sun 3/160 Hardware Installation Manual," pg. 50)

The switches will set the base address of the board as given in the table
below.

           +----------------------------------------------------+
           |  Base Address  |  U3118 setting^ |  U3119 setting^ |
           |----------------|-----------------|-----------------|
           |    0x200000    |      2 ON       |      3 ON       |
           |    0x400000    |      3 ON       |      4 ON       |
           |    0x600000    |      4 ON       |      5 ON       |
           |    0x800000    |      5 ON       |      6 ON       |
           |    0xA00000    |      6 ON       |      7 ON       |
           |    0xC00000    |      7 ON       |      8 ON       |
           +----------------------------------------------------+
            ^switches other than the one specified are OFF
                                      
             Switch settings for 4MB board (Based on table from
             "Sun 3/160 Hardware Installation Manual," pg. 51)

(The switch settings for Sun's 2MB board are:

                    +----------------------------------+
                    |  Base Address  |  U3118 setting  |
                    |----------------|-----------------|
                    |    0x200000    |      2 ON       |
                    |    0x400000    |      3 ON       |
                    |    0x600000    |      4 ON       |
                    |    0x800000    |      4 ON       |
                    |    0xA00000    |      4 ON       |
                    |    0xC00000    |      4 ON       |
                    |    0xE00000    |      4 ON       |
                    +----------------------------------+
                                      
             Switch settings for 2MB board (Based on table from
             "Sun 3/160 Hardware Installation Manual," pg. 51)

The settings for 0x800000 through 0xE00000 look odd to me, but this is what
the manual shows.)
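
  For convenience, the 4MB board table above can be kept as a small lookup,
as in the Python sketch below. The dictionary is simply a transcription of
that table; the names SWITCHES_4MB and switches_for_base are illustrative,
and the 2MB table is left out on purpose since its upper entries look
doubtful.

```python
# Transcribed from the 4MB board table: base address -> (U3118, U3119)
# switch number to turn ON; all other switches on each DIP are OFF.
SWITCHES_4MB = {
    0x200000: (2, 3),
    0x400000: (3, 4),
    0x600000: (4, 5),
    0x800000: (5, 6),
    0xA00000: (6, 7),
    0xC00000: (7, 8),
}

def switches_for_base(base):
    """Return the (U3118, U3119) switch numbers to set ON for a 4MB board."""
    try:
        return SWITCHES_4MB[base]
    except KeyError:
        raise ValueError("no table entry for base address 0x%X" % base)

u3118, u3119 = switches_for_base(0x400000)
print("U3118: %d ON, U3119: %d ON" % (u3118, u3119))
# prints: U3118: 3 ON, U3119: 4 ON
```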

If the board you removed wasn't the last memory board in the system, you
should reconfigure the other memory boards to plug the hole in the address
space left by the one you removed. To do this, you'll have to set the
appropriate switches on one or more of the boards to set the correct base
address for it. It is also a good idea to physically order the boards by
base address as mentioned previously--this may actually be necessary in
order for the system to work.

  With the suspect memory board removed and the other memory boards
correctly configured, and with the DIAG switch still set on the CPU board,
power up the system in order to run memory diagnostics again. If the system
now passes the memory test, then you've found the correct board and the
others are properly configured. If the memory test fails with the same
error at the same location (or at a location offset by a whole number of
MB), then you've removed the wrong board; power down the system and try
again (if the error was offset by a whole number of MB, then the bad board
is one of the ones you've reconfigured). If the memory test fails with a
completely different error, then you may have another bad memory board, or
perhaps a problem with the VME backplane or VME connectors to the memory
board. After the system passes the diagnostic self-tests, and if you decide
to fully reboot the machine without the memory board, be sure to set the
DIAG switch back to NORM and to adjust the EEPROM values for memory size
(q14) and memory to test (q15) appropriately.

  Finally, you're ready to locate the bad chip on the board itself. The
501-1132 4MB memory board has an 8 row by 18 column array of memory chips.
The rows are indexed by letter (A, B, C, D, E, F, H, J) while the columns
are indexed by number (4-21) as silk-screened onto the board. The 8 rows
can be subdivided into 4 "row pairs" or "banks", which are similar to SIMM
banks in that each 4-byte word is contained within a single bank. The four
banks are formed by the row pairs as follows:

   Bank 0: Row pair J,H
   Bank 1: Row pair F,E
   Bank 2: Row pair D,C
   Bank 3: Row pair B,A

The bits for each word in a bank are arranged among the columns according to
 
  21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  <- column number
  -----------------------------------------------------

   P 16 17 18 19 20 21 22 23  P 24 25 26 27 28 29 30 31  <- bit numbers of
                                                            rows J,F,D,B
   P  0  1  2  3  4  5  6  7  P  8  9 10 11 12 13 14 15  <- bit numbers of
                                                            rows H,E,C,A

where P stands for a parity bit. Bit 31 is the most significant bit in the
word, and bit 0 the least.
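
  The bit-to-column table above can be transcribed directly into a lookup,
as in this Python sketch. The lists mirror the two rows of the table; the
names COLUMNS, UPPER, LOWER and column_for_bit are illustrative only.

```python
COLUMNS = list(range(21, 3, -1))   # silk-screened columns 21, 20, ..., 4

# "P" marks a parity chip; the numbers are data bit positions.
UPPER = ["P"] + list(range(16, 24)) + ["P"] + list(range(24, 32))  # rows J,F,D,B
LOWER = ["P"] + list(range(0, 8)) + ["P"] + list(range(8, 16))     # rows H,E,C,A

def column_for_bit(bit):
    """Return (row_group, column) for a data bit in 0..31."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in 0..31")
    for group, row in (("JFDB", UPPER), ("HECA", LOWER)):
        if bit in row:
            return group, COLUMNS[row.index(bit)]

print(column_for_bit(23))   # prints: ('JFDB', 13)
```

Which of the four rows in the group applies depends on the bank that
contains the failing address.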

  The 4MB on the board are divided into 0x800 byte (2K) partitions and
distributed circularly among the four banks. The first 2K is mapped to
bank 0, the next 2K to bank 1, the next to bank 2, then bank 3, then back
to bank 0, bank 1, and so on. The mapping for the first 256K is given
below.
 
  Bank 0 (Row pair J,H):
    00000->007FF, 02000->027FF, 04000->047FF,
    06000->067FF, 08000->087FF, 0A000->0A7FF,
    0C000->0C7FF, 0E000->0E7FF
    10000->107FF, 12000->127FF, 14000->147FF,
    16000->167FF, 18000->187FF, 1A000->1A7FF,
    1C000->1C7FF, 1E000->1E7FF
    20000->207FF, 22000->227FF, 24000->247FF,
    26000->267FF, 28000->287FF, 2A000->2A7FF,
    2C000->2C7FF, 2E000->2E7FF
    30000->307FF, 32000->327FF, 34000->347FF,
    36000->367FF, 38000->387FF, 3A000->3A7FF,
    3C000->3C7FF, 3E000->3E7FF

  Bank 1 (Row pair F,E):
    00800->00FFF, 02800->02FFF, 04800->04FFF,
    06800->06FFF, 08800->08FFF, 0A800->0AFFF,
    0C800->0CFFF, 0E800->0EFFF
    10800->10FFF, 12800->12FFF, 14800->14FFF,
    16800->16FFF, 18800->18FFF, 1A800->1AFFF,
    1C800->1CFFF, 1E800->1EFFF
    20800->20FFF, 22800->22FFF, 24800->24FFF,
    26800->26FFF, 28800->28FFF, 2A800->2AFFF,
    2C800->2CFFF, 2E800->2EFFF
    30800->30FFF, 32800->32FFF, 34800->34FFF,
    36800->36FFF, 38800->38FFF, 3A800->3AFFF,
    3C800->3CFFF, 3E800->3EFFF

  Bank 2 (Row pair D,C):
    01000->017FF, 03000->037FF, 05000->057FF,
    07000->077FF, 09000->097FF, 0B000->0B7FF,
    0D000->0D7FF, 0F000->0F7FF
    11000->117FF, 13000->137FF, 15000->157FF,
    17000->177FF, 19000->197FF, 1B000->1B7FF,
    1D000->1D7FF, 1F000->1F7FF
    21000->217FF, 23000->237FF, 25000->257FF,
    27000->277FF, 29000->297FF, 2B000->2B7FF,
    2D000->2D7FF, 2F000->2F7FF
    31000->317FF, 33000->337FF, 35000->357FF,
    37000->377FF, 39000->397FF, 3B000->3B7FF,
    3D000->3D7FF, 3F000->3F7FF

  Bank 3 (Row pair B,A):
    01800->01FFF, 03800->03FFF, 05800->05FFF,
    07800->07FFF, 09800->09FFF, 0B800->0BFFF,
    0D800->0DFFF, 0F800->0FFFF
    11800->11FFF, 13800->13FFF, 15800->15FFF,
    17800->17FFF, 19800->19FFF, 1B800->1BFFF,
    1D800->1DFFF, 1F800->1FFFF
    21800->21FFF, 23800->23FFF, 25800->25FFF,
    27800->27FFF, 29800->29FFF, 2B800->2BFFF,
    2D800->2DFFF, 2F800->2FFFF
    31800->31FFF, 33800->33FFF, 35800->35FFF,
    37800->37FFF, 39800->39FFF, 3B800->3BFFF,
    3D800->3DFFF, 3F800->3FFFF
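
  The listing above follows a regular pattern and can be regenerated (or
extended past 256K) with a few lines of Python. This is only a sketch based
on the 2K round-robin layout described above; the names PARTITION, BANKS and
bank_ranges are illustrative.

```python
PARTITION = 0x800   # 2K bytes per partition
BANKS = 4

def bank_ranges(bank, limit=0x40000):
    """List (first, last) board-relative addresses mapped to a bank,
    for all partitions that start below limit (default: first 256K)."""
    stride = PARTITION * BANKS   # 0x2000 between partitions of one bank
    return [(start, start + PARTITION - 1)
            for start in range(bank * PARTITION, limit, stride)]

for first, last in bank_ranges(2)[:3]:
    print("%05X->%05X" % (first, last))
# prints:
# 01000->017FF
# 03000->037FF
# 05000->057FF
```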

  You can determine which bank contains the bad chip using the formula

    bank = ( addr / 0x800 ) % 0x4

where addr is the address of the error, '/' is the integer division operator
and '%' is the modulus operator. Then from the xor value of the error, you
can finally locate the bad chip using the bit-to-column mapping given above;
simply convert the xor value to binary and see which bit contains the '1'.
If the xor value is 0, then the error is in one of the four parity bits for
the word. In this case, use the error number to find the chip as follows:

    Error# 0xd8: parity bit for MSByte: {J,F,D,B}12
    Error# 0xd4: parity bit for MSByte-1: {J,F,D,B}21
    Error# 0xd2: parity bit for MSByte-2: {H,E,C,A}12
    Error# 0xd1: parity bit for LSByte: {H,E,C,A}21
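
  For the parity case, the bank picks one row out of each set in braces
above. A minimal Python sketch of that lookup, with illustrative names
(BANK_ROWS, PARITY_CHIPS, parity_chip), might look like this:

```python
# Bank number -> (row holding bits 16-31, row holding bits 0-15).
BANK_ROWS = {0: ("J", "H"), 1: ("F", "E"), 2: ("D", "C"), 3: ("B", "A")}

# Error number -> (index into the row pair, column of the parity chip).
PARITY_CHIPS = {0xD8: (0, 12), 0xD4: (0, 21), 0xD2: (1, 12), 0xD1: (1, 21)}

def parity_chip(err, bank):
    """Return the chip label (e.g. 'J12') for a parity error number."""
    which, column = PARITY_CHIPS[err]
    return "%s%d" % (BANK_ROWS[bank][which], column)

print(parity_chip(0xD8, 0))   # prints: J12
```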

For example, for the error

    Err 11: Parity Error at 0x0063C004
    Exp 0x5A972C5A, Obs 0x5A172C5A, Xor 0x00800000

bank = 0 and bit = 23. Hence, the bad chip is located at J13 on the memory
board.
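
  The whole data-bit lookup can be put together in one short Python sketch.
The column arithmetic is my reading of the bit-to-column table above, and
the names BANK_ROWS and find_chip are illustrative. Because board base
addresses are multiples of 0x200000, the bank comes out the same whether the
absolute or the board-relative address is used.

```python
# Bank number -> (row holding bits 16-31, row holding bits 0-15).
BANK_ROWS = {0: ("J", "H"), 1: ("F", "E"), 2: ("D", "C"), 3: ("B", "A")}

def find_chip(addr, xor):
    """Return chip labels such as 'J13' for each data bit set in xor."""
    bank = (addr // 0x800) % 4
    upper_row, lower_row = BANK_ROWS[bank]
    chips = []
    for bit in range(32):
        if not xor & (1 << bit):
            continue
        row = upper_row if bit >= 16 else lower_row
        i = bit % 16   # position within the 16-bit half of the word
        # Columns 20..13 hold the low 8 bits of the half, 11..4 the high 8.
        column = 20 - i if i < 8 else 19 - i
        chips.append("%s%d" % (row, column))
    return chips

print(find_chip(0x0063C004, 0x00800000))   # prints: ['J13']
```

An empty result means xor was zero, in which case the parity chips are the
suspects instead.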

Replacing the Chip
------------------

  I won't say too much about replacing the chip itself. The memory chips
used on the 4MB board are 256Kx1, 120ns, DIP DRAM and are very cheap and
easy to find.

  When removing the chip from the board, it is best to clip the leads on the
chip close to the body, and then remove the individual leads from the board
one at a time, heating them from the bottom and pulling them up from the
top. Be sure to be gentle when clipping and pulling the leads to prevent
breaking any of the traces on the board. Finally, after removing all the
leads, clean away all the excess solder with a solder sucker, and solder the
new chip into place.

Finishing Up
------------

  Once you've replaced the chip, reinstall the board in the system--
resetting its base address, if necessary--and run the memory diagnostics on
it. If the test passes, congratulations! You've just repaired your board!
If the test fails with another error, repeat all of the above until all of
the errors are gone (remember, the self-test can only diagnose one error at
a time). If the test gives the same error (possibly offset by a few MB if
you've reconfigured the board), then either you've replaced the wrong chip,
or something else is wrong with the board. Double check your work to make
sure you located the chip correctly. (I did this once!)

  With the board finally working again, set the DIAG switch on the CPU board
back to NORM, adjust the EEPROM values in q14 and q15 if necessary, and
reboot. You're now back in business.

Acknowledgements
----------------

  I gathered most of this information from Sun's "PROM User's Manual" and
"Sun 3/160 Hardware Installation Manual". Many thanks are also due to

  William Graham, uug@cpsc.ucalgary.ca
  Robert Dinse, nanook@eskimo.celestial.com
  Greg (sorry, I don't have your last name), tarsa@elijah.mv.com
  Patrick Boyle, boyle@wrl.dec.com

from whom I received most of the information on tracking down the chip
location given its memory address.

----------------------------------------------------------


