This page is obsolete
Hardware maintenance is now logged at http://glpi.aglt2.org/
MSU Hardware Repairs
Until we have a better system, I'm recording hardware repairs here. More general service or changes are also recorded in a section below.
Kernel Panics by Date
Kernel crashes not attributed to a specific hardware problem. (Machine check exceptions should go in the hardware section below.)
- Jan 2008. dc2-102-4 has crashed a few times while running ATLAS jobs.
- Jan ??? 2008. dc2-102-5 crashed; rebuilt.
- Jan 2008. Found lots of "Web100" crashes. Changed to a different kernel to resolve them.
- Mar 10, 2008. dc2-102-4 "nfs_access_zap_cache". Another kernel crash on this stupid node
Hardware Errors and/or Parts Replacement by Node
dc2-102-4
When the mpmemory program is run from the Dell Diagnostic, it says that DIMM4 and DIMM8 transitioned...
SEL shows errors:
bash-3.00# ipmitool -H 10.10.3.251 -U root -P PASSWORD sel list
1 | 02/11/2008 | 03:57:38 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
2 | 02/11/2008 | 05:13:15 | Memory #0x1b | Transition to Non-critical from OK
3 | Pre-Init Time-stamp | Physical Security #0x73 | General Chassis intrusion | Asserted
4 | Pre-Init Time-stamp | Physical Security #0x73 | General Chassis intrusion | Deasserted
5 | 02/11/2008 | 06:39:04 | Memory #0x1b | Transition to Non-critical from OK
6 | Pre-Init Time-stamp | Physical Security #0x73 | General Chassis intrusion | Asserted
7 | Pre-Init Time-stamp | Physical Security #0x73 | General Chassis intrusion | Deasserted
bash-3.00# ipmitool -H 10.10.3.251 -U root -P PASSWORD sel elist
1 | 02/11/2008 | 03:57:38 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
2 | 02/11/2008 | 05:13:15 | Memory Mem ECC Warning | Transition to Non-critical from OK
3 | Pre-Init Time-stamp | Physical Security Intrusion | General Chassis intrusion | Asserted
4 | Pre-Init Time-stamp | Physical Security Intrusion | General Chassis intrusion | Deasserted
5 | 02/11/2008 | 06:39:04 | Memory Mem ECC Warning | Transition to Non-critical from OK
6 | Pre-Init Time-stamp | Physical Security Intrusion | General Chassis intrusion | Asserted
7 | Pre-Init Time-stamp | Physical Security Intrusion | General Chassis intrusion | Deasserted
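The SEL listings above are pipe-delimited, so the memory events can be picked out with a short script. What follows is only a minimal sketch, not something that was run on the cluster: the function names `parse_sel_line` and `memory_events` are made up for illustration, and the field layout is assumed from the `sel elist` output pasted above (entries with "Pre-Init Time-stamp" have no separate date and time columns).

```python
# Sample lines copied from the "sel elist" output above.
SAMPLE_SEL = """\
1 | 02/11/2008 | 03:57:38 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
2 | 02/11/2008 | 05:13:15 | Memory Mem ECC Warning | Transition to Non-critical from OK
3 | Pre-Init Time-stamp | Physical Security Intrusion | General Chassis intrusion | Asserted
4 | Pre-Init Time-stamp | Physical Security Intrusion | General Chassis intrusion | Deasserted
5 | 02/11/2008 | 06:39:04 | Memory Mem ECC Warning | Transition to Non-critical from OK
"""


def parse_sel_line(line):
    """Split one pipe-delimited SEL line into a dict, or return None if it is not a record."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 4 or not fields[0]:
        return None
    record = {"id": fields[0]}
    if fields[1] == "Pre-Init Time-stamp":
        # Assumed layout: no date/time columns for pre-init entries.
        record["timestamp"] = None
        rest = fields[2:]
    else:
        record["timestamp"] = fields[1] + " " + fields[2]
        rest = fields[3:]
    record["sensor"] = rest[0]
    record["event"] = rest[1] if len(rest) > 1 else ""
    record["state"] = rest[2] if len(rest) > 2 else ""
    return record


def memory_events(lines):
    """Return the parsed records whose sensor field starts with 'Memory'."""
    records = (parse_sel_line(line) for line in lines)
    return [r for r in records if r is not None and r["sensor"].startswith("Memory")]


if __name__ == "__main__":
    for rec in memory_events(SAMPLE_SEL.splitlines()):
        print(rec["id"], rec["timestamp"], rec["sensor"], rec["event"])
```

Matching on the sensor prefix "Memory" works for both the `sel list` form ("Memory #0x1b") and the `sel elist` form ("Memory Mem ECC Warning") shown above.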
dc2-104-?
- Dec 15, 2007 Replaced DIMM
msufs01
- Jan 11, 2008 Replaced left (active) EMM on the bottom MD1000 shelf
msufs04
- Jan 16, 2008 Replaced PERC 5/E card
Cluster Service or Changes
2008
- Feb 28, 2008
- Put one Gore cable into stacking ring
- Gore 10GE cables on msufs04 and 05
- Re-zip-tied power cords on compute nodes to make them truly secure
- Mar 7, 2008 Installed PERC 6/E cards in msufs01-msufs05. Rearranged SAS cables to bundle at the left rear side. Labeled shelves and cables A through D.
- Mar 7, 2008 Put Gore CX4 cables on msufs01-msufs03
--
TomRockwell - 16 Jan 2008