This page is obsolete. Hardware maintenance is now logged at http://glpi.aglt2.org/

MSU Hardware Repairs

Until we have a better system, I'm recording hardware repairs here. More general service work and cluster changes are recorded in their own section below.

Kernel Panics by Date

Kernel crashes not attributed to a specific hardware problem. (Machine check exceptions belong in the hardware errors section below.)

  • Jan 2008. dc2-102-4 has crashed a few times while running ATLAS jobs.
  • Jan ??? 2008. dc2-102-5 crashed; rebuilt.
  • Jan 2008. Found many "Web100" crashes. Switched to a different kernel to resolve them.
  • Mar 10, 2008. dc2-102-4, "nfs_access_zap_cache". Another kernel crash on this stupid node.

Hardware Errors and/or Parts Replacement by Node

dc2-102-4

When the mpmemory program is run from the Dell Diagnostic, it says that DIMM4 and DIMM8 transitioned...

The IPMI System Event Log (SEL) shows memory ECC warnings:

bash-3.00# ipmitool -H 10.10.3.251 -U root -P PASSWORD sel list
   1 | 02/11/2008 | 03:57:38 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 02/11/2008 | 05:13:15 | Memory #0x1b | Transition to Non-critical from OK
   3 | Pre-Init Time-stamp   | Physical Security #0x73 | General Chassis intrusion | Asserted
   4 | Pre-Init Time-stamp   | Physical Security #0x73 | General Chassis intrusion | Deasserted
   5 | 02/11/2008 | 06:39:04 | Memory #0x1b | Transition to Non-critical from OK
   6 | Pre-Init Time-stamp   | Physical Security #0x73 | General Chassis intrusion | Asserted
   7 | Pre-Init Time-stamp   | Physical Security #0x73 | General Chassis intrusion | Deasserted

bash-3.00# ipmitool -H 10.10.3.251 -U root -P PASSWORD sel elist
   1 | 02/11/2008 | 03:57:38 | Event Logging Disabled SEL | Log area reset/cleared | Asserted
   2 | 02/11/2008 | 05:13:15 | Memory Mem ECC Warning | Transition to Non-critical from OK
   3 | Pre-Init Time-stamp   | Physical Security Intrusion | General Chassis intrusion | Asserted
   4 | Pre-Init Time-stamp   | Physical Security Intrusion | General Chassis intrusion | Deasserted
   5 | 02/11/2008 | 06:39:04 | Memory Mem ECC Warning | Transition to Non-critical from OK
   6 | Pre-Init Time-stamp   | Physical Security Intrusion | General Chassis intrusion | Asserted
   7 | Pre-Init Time-stamp   | Physical Security Intrusion | General Chassis intrusion | Deasserted
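Listings like the ones above get long on a busy node, so it can help to filter them down to the memory events. The following is a minimal Python sketch, not part of the original procedure. It assumes the pipe-separated layout shown above, and that ipmitool prints record IDs in hexadecimal. The names `parse_sel` and `memory_events` are hypothetical helpers introduced for illustration.

```python
def parse_sel(text):
    """Parse `ipmitool sel list` / `sel elist` text into a list of dicts.

    Assumes the pipe-separated layout shown in the listings above.
    """
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue  # shell prompts, blank lines
        try:
            # Assumption: the record ID column is hexadecimal.
            record_id = int(fields[0], 16)
        except ValueError:
            continue  # not an event row
        if fields[1].startswith("Pre-Init"):
            # Rows logged before the BMC clock was set have no date/time pair.
            timestamp = None
            rest = fields[2:]
        else:
            timestamp = fields[1] + " " + fields[2]
            rest = fields[3:]
        events.append({
            "id": record_id,
            "timestamp": timestamp,
            "sensor": rest[0] if len(rest) > 0 else "",
            "event": rest[1] if len(rest) > 1 else "",
            "state": rest[2] if len(rest) > 2 else "",
        })
    return events


def memory_events(events):
    """Keep only events whose sensor column starts with 'Memory'."""
    return [e for e in events if e["sensor"].startswith("Memory")]


# Sample rows copied from the `sel list` output above.
SAMPLE = """\
   1 | 02/11/2008 | 03:57:38 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
   2 | 02/11/2008 | 05:13:15 | Memory #0x1b | Transition to Non-critical from OK
   3 | Pre-Init Time-stamp   | Physical Security #0x73 | General Chassis intrusion | Asserted
   5 | 02/11/2008 | 06:39:04 | Memory #0x1b | Transition to Non-critical from OK
"""

if __name__ == "__main__":
    for event in memory_events(parse_sel(SAMPLE)):
        print(event["id"], event["timestamp"], event["sensor"], event["event"])
```

In practice the text would come from saved ipmitool output rather than the embedded sample. The same parser handles the `sel elist` form, since only the sensor column differs.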

dc2-104-?

  • Dec 15, 2007 Replaced DIMM

dc2-104-?

  • Dec 15, 2007 Replaced DIMM

msufs01

  • Jan 11, 2008 Replaced the left (active) EMM on the bottom MD1000 shelf

msufs04

  • Jan 16, 2008 Replaced PERC 5/E card

Cluster Service or Changes

2008

  • Feb 28, 2008
    • Put one Gore cable into the stacking ring
    • Put Gore 10GE cables on msufs04 and msufs05
    • Re-zip-tied power cords on compute nodes to secure them properly
  • Mar 7, 2008 Installed PERC 6/E cards in msufs01-msufs05. Rearranged SAS cables to bundle at left rear side. Labeled shelves and cables A through D.
  • Mar 7, 2008 Put Gore CX4 cables on msufs01-msufs03

-- TomRockwell - 16 Jan 2008
Topic revision: r2 - 27 Feb 2014, JamesKoll