Startup and Shutdown
This page describes the shutdown and startup procedures for the cluster.
Unplanned power problems
This will be a discussion of how to gracefully shut everything down and how to start things up. A full shutdown of 1221 is not a routine task: we've only done a few planned shutdowns, and unplanned outages are infrequent.
Recovering from an outage can be stressful. Personally, I recommend consciously working slowly and making reasonable notes of the state of the system and the actions taken. The reality is that recovering from a power or large network outage is not something that needs to be rushed: people expect that an unplanned outage will result in many hours, if not upwards of a day, of downtime. Following up to understand unplanned outages and to try to reduce their recurrence should be a priority.
Description of what happens when utility power is dropped.
The electrical circuits that feed the central PDU in 1221 and the CRACs are backed by the lab generator. An automatic transfer switch (ATS), part of the power distribution equipment in the basement of BPS, selects which input source (utility or generator power) to pass through. The ATS has a number of parameters it can test to evaluate input power stability, and timers that change its switching behavior. When utility power is interrupted, the generator is started automatically and should have stable output within about 10 s. Once the ATS "sees" stable generator power, it disconnects the utility input and connects generator power. When utility power is restored, the ATS waits several minutes and then, if the power is stable, drops the generator input and cuts back to the utility input. This cut back to utility power is currently (circa 2012) very fast (1/60 s?).
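The transfer logic described above can be sketched as a small state machine. This is only an illustrative model, not the real ATS configuration: the class name `TransferSwitch`, the 10 s generator warm-up constant, and the 300 s retransfer delay (standing in for "several minutes") are all assumptions.

```python
from dataclasses import dataclass

# Assumed timings for illustration only; the real ATS settings may differ.
GENERATOR_STABLE_S = 10.0    # generator output stable roughly 10 s after start
RETRANSFER_DELAY_S = 300.0   # stand-in for "several minutes" of stable utility


@dataclass
class TransferSwitch:
    """Toy model of the automatic transfer switch (ATS)."""
    source: str = "utility"    # input currently passed through
    utility_ok: bool = True    # last observed utility condition
    timer_s: float = 0.0       # time spent in the current utility condition

    def step(self, utility_ok: bool, dt: float) -> str:
        """Advance the model by dt seconds and return the selected source."""
        if utility_ok != self.utility_ok:
            # Utility changed state: remember it and restart the timer.
            self.utility_ok = utility_ok
            self.timer_s = 0.0
        else:
            self.timer_s += dt

        if self.utility_ok:
            # Cut back only after utility has been stable for the full delay.
            if self.source == "generator" and self.timer_s >= RETRANSFER_DELAY_S:
                self.source = "utility"
        else:
            # The load is unpowered until generator output is stable.
            if self.source == "utility" and self.timer_s >= GENERATOR_STABLE_S:
                self.source = "generator"
        return self.source
```

Stepping the model once per second with `utility_ok=False` leaves the source on "utility" (that is, the load is dark) until the warm-up time has elapsed, which matches the gap during which the compute nodes lose power.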
Note that there are two generators for BPS, one for "lab" and another for "life safety". The circuits for the CRACs and central PDU are on the lab generator. This is a 900 kW (1200 HP) diesel generator. (I think it is diesel and not natural-gas fed.) So our load (<150 kW plus whatever the CRACs need) is easily handled. The life safety generator powers things like emergency lights and the fans that blow fresh air into the stairwells. The life safety generator has a more frequent testing cycle than the lab generator (monthly vs. annual?).
Note that the alarm system is battery backed and doesn't rely on the generators.
Losing utility power at MSU has proven to be very rare. The rate seems to be less than once per year (once per 5 years may be a reasonable estimate). The experience is somewhat different from that of home or probably even most commercial power customers. Thunderstorms aren't an issue. Also, there are two power feeds from the power plant to BPS, so the electricians can service one without interrupting BPS. There is still a possibility of an issue at the power plant forcing them to drop power. The electricians tell us that the most probable utility power outages would last an hour or two. (Shorter than one hour is unlikely, and more than a few hours is also less likely.)
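As a quick sanity check on the generator numbers, the sketch below converts the horsepower rating to kilowatts and estimates the remaining headroom. The CRAC load is not given on this page, so the 100 kW used in the example is a placeholder assumption, not a measured value.

```python
KW_PER_HP = 0.7457  # one mechanical horsepower in kilowatts


def hp_to_kw(hp: float) -> float:
    """Convert mechanical horsepower to kilowatts."""
    return hp * KW_PER_HP


def headroom_kw(generator_kw: float, it_load_kw: float, crac_kw: float) -> float:
    """Generator capacity left after the IT load and the CRAC load."""
    return generator_kw - (it_load_kw + crac_kw)


print(round(hp_to_kw(1200)))        # 895, consistent with the ~900 kW rating
print(headroom_kw(900, 150, 100))   # 650, with an assumed 100 kW CRAC load
```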
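To put the "once per 5 years" estimate in perspective, one can treat outages as a Poisson process. That model is an assumption made here for illustration; the page only gives the rough rate.

```python
import math


def prob_at_least_one_outage(rate_per_year: float, years: float = 1.0) -> float:
    """P(at least one outage in `years`), assuming Poisson arrivals."""
    return 1.0 - math.exp(-rate_per_year * years)


# Rate of one outage per 5 years, taken from the rough estimate in the text.
p_one_year = prob_at_least_one_outage(1 / 5)
print(f"{p_one_year:.2f}")  # 0.18
```

Under this assumption there is roughly an 18% chance of seeing at least one utility outage in any given year.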
The sequence of events seen in 1221/1225 for a power outage is expected to go like this:
- Stable running on utility power
- Utility power interrupted
  - lights go out (not on a generator-backed circuit)
  - CRACs lose power
  - central PDU loses power
    - all compute nodes lose power
    - fan doors not powered
    - servers remain on but lose non-UPS power
    - network hardware remains on but loses non-UPS power
  - UPSs lose input, output should remain on; UPSs running on battery
    - servers on UPS should stay on
    - network on UPS should stay on
  - most running Dell equipment will speed up its fans, particularly the storage shelves, which have only half their fans powered. It will probably start beeping.
- Generator starts (within several seconds).
- Automatic transfer switch picks up generator input (within a couple of seconds after generator start), restoring power to generator-backed circuits in BPS labs.
  - CRACs regain power and restart automatically
  - central PDU regains power; it should immediately have output power available. Its cut-off breaker does not trip when power is dropped.
    - non-UPS rack PDUs have power restored; outlets behave as programmed
    - compute node outlets remain off
    - most others turned back on: servers, storage, network, fan doors
    - fan doors repowered
    - servers back to two power sources
    - network hardware back to two power sources
  - UPSs get input back; start charging batteries
    - servers back to two power sources
    - network equipment back to two power sources
  - Dell equipment with redundant power supplies should calm down. Fans resume normal speeds; beep alerts end.
- Utility power restored.
  - lights immediately back on
  - central PDU and CRACs still on generator power
- Automatic transfer switch cuts back to utility power once its conditions are met; this should take at least several minutes (built-in delay). This is meant to be almost seamless, but it is currently a trouble point for 1221. The UPSs have been responding to EPO at this point (see discussion elsewhere). Even without the EPO, the glitch at this point is expected to be large enough that computers won't ride it out without a UPS.
  - Powered equipment sees glitch
    - CRACs (may) stop/start
    - central PDU stays on, glitch passed through to output
    - servers and network hardware with dual power supplies see the glitch on one PS and may react somehow
    - rack PDUs may reboot
    - UPSs should smooth out the glitch or cover the gap on battery
- Final state
  - compute nodes still powered off (their rack PDU outlets still off)
  - servers, network, etc. back to two power sources; should have remained running.
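The sequence above can be walked through with a small model of the room. This is a simplified sketch of the "all goes as planned" case only (no EPO trip, no generator failure); the class `RoomState`, its field names, and the event names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RoomState:
    utility: bool = True            # utility power present
    ats_source: str = "utility"     # input selected by the ATS
    compute_outlets: bool = True    # rack PDU outlets feeding compute nodes

    @property
    def central_pdu(self) -> bool:
        # Generator-backed circuits are live on generator, or on utility if present.
        return self.ats_source == "generator" or self.utility

    @property
    def server_feeds(self) -> int:
        # Dual-supply servers: UPS feed stays up, non-UPS feed follows the PDU.
        return 2 if self.central_pdu else 1

    @property
    def ups_on_battery(self) -> bool:
        return not self.central_pdu


def apply(state: RoomState, event: str) -> None:
    """Apply one event from the sequence to the room state (in place)."""
    if event == "utility_lost":
        state.utility = False
        if state.ats_source == "utility":
            state.compute_outlets = False   # nodes drop; outlets then stay off
    elif event == "generator_transfer":
        state.ats_source = "generator"
    elif event == "utility_restored":
        state.utility = True
    elif event == "retransfer":
        state.ats_source = "utility"        # brief glitch, assumed covered by UPSs
    else:
        raise ValueError(f"unknown event: {event}")


room = RoomState()
for ev in ["utility_lost", "generator_transfer", "utility_restored", "retransfer"]:
    apply(room, ev)
print(room.compute_outlets, room.server_feeds)  # False 2
```

Running the events in a different order, for example with utility power returning before the generator transfer, gives the same final state in this model: compute node outlets off, servers back on two power sources.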
Some comments on the above. There is some chance that the generator fails to start. It is a single point of failure, though it is meant to be a reliable system. It is held in a standby state, basically kept warm to shorten the diesel start time to several seconds.
If the outage is short, utility power may be restored before the generator starts and the ATS picks up the generator input. This shouldn't change the final state.
If all goes as planned in the above, the Tier2 (and Tier3) will see all compute nodes powered down, so all running jobs at MSU are killed, but there are no other effects. In the past, the Tier2 Condor system has failed as a whole when MSU went offline. Condor now appears to handle mass job failures (or disconnects) better.
Currently, the expectation is that the cut back to utility power will result in an EPO glitch (while no mains power is available), powering down the UPSs and powering off all Tier2/3 systems.
EPO trip
The electrical system in 1221 includes an Emergency Power Off (EPO) system controlled by the alarm system. The EPO turns off the central PDU, turns off a switch in 1225 that powers off the wall outlets (in the tracks) in 1221/3/5, and also sends signals that tell the UPSs to turn their output off. The alarm will activate the EPO in two instances: if the VESDA smoke level reaches threshold, or if one of the two EPO buttons is activated (one on the PDU and one on the wall by the VESDA). The EPO is intended to help reduce the risk of electrically driven fire and personnel injury.
The CRACs are not turned off by the EPO (need to double check this). The CRACs have an internal fire temperature sensor (160 °F?) that trips them off.
Activation of the EPO trips the input circuit breaker of the central PDU. This is reset by turning the switch to off and then to on.
The UPSs stay off until manually turned back on after EPO. UPSs that are not correctly connected to the EPO will not be turned off.
Final state: central PDU off, EPO-enabled UPSs off.
Still unclear: when the alarm deactivates the EPO signal (for both VESDA-activated and button-activated trips), when the wall outlet relay turns the outlets back on, and whether the CRACs are turned off.
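The effect of an EPO trip on the central PDU and UPSs, and the manual actions it implies, can be sketched as follows. The UPS names are hypothetical, the order of the recovery steps is an assumption, and the wall outlets and CRACs are left out because their behavior is not settled on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ups:
    name: str                 # hypothetical label, not a real device name
    epo_connected: bool       # True if correctly wired to the EPO signal
    output_on: bool = True


def trip_epo(upss: List[Ups]) -> dict:
    """Apply an EPO trip and return a summary of the resulting state."""
    for ups in upss:
        if ups.epo_connected:
            ups.output_on = False      # stays off until manually turned back on
    return {
        "central_pdu_on": False,       # input breaker tripped by the EPO
        "ups_outputs": {u.name: u.output_on for u in upss},
    }


def manual_recovery_steps(upss: List[Ups]) -> List[str]:
    """List the manual actions implied by the text; the ordering is an assumption."""
    steps = ["Reset central PDU input breaker: turn switch to off, then to on"]
    steps += [f"Manually turn on {u.name}" for u in upss if not u.output_on]
    return steps


upss = [Ups("ups-a", epo_connected=True), Ups("ups-b", epo_connected=False)]
print(trip_epo(upss))
print(manual_recovery_steps(upss))
```

A UPS that is not correctly connected to the EPO (here `ups-b`) keeps its output on and so does not appear in the recovery list, which is also a hint that its wiring should be checked.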
Graceful shutdown
Described at MSUFireDrill.
Graceful start up
Described at MSUFireDrill.
-- JamesKoll - 19 Feb 2014