Shutdown/Startup procedures for AGLT2 Clusters
Procedures to cleanly bring all AGLT2 activity to a halt
- service cfengine3 stop
- This prevents the changes below from being overwritten by cfengine
- Modify /etc/cron.d/modify_quota task on aglbatch to have fixed, low quotas for non-ATLAS jobs.
- Change the script argument from one to zero zero, as in the resulting cron entry:
- 27,57 * * * * /bin/bash /root/condor_quota_config/new_modify_quota.sh 0 0
- Stop auto pilots to AGLT2 and ANALY_AGLT2
- May want to delay the ANALY shutoff, depending on queuing information
- AtlasQueueControl details this procedure
- Notify pandashift and adcos that pilots are stopped, and the reason
- Identify yourself as being with AGLT2, e.g., include your DN
- After the pilots are confirmed to be stopped and there are no further Idle jobs on gate04, do a peaceful shutdown of Condor on all worker nodes (a command-line sketch follows this list)
- cluster_control from a participating machine (aglbatch, umt3int0X) is best
- Option 9 with the full machine list; for example, two parallel runs, one with UM_machines.txt and one with MSU_machines.txt
- Option 2, and specify the Email address to receive shutdown notifications
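For the quota change and the Idle-job check above, a rough command-line equivalent is sketched below. This is only an illustration: the sed expression assumes the existing cron entry differs only in its trailing script arguments, and the condor_q constraint is just one way to confirm that no Idle jobs remain on gate04.

  # On aglbatch: point the quota cron at fixed, low non-ATLAS quotas ("0 0").
  # Assumes only the trailing arguments of the existing entry need to change.
  sed -i 's|new_modify_quota.sh .*|new_modify_quota.sh 0 0|' /etc/cron.d/modify_quota

  # On gate04: list any remaining Idle jobs (JobStatus == 1 means Idle in Condor).
  # An empty listing means the peaceful worker-node shutdown can proceed.
  condor_q -constraint 'JobStatus == 1'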
Now, wait for the actual down-time to arrive.
Stopping ONLY MSU machines, or ONLY UM machines
- If storage is to be shut down, set an outage in OIM starting a few hours prior to the expected outage time, ending an hour or so afterwards
- >36 hours prior to "off" time, log in to AGIS and change the "maxtime" parameter from 259000 to 86400 for all of our Panda queues, specifically AGLT2_SL6, ANALY_AGLT2_SL6 and AGLT2_MCORE (in the case of MSU only).
- ~26 hours prior to "off" time, which gives a few hours' cushion, start idling down the relevant machines
- Use cluster_control option 2 with the list of machines as above
- Log in to each machine and do "condor_off -peaceful -subsys startd" (a scripted sketch follows this list)
- This method will NOT update the cluster_control DB, so the next run will show all the machines idled
- Log in to AGIS again and change the "maxtime" parameter back from 86400 to 259000 for all changed queues.
- A few hours prior to the expected power outage for the storage servers, log in to head01 and set all affected disks to "rdonly" from the PoolManager
- When ready to shut down everything,
- Power off the WN via "shutdown -h now"
- Power off the storage servers
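The per-machine idle-down and power-off steps above can be driven by a simple loop from a host with root ssh access to the worker nodes. This is a minimal sketch, assuming a one-hostname-per-line machine list (MSU_machines.txt is used as the example) and working root ssh; the condor_status constraint is just one way to verify that the startds have drained. As noted above, this manual route does not update the cluster_control DB.

  # Peacefully stop the startd on each listed worker node (~26 hours before the outage).
  for host in $(cat MSU_machines.txt); do
      ssh root@"$host" 'condor_off -peaceful -subsys startd'
  done

  # Later: confirm that no slots are still Claimed (an empty listing means drained).
  condor_status -constraint 'State == "Claimed"'

  # When the outage arrives, power off the worker nodes.
  for host in $(cat MSU_machines.txt); do
      ssh root@"$host" 'shutdown -h now'
  done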
Before powering off the storage servers, it is probably a good idea to set queues offline so that jobs won't fail by trying to fetch files from the offline pools.
- curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --capath /etc/grid-security/certificates 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=ANALY_AGLT2_SL6-condor'
- Same command for AGLT2_SL6-condor
- Notify ADCOS shifts <atlas-project-adc-operations-shifts@cern.ch> that you have set the queues temporarily offline, and why
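Both queues can be handled with the same curl call by looping over the queue names. A minimal sketch, assuming a valid grid proxy in /tmp/x509up_u<uid> as in the command above:

  # Set both Panda queues offline; the same loop with tpmes=setonline brings them back.
  for q in ANALY_AGLT2_SL6-condor AGLT2_SL6-condor; do
      curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` \
           --capath /etc/grid-security/certificates \
           "https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=$q"
  done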
When power is back
- If pool servers were shut down, when they are back up, set the disks back to "notrdonly" on head01 PoolManager
- If the Panda queues were set offline, set them back online (tpmes=setonline)
- Notify ADCOS that the queues are back online
- Power up all WN that were shut down
- Run "/etc/check.sh" on all machines where condor should come back up. There are 2 ways to do this.
- Use cluster_control option 3 on the machines
- Log in, run it, and then, if the result is zero, do "service condor start" (see the sketch after this list)
- This method will not update the cluster_control DB
- service cfengine3 start
- At the next cfengine run, the modified cron scripts will be reverted to their original content
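The check-and-start step above can also be looped over the worker-node lists. This is a minimal sketch, assuming /etc/check.sh signals success with a zero exit status and that root ssh access to the nodes is available.

  # Start Condor only on nodes where /etc/check.sh succeeds (exits 0).
  for host in $(cat UM_machines.txt MSU_machines.txt); do
      ssh root@"$host" '/etc/check.sh && service condor start'
  done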
--
BobBall - 13 Feb 2011