Shutdown/Startup procedures for AGLT2 Clusters
Procedures to cleanly bring all AGLT2 activity to a halt
- service cfengine3 stop
- This prevents the changes below from being overwritten by cfengine
- Modify /etc/cron.d/modify_quota task on aglbatch to have fixed, low quotas for non-ATLAS jobs.
- Change the script argument from one to zero zero, as in the resulting cron entry:
- 27,57 * * * * /bin/bash /root/condor_quota_config/new_modify_quota.sh 0 0
- Stop auto pilots to AGLT2 and ANALY_AGLT2
- May want to delay the ANALY shutoff, depending on queuing information
- AtlasQueueControl details this procedure
- Notify pandashift and adcos that pilots are stopped, and the reason
- Identify yourself as being with AGLT2, e.g., include your DN
- After the pilots are confirmed to be stopped and there are no further Idle jobs on gate04, do a peaceful shutdown of Condor on all worker nodes (a command-line sketch follows this list)
- cluster_control from a participating machine (aglbatch, umt3int0X) is best
- Option 9 with the full machine list; for example, two parallel runs, one with UM_machines.txt and one with MSU_machines.txt
- Option 2, and specify the Email address to receive shutdown notifications
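For the quota change and the Idle-job check above, a rough command-line equivalent is sketched below. This is only an illustration: the sed expression assumes the existing cron entry differs only in its trailing script arguments, and the condor_q constraint is just one way to confirm that no Idle jobs remain on gate04.

  # On aglbatch: point the quota cron at fixed, low non-ATLAS quotas ("0 0").
  # Assumes only the trailing arguments of the existing entry need to change.
  sed -i 's|new_modify_quota.sh .*|new_modify_quota.sh 0 0|' /etc/cron.d/modify_quota

  # On gate04: list any remaining Idle jobs (JobStatus == 1 means Idle in Condor).
  # An empty listing means the peaceful worker-node shutdown can proceed.
  condor_q -constraint 'JobStatus == 1'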
Now, wait for the actual down-time to arrive.
Stopping ONLY MSU machines, or ONLY UM machines
- If storage is to be shut down, set an outage in OIM starting a few hours prior to the expected outage time, ending an hour or so afterwards
- >36 hours prior to "off" time, log in to AGIS and change the "maxtime" parameter from 259000 to 86400 for all of our Panda queues, specifically AGLT2_SL6, ANALY_AGLT2_SL6 and AGLT2_MCORE (in the case of MSU only).
- ~26 hours prior to "off" time, which gives a few hours' cushion, start idling down the relevant machines
- Use cluster_control option 2 with the list of machines as above
- Log in to each machine and do "condor_off -peaceful -subsys startd" (a scripted sketch follows this list)
- This method will NOT update the cluster_control DB, so the next run will show all the machines idled
- Log in to AGIS again and change the "maxtime" parameter back from 86400 to 259000 for all changed queues.
- A few hours prior to the expected power outage for the storage servers, log in to head01 and set all affected disks to "rdonly" from the PoolManager
- When ready to shut down everything,
- Power off the WN via "shutdown -h now"
- Power off the storage servers
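The per-machine idle-down and power-off steps above can be driven by a simple loop from a host with root ssh access to the worker nodes. This is a minimal sketch, assuming a one-hostname-per-line machine list (MSU_machines.txt is used as the example) and working root ssh; the condor_status constraint is just one way to verify that the startds have drained. As noted above, this manual route does not update the cluster_control DB.

  # Peacefully stop the startd on each listed worker node (~26 hours before the outage).
  for host in $(cat MSU_machines.txt); do
      ssh root@"$host" 'condor_off -peaceful -subsys startd'
  done

  # Later: confirm that no slots are still Claimed (an empty listing means drained).
  condor_status -constraint 'State == "Claimed"'

  # When the outage arrives, power off the worker nodes.
  for host in $(cat MSU_machines.txt); do
      ssh root@"$host" 'shutdown -h now'
  done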
Before powering off the storage servers, it is probably a good idea to set queues offline so that jobs won't fail by trying to fetch files from the offline pools.
- curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` --capath /etc/grid-security/certificates 'https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=ANALY_AGLT2_SL6-condor'
- Same command for AGLT2_SL6-condor
- Notify ADCOS shifts <atlas-project-adc-operations-shifts@cern.ch> that you have set the queues temporarily offline, and why
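Both queues can be handled with the same curl call by looping over the queue names. A minimal sketch, assuming a valid grid proxy in /tmp/x509up_u<uid> as in the command above:

  # Set both Panda queues offline; the same loop with tpmes=setonline brings them back.
  for q in ANALY_AGLT2_SL6-condor AGLT2_SL6-condor; do
      curl --cert /tmp/x509up_u`id -u` --cacert /tmp/x509up_u`id -u` \
           --capath /etc/grid-security/certificates \
           "https://panda.cern.ch:25943/server/controller/query?tpmes=setoffline&queue=$q"
  done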
When power is back
- If pool servers were shut down, when they are back up, set the disks back to "notrdonly" on head01 PoolManager
- If the Panda queues were set offline, set them back online (tpmes=setonline)
- Notify ADCOS that the queues are back online
- Power up all WN that were shut down
- Run "/etc/check.sh" on all machines where condor should come back up. There are 2 ways to do this.
- Use cluster_control option 3 on the machines
- Log in, run it, and then, if the result is zero, do "service condor start" (see the sketch after this list)
- This method will not update the cluster_control DB
- service cfengine3 start
- At the next cfengine run, the modified cron scripts will be reverted to their original content
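The check-and-start step above can also be looped over the worker-node lists. This is a minimal sketch, assuming /etc/check.sh signals success with a zero exit status and that root ssh access to the nodes is available.

  # Start Condor only on nodes where /etc/check.sh succeeds (exits 0).
  for host in $(cat UM_machines.txt MSU_machines.txt); do
      ssh root@"$host" '/etc/check.sh && service condor start'
  done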
--
BobBall - 13 Feb 2011