Cluster Control
This is the main page for information about the Cluster Monitoring and Control tools. At this time, these tools maintain and monitor the state of two conditions:
- Power state, as reflected by response to ping packets
- Condor state, i.e., whether the worker node is accepting jobs to run
NOTE that this is a MONITORING TOOL ONLY, intended to maintain up-to-date lists of these status indicators. However, other tools, such as the push_cmd.sh script (see next paragraph), may initiate actions based upon these conditions that ultimately change them; e.g., running the enhanced mode of the eval_T2_condor_ready.sh script on a worker node may result in setting the Condor state to running.
In addition, a tool (push_cmd.sh) is provided that takes, as input, any command that should be executable via ssh and runs it on the provided list of machines. The list is filtered by default to run on only those machines that have a Condor state accepting jobs to run. Further filters and options are provided by the help option (-h) of the tool.
Installation
InstallPage (the installation page is password protected)
Setup Script
Following installation, each of the selected machines, all of which must have the Condor schedd running, picks up the Cluster Control variables from a script named clusco_setup.sh (bash only) that is placed in /etc/profile.d. In particular, the path to the Cluster Control commands is added to the PATH variable.
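A profile.d script of this kind typically does little more than extend PATH. The sketch below is not the content of clusco_setup.sh; it only illustrates the idea. It assumes the commands live in /atlas/data08/manage/cluster (the directory shown in the usage lines on this page), and the names CLUSCO_HOME and clusco_add_path are hypothetical.

```shell
# Hypothetical sketch of a PATH-extending profile.d script.
# CLUSCO_HOME and clusco_add_path are illustrative names only,
# they are not taken from clusco_setup.sh.
CLUSCO_HOME="/atlas/data08/manage/cluster"

# Append a directory to PATH only if it is not already present,
# so that sourcing the script twice does not duplicate the entry.
clusco_add_path() {
    case ":${PATH}:" in
        *":$1:"*) ;;
        *) PATH="${PATH}:$1" ;;
    esac
}

clusco_add_path "$CLUSCO_HOME"
export PATH
```

Wrapping the PATH with colons before matching avoids false hits on directories that merely share a prefix with the one being added.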
Control Interface and command scripts
This interface is not yet implemented. When complete, it will provide access to the command scripts below.
get
This provides access to two shell scripts, one each to get the Power State or Condor State of the indicated machine.
get_condor_state.sh
Usage: /atlas/data08/manage/cluster/get_condor_state.sh machine
Return the condor run/stop state of the specified machine.
Returns zero if stopped/unknown, 1 if startd is running
get_power_state.sh
Usage: /atlas/data08/manage/cluster/get_power_state.sh machine
Return the power on/off state of the specified machine.
Returns zero if off/unknown, 1 if machine is running
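The usage text does not say whether "returns" means the exit status or a printed value. The sketch below assumes the exit status, in which case 1 means "up", the reverse of the usual shell convention where 0 means success. To keep the example runnable anywhere, a stub function (get_power_state_stub, a hypothetical name) stands in for get_power_state.sh.

```shell
# Stand-in for get_power_state.sh so the sketch runs without the cluster.
# ASSUMPTION: the real script reports its result as the exit status.
get_power_state_stub() {
    case "$1" in
        bl-2-19) return 1 ;;   # pretend this node is powered on
        *)       return 0 ;;   # off or unknown
    esac
}

describe_power_state() {
    # Capture the status explicitly instead of using "if cmd; then",
    # because here a status of 1 is the "up" answer, not a failure.
    status=0
    get_power_state_stub "$1" || status=$?
    if [ "$status" -eq 1 ]; then
        echo "$1 is up"
    else
        echo "$1 is off or unknown"
    fi
}

describe_power_state bl-2-19
describe_power_state bl-2-2
```

The same pattern would apply to get_condor_state.sh under the same assumption.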
set
This provides access to two shell scripts, one each to set the Power State or Condor State of the indicated machine.
set_condor_state.sh
Usage: /atlas/data08/manage/cluster/set_condor_state.sh machine state
Set the indicated Condor state as up or down
State should be yes(up) or no(down) and is case independent
If marking UP (yes), also set the power state UP
set_power_state.sh
Usage: /atlas/data08/manage/cluster/set_power_state.sh machine state
Set the indicated machine as up or down
State should be yes or no (case independent)
Machines marked down are also marked with condor off
No condor change is made to machines marked up
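The coupling between the two set scripts can be modelled in a few lines. This is a sketch of the rules as described above, not the scripts themselves: two shell variables stand in for one machine's DB columns, and the function names are hypothetical.

```shell
# Normalize a yes/no argument without regard to case.
# Prints YES or NO, or returns non-zero for anything else.
normalize_state() {
    case "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" in
        YES) echo YES ;;
        NO)  echo NO ;;
        *)   return 1 ;;
    esac
}

# In-memory stand-ins for the POWER and CONDOR columns of one machine.
POWER=NO
CONDOR=NO

model_set_condor() {
    state=$(normalize_state "$1") || return 1
    CONDOR=$state
    # Marking Condor up also marks the machine as powered up.
    if [ "$state" = YES ]; then POWER=YES; fi
}

model_set_power() {
    state=$(normalize_state "$1") || return 1
    POWER=$state
    # Marking a machine down also marks Condor off;
    # marking it up leaves the Condor state untouched.
    if [ "$state" = NO ]; then CONDOR=NO; fi
}

model_set_condor yes
echo "POWER=$POWER CONDOR=$CONDOR"
model_set_power No
echo "POWER=$POWER CONDOR=$CONDOR"
```

The asymmetry is deliberate in the rules above: a machine cannot run Condor while powered off, but a powered-on machine need not be running Condor.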
cluster_control
The controlling script cluster_control is the best interface to all of the above commands, plus the push_cmd.sh script documented below. It operates in two modes, interactive and command-line. The interactive version, i.e., invoked with no arguments whatsoever, presents a list of options (below) and then prompts for any additional parameters needed to perform the desired action.
*****************
Choose an Option
1. What are the assumptions here?
2. Peacefully Idle a Worker Node
3. Start up Condor on a Worker Node
4. Immediately Stop Condor on a Worker Node
5. Get Power state of a Worker Node from the DB
6. Change DB Power state of a Worker Node
7. Get Condor state of a Worker Node from the DB
8. Change DB Condor state of a Worker Node
9. If doing many machines, name of file containing list []
10. Execute a command on the machine list above
11. Print help text
12. Quit
The command-line version is designed for inclusion within scripts. Both versions take either an input file containing a list of machines to operate upon, one machine per line, or a comma-separated list of machines. A single wildcard (*) in each "machine" in the list is allowed. Special machine list names are T2WN, UMWN and MSUWN.
NOTE: Both the interactive and command line versions of cluster_control use ONLY the local network machine names when given a list of machines to operate upon.
Option 10 is not (yet) available in the command-line version. Option 9 is not implemented in this mode.
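To illustrate how a comma-separated list with wildcards selects machines, the sketch below filters candidate names read from standard input. It assumes the wildcard behaves like a shell glob, which this page does not confirm. It does not handle the special names T2WN, UMWN and MSUWN, and match_filter is a hypothetical helper, not part of cluster_control.

```shell
# Print each candidate name (one per line on stdin) that matches any
# entry of a comma-separated filter such as "bl-2-*,cc-117-2".
match_filter() {
    filter=$1
    while IFS= read -r name; do
        rest=$filter
        while [ -n "$rest" ]; do
            # Peel off the first comma-separated entry.
            pattern=${rest%%,*}
            case "$rest" in
                *,*) rest=${rest#*,} ;;
                *)   rest= ;;
            esac
            # An unquoted variable in a "case" pattern keeps * as a wildcard.
            case "$name" in
                $pattern) echo "$name"; break ;;
            esac
        done
    done
}

printf '%s\n' bl-2-19 bl-2-2 bl-6-8 cc-117-2 | match_filter 'bl-2-*,cc-117-2'
```

With the sample input, the names bl-2-19, bl-2-2 and cc-117-2 pass the filter, while bl-6-8 does not.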
cluster_control -h
Usage: cluster_control Full Interactive invocation, no arguments accepted
cluster_control [-h|--help]
-c|--command <command number>
-f|--filter machine_names
[-s|--state ON|OFF]
[-e|--email email_address]
command and filter are always required if any arguments are supplied
at all, with the exception of the help command
-h|--help Print this help text and exit
-c|--command <command_number> The command to execute. run the
full interactive command to see the legal list
-f|--filter machine_names Execute command_number on these machines.
This is a comma separated list, with each entry
including up to one wildcard. There are 3 special
wildcards that over-ride any single machine(s);
T2WN, MSUWN and UMWN. The first is exclusive, the
last 2 can be combined (but just equal T2WN then)
-s|--state ON|OFF The state to set in commands that require
this argument.
-e|--email email_address The address to which command completion
notifications are sent, for commands that require
an email address
Unknown or invalid options cause cluster_control to abort with an
error code.
Example commands
- cluster_control -c 5 -f bl-2-19,bl-2-2
- cluster_control -c 2 -f UMWN -e ball@umich.edu
Maintenance Scripts
Various scripts are provided.
crontask.sh
This task can be run daily via a crontab entry to validate the state of the maintained DataBase. No actual changes to the DataBase are made. It utilizes seed_up_states.sh in a "safe" mode.
crontask.sh reports
The task is set up to send an Email (default to aglt2-hardware list) with subject "Cluster Control DB Inconsistencies" if a discrepancy of any kind is found. An example of such an Email is shown here.
For more information
See https://hep.pa.msu.edu/twiki/bin/view/AGLT2/ClusterControl
LocalName PublicName Subnet Type POWER CONDOR
65c65
< "bl-6-8","bl-6-8","local","T2_T3_Share","YES","YES"
---
> "bl-6-8","bl-6-8","local","T2_T3_Share","YES","NO"
293c293
< "cc-106-41","c-106-41","msulocal","T2only","YES","YES"
---
> "cc-106-41","c-106-41","msulocal","T2only","NO","YES"
328c328
< "cc-117-2","c-117-2","msulocal","ALL12","YES","YES"
---
> "cc-117-2","c-117-2","msulocal","ALL12","NO","YES"
In the above report:
- bl-6-8, Condor was stopped in such a way that the DB entry was not updated.
- Solution: set_condor_state.sh bl-6-8 NO
- cc-106-41 and cc-117-2, Condor is apparently still running (reported via condor_status), but the machine shows down. The power test consists of sending exactly 4 ping packets and receiving all 4 back; otherwise the machine is reported as down. Clearly at least one packet was lost for each of these 2 machines.
- Solution: No action required, problem was certainly transient
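When a report contains many entries, it can help to reduce each pair of rows to the column that actually changed. The sketch below does this for one pair, assuming the column layout from the report header (LocalName, PublicName, Subnet, Type, POWER, CONDOR). Judging by the explanations above, the "<" row appears to be the DB entry and the ">" row the freshly probed state. diff_columns is a hypothetical helper, not one of the provided scripts.

```shell
# Compare a DB row ($1) with a probed row ($2) and report which of the
# POWER and CONDOR columns differ. Rows are quoted CSV as in the report.
diff_columns() {
    printf '%s\n%s\n' "$1" "$2" | awk -F',' '
        { gsub(/"/, "") }                      # drop quotes, fields are re-split
        NR == 1 { name = $1; power = $5; condor = $6 }
        NR == 2 {
            if ($5 != power)  print name ": POWER "  power  " -> " $5
            if ($6 != condor) print name ": CONDOR " condor " -> " $6
        }'
}

diff_columns '"bl-6-8","bl-6-8","local","T2_T3_Share","YES","YES"' \
             '"bl-6-8","bl-6-8","local","T2_T3_Share","YES","NO"'
```

For the bl-6-8 pair this reports a CONDOR change from YES to NO, which matches the suggested fix of setting the Condor state to NO.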
seed_up_states.sh
Takes the DB and iterates through all machines in the list, checking their Power and Condor states. It is used both to set the initial state of the maintained DB and to monitor it via crontask.sh (above).
Admins, run this with caution.
Action Scripts
push_cmd.sh
Tool that takes a list of machines and runs a command for each of them, either locally or remotely via ssh.
Usage: /atlas/data08/manage/cluster/push_cmd.sh [-f filename|-m machine_list] \
[-e UP|CONDOR|NONE] [-l|-r] [-h] command
/atlas/data08/manage/cluster/push_cmd.sh executes commands on all the machines in a list.
-h gives a brief command summary.
If command type is -l, use ssh on the local NIC of the named host.
If command type is -r (the default), run the command on this
local host aglbatch, using the public NIC name of hosts from the
named file as the last argument of the command.
Use -f <filename> to name a file with a list of host names.
Enter one host per line in the file, either public name or private.
The domainname is not required.
The default file name is $CWD/machines.txt
Use -m <machine_list> to pass a list of comma-separated host names
with each entry including up to one wildcard. There are
3 special wildcards that over-ride any single machine(s);
T2WN, MSUWN and UMWN. The first is exclusive, the
last 2 can be combined (but just equal T2WN then).
The -m and -f options are mutually exclusive.
Use -e [UP|CONDOR|NONE] to exclude machines that are not UP,
or not running CONDOR, or just exclude NONE at all in the list.
ON is a synonym for UP
RUN is a synonym for CONDOR
By default, any machine not running CONDOR jobs will be excluded
The balance of the command line will be the command to execute.
For interactive machines, the $USER account will be used with sudo
Command failures and machine exclusions are logged to
$CWD/failed.list
Example commands
- push_cmd.sh -f UM_machines.txt -e UP -l "date"
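The exclusion rule behind -e can be sketched as a small decision function. This illustrates the documented behavior and is not code from push_cmd.sh. The function names are hypothetical, and accepting only upper-case arguments is an assumption, since the help text does not say whether -e is case sensitive.

```shell
# Map the -e argument to its canonical form, honoring the documented
# synonyms (ON for UP, RUN for CONDOR).
normalize_exclude() {
    case "$1" in
        UP|ON)      echo UP ;;
        CONDOR|RUN) echo CONDOR ;;
        NONE)       echo NONE ;;
        *)          return 1 ;;
    esac
}

# Decide whether a machine stays in the list.
# $1 = exclusion mode, $2 = POWER (YES/NO), $3 = CONDOR (YES/NO)
# Returns 0 to keep, 1 to exclude, 2 for an unrecognized mode.
keep_machine() {
    mode=$(normalize_exclude "$1") || return 2
    case "$mode" in
        NONE)   return 0 ;;
        UP)     [ "$2" = YES ] ;;
        CONDOR) [ "$3" = YES ] ;;
    esac
}

# CONDOR is the documented default when -e is not given.
if keep_machine "${EXCLUDE:-CONDOR}" YES NO; then
    echo "keep"
else
    echo "skip"
fi
```

In the usage shown, a machine that is powered on but not running Condor is skipped under the default mode, whereas -e UP would keep it.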
--
BobBall - 30 May 2011