AGLT2 Compute Node Health Assessment Utilities
General Goals
These compute node health assessment utilities were designed to assist in managing the AGLT2 compute nodes. A node's "Health" refers to its readiness to fulfill its primary purpose, i.e. to run Atlas grid jobs.
The procedure typically followed before the "health check" functionality existed was labor-intensive. After rebuilding a set of compute nodes, we had to evaluate, node by node, whether each one was ready to receive grid jobs. When a node is rebuilt by Rocks, and assuming the rebuild succeeds, the condor_master service is started on the node but is intentionally not configured to immediately accept condor jobs. For a node to be ready to accept Atlas jobs, a number of preconditions must be verified (described below), and we had been checking these conditions manually. The operator had to keep track of which nodes were ready and, for those nodes, issue the "condor_on" command from the condor negotiator node (currently aglbatch.aglt2.org).
The situation was improved with the set of scripts we call here the Health Assessment Utilities. This set includes:
- node-level scripts that assess a node's own "health" status, automatically performing the tests we were previously doing manually
- centralized summaries of this health status for all nodes, allowing automatic condor startup and continuous monitoring.
The node-level scripts check for typical problem conditions in a systematic manner. These scripts check that each of the individual prerequisites (afs, pnfs, etc., described further down) is met. If all tests pass successfully, the node may then be enabled to start accepting jobs. If one or more tests have failed, the health script results point the operator at the nodes that need attention and give an initial hint of the source of the problem.
Each compute node is also able to report its own health status for further processing. This information is currently used in two ways:
- Each node can report its readiness status by creating a telltale file in one common remote directory. A separate central script on the condor negotiator node monitors this directory and automatically issues the condor_on command for each node that has declared itself ready. This mode of health reporting is not triggered automatically, but only manually, under the supervision of the administrator.
- Each node also makes its health status available as monitoring information on a continuous basis. To produce summarized reports monitoring the whole system, this health status is now also merged into our existing Compute Summary web pages. The most efficient way to accomplish this was to add a ganglia metric to each compute node. This is similar to, e.g., the rocks-provided load_fifteen metric, which was the information that first motivated the creation of our Compute Summary web pages.
Summary of User Interfaces to the Health Assessment Utilities
- The configuration file for the tests on each compute node
/etc/health.conf
- Two entries are present in /etc/crontab to:
- invoke the script performing the health checks (currently every 15 minutes)
- invoke the script updating the ganglia metric (currently every minute)
- The main health script /etc/health.d/eval_T2_condor_ready.sh
The administrator may run this script interactively with any arbitrary argument (e.g. "check") to perform and display the results of all the health tests without interfering with the crontab use of this script. No action is taken unless this argument has the value "enhanced", in which case the script creates a file named after the node's hostname (e.g. "cc-102-1.msulocal") in an NFS directory (cf. below), and only if all tests pass successfully.
- The common directory for readiness status report
/atlas/data08/condor_ready_dir as currently defined in health.conf
- The log files written by the crontab scripts
- /var/tmp/run_health_evals.log created by the health check scripts
- /var/tmp/gmetric_aglt2_health.log created by the gmetric update script
- The "AGLT2 Health" ganglia metric
This is a new ganglia metric added to the default set provided by rocks. This metric, named "AGLT2 Health", is mainly used in the AGLT2 Compute Summary, but is of course also visible and graphed on each compute node's ganglia page.
This metric uses discrete values, 0-6. The value 0 is not used to represent actual error conditions, as it can also correspond to the lack of measurement.
| metric value | AGLT2 Health Status |
| 0 | no information (e.g. scripts are not installed, or not running) |
| 1 | script failure (e.g. one of the monitoring scripts has hung, or data found stale) |
| 2 | an error was detected (but not a dccp or condor error) |
| 3 | a dccp error was detected |
| 4 | no other error, but condor service not running |
| 5 | no error, and condor service running, but node is not set to accept jobs |
| 6 | no error, and node is accepting jobs |
The value 2 is meant as a catch-all for most of the errors detected, as it might be difficult and counterproductive to bubble up to this interface the many flavors of errors detected by an evolving set of tests. There are two exceptions. The node status with respect to condor is shown in more detail because, once the nodes have been built successfully, this metric is anticipated to be most useful while enabling condor jobs on the nodes. The dccp copy test was also singled out because dcache access has been perceived as one of the more fragile components of the system.
- Compute Summary web pages
The AGLT2 Health metric is collected and displayed on our Compute Summary pages. A small colored tile with the letter "H" is added to the entry for each compute node.
The color encoding is the following:
| Code | Tile Color | AGLT2 Health Status |
| 0 | white | no information |
| 1 | red | script failure |
| 2 | bright orange | an error was detected |
| 3 | light orange | a dccp error was detected |
| 4 | light green | condor service not running |
| 5 | medium green | ok, but node is not set to accept jobs |
| 6 | dark green | node is accepting jobs |
Details on Node level scripts
There are two separate sets of scripts running independently and at different times.
- One set of health status scripts evaluates the node's health status and runs at a repetition rate low enough not to significantly load the compute node and the network (this matters especially for the dccp test). This rate is currently set to 15 minutes in /etc/crontab. Note also that the start times on all compute nodes were randomly staggered during installation to spread the dccp copy tests out in time.
- A separate ganglia gmetric script uses the output of the health status scripts to frequently update the health status metric. It runs at a frequency adequate for ganglia, currently set to run once per minute. There may be more than one gmetric created in the future, and we might then have more than one gmetric script.
Both sets of scripts are run via crontab entries. In addition, the main health assessment script also runs once at OS installation time in order to insert an initial record of the node's health in the build log. The health assessment script may also be run manually, including in an enhanced mode. The details below are mostly presented from the point of view of the monitoring purpose of these scripts, but the manual modes are also described where appropriate.
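As an illustration, crontab entries of roughly the following shape would produce the behaviour described above. The exact command lines are not reproduced on this page, and the minute offsets shown here are placeholders (the real offsets were staggered per node at install time):

    # illustrative /etc/crontab entries (hypothetical; the actual entries may differ)
    # health checks, every 15 minutes (minute offsets staggered per node)
    7,22,37,52 * * * *  root  /root/tools/run_health_evals.sh > /var/tmp/run_health_evals.log 2>&1
    # ganglia metric update, once per minute
    *  *  *  *  *       root  /root/tools/gmetric_aglt2_health.sh > /var/tmp/gmetric_aglt2_health.log 2>&1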
Health Status Monitoring scripts.
One central script, /root/tools/run_health_evals.sh, is invoked by crontab. The output of this crontab entry is sent to /var/tmp/run_health_evals.log.
This central script invokes, in alphabetic order, all the *.sh shell scripts, then all the *.pl perl scripts, found in /etc/health.d. We currently have two shell scripts in /etc/health.d: /etc/health.d/eval_T2_condor_ready.sh and /etc/health.d/eval_condor_service_accepting.sh.
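The driver script itself is not listed on this page; a minimal sketch of the dispatch logic it implements (alphabetic order, *.sh before *.pl, output going to stdout and captured by the crontab redirection) could look like:

    #!/bin/bash
    # Hypothetical sketch of /root/tools/run_health_evals.sh (not the actual script).
    # Run all *.sh scripts, then all *.pl scripts, found in /etc/health.d, in alphabetic order.
    # Assumes the scripts in /etc/health.d are executable.
    HEALTHDIR=/etc/health.d
    for script in "$HEALTHDIR"/*.sh "$HEALTHDIR"/*.pl; do
        [ -e "$script" ] || continue          # skip an unmatched glob if one set is empty
        echo "=== $(date) : running $script ==="
        "$script"                             # dry mode: no argument is passed
        echo "=== $script exited with code $? ==="
    done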
eval_T2_condor_ready script
This script is found in /etc/health.d/eval_T2_condor_ready.sh and performs most of the health assessment. It sequentially verifies each necessary prerequisite to determine whether a node is ready for the condor_on command, i.e. ready to receive Atlas jobs.
The script creates an output file, /var/tmp/eval_T2_condor_ready, containing a single line with a single character representing a coded value corresponding to the result of the tests.
This script can be run in three modes:
- If no argument is present, the script runs in dry mode (a.k.a. monitoring mode) and will never create a remote file (cf. enhanced mode below). Dry mode is intended to generate an output file that is subsequently used to update the ganglia metric. In dry mode, the script first looks for the existence of a lock file and immediately exits if one is found; it will thus not perform any health test if another instance is already running (or hung) in dry mode. The crontab entry (via run_health_evals.sh) runs this script in dry mode. If the script were to hang during execution, this lock file mechanism prevents crontab from accumulating many more instances of the script, which would most likely hang as well.
- With any argument other than "enhanced" (e.g. "check"), this script runs in diagnostic mode. In diagnostic mode, the script does not create a logfile and is not hindered by any lock file (cf. below) that might have been left by an instance of the script running in dry mode. The administrator may manually run the script in this mode and view the screen output to diagnose a problem.
- With the argument "enhanced" this script runs in enhanced mode. In enhanced mode, this script will create a file named in a common remote directory once it has determined that the node is completely healthy, This remote file is then used as a stimulus, or a request to automatically enable this node with the condor_on command. Note that a compute node cannot enable itself, as this command has to be issued on the condor negotiator node. If any of the tests has failed, no remote file is created, and the node will thus not be enabled. This script is run in enhanced mode only manually, and by the site administrator at a few specific times to re-enable a set of nodes. This script is never run in enhanced mode automatically, and compute nodes do not automatically receive a condor_on command, even after a re-install or a reboot. A site administrator is always involved. These scripts were designed to take away the tedious part of checking that the node is ready, but the responsibility of allowing a node to start receiving condor jobs will always rest with the site administrator. There could be one or more reasons why a compute node, even healthy, should not be receiving condor jobs; e.g. it could be used for rocks build tests, or have a known hardware problem being diagnosed. In order to insure that a node cannot be even mistakenly enabled, the condorReadyDir variable in /etc/health.conf can be changed e.g. to "null".
In any mode (dry, diagnostic, or enhanced), the same set of tests is performed, and the same output messages and exit code are issued.
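For reference, the central script on the negotiator node that consumes these readiness files is not reproduced on this page; a minimal sketch of the kind of loop it might run, assuming the condorReadyDir location given earlier and a plain condor_on invocation per node, is:

    #!/bin/bash
    # Hypothetical sketch of the central watcher on the condor negotiator node
    # (aglbatch.aglt2.org); the actual script is not shown on this page.
    READY_DIR=/atlas/data08/condor_ready_dir        # condorReadyDir in /etc/health.conf
    for f in "$READY_DIR"/*; do
        [ -e "$f" ] || continue                     # the directory may be empty
        node=$(basename "$f")                       # each file is named after a node's hostname
        echo "$(date) : enabling condor on $node"
        condor_on "$node" && rm -f "$f"             # enable the node, then consume the request
    done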
The set of tests performed by this script is the following:
- Check that the /tmp disk partition is mounted. If this test fails, the script writes the value EvalT2CondorReady_no_tmp (set to "1" in /etc/health.conf) in the output file and also returns this same value as exit code.
- Check that the /tmp disk partition is large enough. The threshold is set by the "smallestDisk" parameter (set to 225 GB in /etc/health.conf). If this test fails, the script writes the value EvalT2CondorReady_tmp_min_size (currently set to "2" in /etc/health.conf) in the output file and also returns this same value as exit code.
- Check that the openafs cache partition is present. If this test fails, the script writes the value EvalT2CondorReady_no_afs_cache (currently set to "3" in /etc/health.conf) in the output file and also returns this same value as exit code.
- Check that the afs service is running. If this test fails, the script writes the value EvalT2CondorReady_afs_not_running (currently set to "4" in /etc/health.conf) in the output file and also returns this same value as exit code.
- Check that there is sufficient swap space. This is checked in two ways, first against a fixed threshold expectedSwapSize (currently set to 16 GB in /etc/health.conf), then against a proportional threshold corresponding to swapPerCore for each core present (currently set to 2 GB/core). If this test fails, the script writes the value EvalT2CondorReady_swap_min_size (currently set to "5" in /etc/health.conf) in the output file and also returns this same value as exit code.
- Make sure that /pnfs is mounted. There is currently no explicit test, just a command to mount the partition in case it previously was not. No error code is returned corresponding specifically to the absence of the /pnfs mount point or a failure of the mount command; the dccp command below will fail and be reported as such.
- Check dCache access by running a dccp command targeting a specific file pnfsFile and checking the copied file's size against the expected pnfsFileSize. Currently the file transferred is DBRelease-7.3.2.tar.gz, of size 291026152 bytes. If this test fails, the script writes the value EvalT2CondorReady_pnfs_fail (currently set to "6" in /etc/health.conf) in the output file and also returns this same value as exit code.
- Check that the Condor service is started by checking for the existence of a condor_master process. If this test fails, the script writes the value EvalT2CondorReady_condor_not_run (currently set to "7" in /etc/health.conf) in the output file and also returns this same value as exit code.
- If all tests are passed successfully, the script writes the value EvalT2CondorReady_OK (set to "0" in /etc/health.conf) in the output file and also returns this same value as exit code.
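A condensed, hypothetical sketch of a few of these checks is shown below. It assumes that /etc/health.conf can be sourced as a shell fragment and that its numeric thresholds (smallestDisk, expectedSwapSize, swapPerCore, pnfsFileSize) are stored as plain numbers; the actual script is not reproduced on this page.

    #!/bin/bash
    # Hypothetical, abbreviated sketch of the checks in eval_T2_condor_ready.sh.
    . /etc/health.conf                       # assumption: parameters and result codes live here as shell variables
    OUT=/var/tmp/eval_T2_condor_ready

    fail() { echo "$2"; echo "$1" > "$OUT"; exit "$1"; }   # record the coded value and exit with it

    # /tmp must be a mounted partition of at least smallestDisk GB
    grep -qs ' /tmp ' /proc/mounts || fail "$EvalT2CondorReady_no_tmp" "/tmp is not mounted"
    tmp_gb=$(df -P -k /tmp | awk 'NR==2 {print int($2/1024/1024)}')
    [ "$tmp_gb" -ge "$smallestDisk" ] || fail "$EvalT2CondorReady_tmp_min_size" "/tmp too small: ${tmp_gb} GB"

    # swap must be at least expectedSwapSize GB and swapPerCore GB per core
    swap_gb=$(free -g | awk '/^Swap:/ {print $2}')
    cores=$(grep -c ^processor /proc/cpuinfo)
    { [ "$swap_gb" -ge "$expectedSwapSize" ] && [ "$swap_gb" -ge $((cores * swapPerCore)) ]; } \
        || fail "$EvalT2CondorReady_swap_min_size" "swap too small: ${swap_gb} GB for $cores cores"

    # dCache access: copy the reference file with dccp and compare its size
    copy=/tmp/health_dccp_test.$$
    dccp "$pnfsFile" "$copy" && [ "$(stat -c %s "$copy")" -eq "$pnfsFileSize" ] \
        || fail "$EvalT2CondorReady_pnfs_fail" "dccp copy test failed"
    rm -f "$copy"

    # condor_master must be running
    pgrep -x condor_master > /dev/null || fail "$EvalT2CondorReady_condor_not_run" "condor_master not running"

    echo "$EvalT2CondorReady_OK" > "$OUT"
    exit "$EvalT2CondorReady_OK"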
In dry mode, the eval_T2_condor_ready script first checks for the existence of a lock file, /var/lock/eval_T2_condor_ready, and will promptly exit with status EvalT2CondorReady_OK (set to "0" in the health.conf file) if this file is present, or create the lock file if it does not already exist. This simple mechanism prevents processes from piling up if they fail to complete normally. The script also writes the current time in the lock file. This time stamp is used by the gmetric_aglt2_health script to first verify the age of the lock file. The lock file is normally deleted automatically when the script exits, whether or not it found any problem.
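A minimal sketch of this lock-file handling, under the assumption that the time stamp is written as seconds since the epoch (the exact format used by the real script is not specified on this page), might look like:

    #!/bin/bash
    # Hypothetical sketch of the dry-mode lock-file mechanism described above.
    . /etc/health.conf
    LOCK=/var/lock/eval_T2_condor_ready
    if [ -e "$LOCK" ]; then
        echo "lock file $LOCK present: another dry-mode instance is running (or hung); exiting"
        exit "$EvalT2CondorReady_OK"
    fi
    date +%s > "$LOCK"                 # record the start time for gmetric_aglt2_health to check
    trap 'rm -f "$LOCK"' EXIT          # remove the lock on exit, whether or not a problem was found
    # ... the health tests described above run here ...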
eval_condor_service_accepting script
This script is located in /etc/health.d/eval_condor_service_accepting.sh and performs a single test; it determines whether condor is currently set to accept jobs.
This script does not take any argument. The script creates an output file, /var/tmp/eval_condor_service, containing a single line with a single character representing a coded value corresponding to the result of the test. The script checks whether the Condor service is accepting jobs by checking for the existence of at least one condor_startd process. If this test fails, the script writes the value EvalCondorServiceAccepting_not_ready (currently set to "0" in /etc/health.conf) in the output file and also returns this same value as exit code. If this test passes, the script writes the value EvalCondorServiceAccepting_OK (currently set to "1" in /etc/health.conf) in the output file and also returns this same value as exit code.
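Since the test is a single process check, a hypothetical sketch of this script is very short; the condor_startd test is the one described above, while the exact implementation may differ:

    #!/bin/bash
    # Hypothetical sketch of eval_condor_service_accepting.sh (single test).
    . /etc/health.conf
    OUT=/var/tmp/eval_condor_service
    if pgrep -x condor_startd > /dev/null; then
        code=$EvalCondorServiceAccepting_OK          # "1": at least one condor_startd process exists
    else
        code=$EvalCondorServiceAccepting_not_ready   # "0": condor is not currently accepting jobs
    fi
    echo "$code" > "$OUT"
    exit "$code"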
In dry mode, the eval_condor_service_accepting script first checks for the existence of a lock file, /var/lock/eval_condor_service_accepting, and will promptly exit with status EvalCondorServiceAccepting_OK (defined in the health.conf file) if this file is present, or create the lock file if it does not already exist. This simple mechanism prevents processes from piling up if they fail to complete normally. The script also writes the current time in the lock file. This time stamp is used by the gmetric_aglt2_health script to first verify the age of the lock file. The lock file is normally deleted automatically when the script exits, whether or not it found any problem.
gmetric_aglt2_health script
This script is found in /root/tools/gmetric_aglt2_health.sh. Its purpose is to review the health assessment performed by the above two evaluation scripts and to update the ganglia metric on the local node. The script also verifies that any information it may find is not outdated.
The gmetric_aglt2_health.sh script first looks for the existence of the lock file (described above) corresponding to each of the two evaluation scripts (eval_T2_condor_ready.sh and eval_condor_service_accepting.sh).
- If the lock file exists, the corresponding evaluation script is still running. As described above, the lock file contains the time at which the corresponding script started its processing. The gmetric_aglt2_health script compares the time elapsed since then to a threshold GmetricAglt2Health_lock_timeout (currently set to 2100 s = 35 minutes in the health.conf file, which is more than twice the normal repetition rate of the evaluation scripts). If the lock file is found to be outdated, the gmetric_aglt2_health script updates the AGLT2 Health metric with the value GmetricAglt2Health_hang (set to "1" in the health.conf file).
- If the lock file exists and is not considered outdated, the gmetric_aglt2_health script just continues on.
- The lock file will typically not exist, and the gmetric_aglt2_health script will then look for the output files that were created earlier.
The gmetric_aglt2_health.sh script then looks for the existence of the output file (described above) corresponding to each of the two evaluation scripts (eval_T2_condor_ready.sh and eval_condor_service_accepting.sh).
- If either output file does not exist, the gmetric_aglt2_health script updates the AGLT2 Health metric with the value GmetricAglt2Health_no_info (set to "0" in the health.conf file).
- These output files will typically exist, and the gmetric_aglt2_health script compares the time elapsed since each file was created to the same threshold as above (GmetricAglt2Health_lock_timeout, currently set to 2100 s = 35 minutes). If an output file is found to be outdated, the gmetric_aglt2_health script updates the AGLT2 Health metric with the value GmetricAglt2Health_hang (set to "1" in the health.conf file).
- If both output files exist and are not outdated, the gmetric_aglt2_health script reads the coded values found in these two output files.
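As an illustration of the staleness handling described above, a hypothetical age check for one of these files (using the file's modification time, which may or may not be what the real script uses) could look like:

    #!/bin/bash
    # Hypothetical sketch of the staleness check applied to an output file.
    . /etc/health.conf
    check_age() {                                # $1: file to test
        local age=$(( $(date +%s) - $(stat -c %Y "$1") ))
        if [ "$age" -gt "$GmetricAglt2Health_lock_timeout" ]; then   # 2100 s = 35 minutes
            gmetric --name="AGLT2 Health" --value="$GmetricAglt2Health_hang" --type=uint8
            exit 0
        fi
    }
    check_age /var/tmp/eval_T2_condor_ready
    check_age /var/tmp/eval_condor_service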
The gmetric_aglt2_health.sh script then looks at the coded value in the output file from the eval_T2_condor_ready script. If this value is not EvalT2CondorReady_OK, the script distinguishes three problem categories that receive specific AGLT2 Health codes, as described in the summary table in the section above:
- If this coded value is EvalT2CondorReady_condor_not_run, the gmetric_aglt2_health script updates the AGLT2 Health metric with the value GmetricAglt2Health_condor_not_running (set to "4" in the health.conf file).
- If this coded value is EvalT2CondorReady_pnfs_fail, the gmetric_aglt2_health script updates the AGLT2 Health metric with the value GmetricAglt2Health_dccp_error (set to "3" in the health.conf file).
- If it is anything else, the gmetric_aglt2_health script updates the AGLT2 Health metric with the value GmetricAglt2Health_problem (set to "2" in the health.conf file).
If the coded value from the eval_T2_condor_ready output file is EvalT2CondorReady_OK, the gmetric_aglt2_health script then looks at the coded value from the eval_condor_service_accepting output file.
- If this coded value is EvalCondorServiceAccepting_OK, the gmetric_aglt2_health script updates the AGLT2 Health metric with the value GmetricAglt2Health_OK (set to "6" in the health.conf file).
- If it is anything else, the gmetric_aglt2_health script updates the AGLT2 Health metric with the value GmetricAglt2Health_not_accepting_jobs (set to "5" in the health.conf file).
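Putting the above decision logic together, a hypothetical sketch of the value selection and of the metric update (using the standard ganglia gmetric command; the exact options used on the nodes are not shown on this page) is:

    #!/bin/bash
    # Hypothetical sketch of the decision logic in gmetric_aglt2_health.sh, assuming
    # both output files exist and passed the staleness checks described above.
    . /etc/health.conf
    ready=$(cat /var/tmp/eval_T2_condor_ready)
    accepting=$(cat /var/tmp/eval_condor_service)

    if [ "$ready" != "$EvalT2CondorReady_OK" ]; then
        case "$ready" in
            "$EvalT2CondorReady_condor_not_run") value=$GmetricAglt2Health_condor_not_running ;;  # 4
            "$EvalT2CondorReady_pnfs_fail")      value=$GmetricAglt2Health_dccp_error ;;          # 3
            *)                                   value=$GmetricAglt2Health_problem ;;             # 2
        esac
    elif [ "$accepting" = "$EvalCondorServiceAccepting_OK" ]; then
        value=$GmetricAglt2Health_OK                                                              # 6
    else
        value=$GmetricAglt2Health_not_accepting_jobs                                              # 5
    fi

    # publish the result as the "AGLT2 Health" ganglia metric
    gmetric --name="AGLT2 Health" --value="$value" --type=uint8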
A crontab entry invokes this gmetric_aglt2_health.sh script once per minute.
--
PhilippeLaurens - 29 Apr 2010