Monitoring: AGLT2 Compute Summary Page
The initial idea was to simply extract and rearrange some lines of the HTML Ganglia
page for the MSU site, in order to reorder the load_one graphs into a compact
table that would mimic the physical racks layout.
The project evolved into an AGLT2-wide summary page that gathers in one place the
Condor, Panda, and Ganglia information relating to the production and analysis compute load.
Page Location
A Python script on a desktop computer collects information from
the various sources and updates the MSU dept web server:
AGL Compute Summary Page
This page may move to www.aglt2.org in the future.
Panda Info
The top section copies the job summary table for the production site AGLT2, and
the job summary table for the analysis site ANALY_AGLT2; both are taken from panda.cern.ch.
The integration time of 6 hours was chosen for the tables as a compromise to
show the current success/failure rate of the compute jobs with reasonable
statistics (e.g. 3 hours seemed a bit short, and 24 hours averaged errors
over too many jobs).
The title of each table (e.g. "Panda Details for AGLT2 (6h)") links to the source Panda page.
Condor Info
This is the graph displayed on the Condor Job Status page at
gate01.aglt2.org; it shows the number of running and queued production, analysis, and T3 jobs.
The graph links to the source Condor Status Page.
Panda Graphs
These graphs show the 24-hour plots of Panda jobs for the AGLT2 production
and ANALY_AGLT2 analysis queues.
They are taken from the site summaries at gridinfo.triumf.ca.
Each graph links to the source Panda page, which includes the same graphs for Hour/Day/Month/Year.
Ganglia Info
The first implementation was just extracting and reordering the HTML data from the
"MSU T2" ganglia page into a table following the physical arrangement of the nodes.
The next step was to realize that it would be rather easy to change the color coding
ganglia uses to illustrate the degree of usage of each node. The range values ganglia uses (0-25-50-75-100-100+) are not very useful to us, as we would like more emphasis
and discrimination around the 100% value. The color coding is also counter-intuitive from
our perspective, as ganglia shows loaded nodes in orange and red, while we would
consider this the "good" state and would rather see it green. The script thus replaces
these ranges and colors with new ones:
| load_one range | color | label |
| 00.00 - 00.85 | white | idle |
| 00.85 - 04.00 | light blue | lightly loaded |
| 04.00 - 07.60 | light green | below potential |
| 07.60 - 08.40 | dark green | matched load |
| 08.40 - 12.00 | light orange | overloaded |
| 12.00 - 16.00 | dark orange | trouble? |
| 16.00+ | red | ouch! |
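The binning in the table above can be sketched in Python as follows. This is only an illustration, not the script's actual code: the names `BIN_EDGES`, `BINS`, and `classify_load` are made up here, and the choice to put a value that falls exactly on an edge into the higher bin is an assumption, since the table does not say which side is inclusive.

```python
import bisect

# Upper edge of every bin except the open-ended last one (values from the table).
BIN_EDGES = [0.85, 4.00, 7.60, 8.40, 12.00, 16.00]

# (color, label) per bin; one more entry than BIN_EDGES for the "16.00+" bin.
BINS = [
    ("white", "idle"),
    ("light blue", "lightly loaded"),
    ("light green", "below potential"),
    ("dark green", "matched load"),
    ("light orange", "overloaded"),
    ("dark orange", "trouble?"),
    ("red", "ouch!"),
]


def classify_load(load_one):
    """Return the (color, label) pair for a load_one value."""
    # Assumption: a value exactly on an edge goes to the higher bin.
    return BINS[bisect.bisect_right(BIN_EDGES, load_one)]
```

For example, `classify_load(8.0)` falls in the 07.60 - 08.40 bin and yields the "matched load" entry, which is the state this page treats as good.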
Each node is represented by a cell of the corresponding background color and filled with
the ganglia load_one graph of the same background color scaled down to the cell size.
The shrunken graphs still give (once you know what you are looking at) some sense
of time history of the node's activity.
Note: re-coloring the ganglia graphs is possible, and quite simple, because
both the load_one value and the graph's color are passed to the ganglia server,
which generates each graph.
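Since the color travels with the graph request, re-coloring amounts to rewriting one parameter of the graph URL. Below is a minimal sketch of that idea using only the Python standard library. The helper `set_query_params`, the example host, and the parameter name `color` are all hypothetical; the real parameter names should be read off the graph URLs that your ganglia installation emits.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_params(url, **overrides):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    # Keep blank values so that parameters such as "x=" survive the round trip.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical graph URL: host, path, and parameter names are placeholders.
original = "http://ganglia.example.org/graph.php?h=node01&m=load_one&color=ff0000"
recolored = set_query_params(original, color="006400")
```

All other parameters are left untouched, so the server still draws the same metric for the same node, only with the new background color.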
When a node is flagged "down" in ganglia, the summary page shows the number of seconds,
and then the number of days, since the node was last heard from.
The MSU rack arrangement of compute nodes is quite regular and uniform. The script expects to
find nodes at particular addresses; it labels the slots occupied by a switch as "Switch"
and flags any expected node it cannot find as "Missing?".
The UM rack arrangement of compute nodes is less uniform, and the script relies solely on the rocks
naming convention xx-racknum-slotnum to determine the location of each node. It therefore cannot
tell that a node is missing (unless the node is flagged "down").
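Deriving a rack position from a name of the form xx-racknum-slotnum can be sketched as below. The regular expression, the function name `parse_node_name`, and the sample host names are assumptions for illustration; in particular, the prefix is assumed to consist of lowercase letters only.

```python
import re

# prefix-racknum-slotnum, e.g. "cc-102-7" (a made-up example name).
NODE_NAME = re.compile(r"^(?P<prefix>[a-z]+)-(?P<rack>\d+)-(?P<slot>\d+)$")


def parse_node_name(hostname):
    """Return (rack, slot) for a conforming name, or None otherwise."""
    short_name = hostname.split(".", 1)[0]  # drop any domain part
    match = NODE_NAME.match(short_name)
    if match is None:
        return None  # name carries no rack location; handle as a special case
    return int(match.group("rack")), int(match.group("slot"))
```

A name that does not match returns `None`, which leaves room for a separate lookup table for nodes whose names carry no location.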
The naming of blade nodes does not follow this geographic nomenclature, so they are
treated as a special case, with their rack location known to the script. (Note: I know of
no HTML directive to rotate an image, so the ganglia graphs for blade nodes are
not very readable. I may explore using JavaScript in the future.)
Each compute node cell within a table links to the node's Ganglia page.
Each 24-hour UM or MSU ganglia graph links to the source Ganglia page.
Refresh Rate
The script updates the HTML file every 60 s.
The HTML file is set to auto-refresh in your browser every 5 min.
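The two intervals can be illustrated with the sketch below. It assumes the browser refresh is done with an HTML meta refresh tag, which the text above does not state, and every name in it (`render_page`, `write_atomically`, `run_forever`) is hypothetical. The file is written to a temporary name and then renamed, so that a browser never loads a half-written page.

```python
import os
import tempfile
import time

UPDATE_SECONDS = 60        # how often the script rewrites the file
REFRESH_SECONDS = 5 * 60   # how often the browser reloads the page


def render_page(body_html):
    """Wrap body_html in a page that asks the browser to reload itself."""
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="{REFRESH_SECONDS}">'
        "<title>Compute Summary</title></head>"
        f"<body>{body_html}</body></html>"
    )


def write_atomically(path, text):
    """Write text to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    os.replace(tmp_path, path)


def run_forever(path, build_body):
    """Regenerate the page every UPDATE_SECONDS; build_body returns HTML."""
    while True:
        write_atomically(path, render_page(build_body()))
        time.sleep(UPDATE_SECONDS)
```

Updating the file more often than the browser reloads it means that a manual reload always picks up data that is at most about a minute old.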
Routing and Caching
The aglt2.org domain is not reachable from everywhere. This means
that e.g. all Ganglia graphs are out of reach from a typical home network.
Even when aglt2.org is unreachable, the UM and MSU Ganglia Rack sections will
still show the color coded load_one bin of each node, and the summary table
will still show the relative number of nodes in each bin range. The main added value
of this page over the raw ganglia information is thus still achieved even though the per-node
ganglia graphs themselves are not present.
To keep all summary graphs available when aglt2.org is not
reachable, the Condor status graph and the MSU and UM Ganglia 24-hour summary graphs
are copied to, and served from, the MSU dept web server.
--
PhilippeLaurens - 04 Jun 2009