Auto Test Programs over AGLT2
Cluster Related
PNFS mount point test
Purpose Make sure every compute node has "/pnfs/aglt2.org" mounted, and that every gridftp door node has "/pnfs/aglt2.org" mounted and also has "/pnfs/ftpBase", a soft link to "/pnfs/aglt2".
Frequency every 4 hours
Alert Sends email to aglt2-hardware@umich.edu
cron service
umopt1.aglt2.org (checks all compute nodes at UM)
0 0-23/4 * * * /usr/bin/perl /home/install/wuwj_extras/bin/check_pnfs_UM_wn.pl 2> /dev/null 1> /dev/null
msurox.aglt2.org (checks all compute nodes and MSU fs servers at MSU)
0 0-23/4 * * * /usr/bin/perl /home/install/extras/bin/check_pnfs_MSU.pl 2> /dev/null 1> /dev/null
umopt1.grid.umich.edu (checks fs servers at UM; runs as user wuwj, whose AFS token must be renewed every 30 days, because the UM fs servers don't allow passwordless root ssh)
0 0-23/4 * * * /usr/bin/perl /home/install/wuwj_extras/bin/check_pnfs_UM_sv.pl 2> /dev/null 1> /dev/null
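The Perl check scripts themselves are not reproduced on this page. As a rough illustration of the per-node test described above, here is a minimal Python sketch. It assumes the node exposes a /proc/mounts-style mount table and that the soft link target is stored exactly as "/pnfs/aglt2"; the function names are hypothetical and are not taken from the real scripts.

```python
import os


def is_mounted(mount_point, mounts_text):
    """Return True if mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def link_points_to(link_path, expected_target):
    """Return True if link_path is a symbolic link whose stored target is expected_target."""
    # Assumption: the link stores the target verbatim (no trailing slash, not relative).
    return os.path.islink(link_path) and os.readlink(link_path) == expected_target


def check_node(is_gridftp_door, mounts_text):
    """Return a list of problems found on this node (empty list means healthy)."""
    problems = []
    if not is_mounted("/pnfs/aglt2.org", mounts_text):
        problems.append("/pnfs/aglt2.org is not mounted")
    # Only gridftp door nodes are expected to carry the ftpBase soft link.
    if is_gridftp_door and not link_points_to("/pnfs/ftpBase", "/pnfs/aglt2"):
        problems.append("/pnfs/ftpBase is missing or does not link to /pnfs/aglt2")
    return problems
```

On a Linux node, check_node(False, open("/proc/mounts").read()) would return an empty list when the mount is present; a non-empty list is what a wrapper script would mail out.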
Host Cert expiration check
Purpose Check the expiration dates of all host certs, which are stored on umopt1.aglt2.org
Frequency every day
Alert
Sends email to aglt2-hardware@umich.edu about certs that expire in less than a month
Displays the expiration dates of the certs on this web page: monitor_cert
cron service
umopt1.aglt2.org (check all certs for UM and MSU nodes)
0 5 * * * /usr/bin/perl /home/install/extras/bin/check_cert.pl
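How check_cert.pl reads the certs is not shown here. One common approach is to run `openssl x509 -enddate -noout -in <cert>` and parse the `notAfter=` line it prints. The Python sketch below assumes that output format and takes "less than a month" to mean 30 days; both are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_not_after(openssl_line):
    """Parse a line such as 'notAfter=Dec 11 12:00:00 2009 GMT' into a naive UTC datetime."""
    value = openssl_line.strip().split("=", 1)[1]
    # openssl reports the time in GMT; drop the suffix and parse the rest.
    if value.endswith(" GMT"):
        value = value[: -len(" GMT")]
    return datetime.strptime(value, "%b %d %H:%M:%S %Y")


def is_expiring(not_after, now, warn_days=30):
    """Return True if the cert expires less than warn_days after now."""
    return not_after - now < timedelta(days=warn_days)
```

A daily job would call is_expiring(parse_not_after(line), datetime.utcnow()) for each cert and collect the hosts for which it returns True into one email.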
dCache Related
check dead pools of dCache
Purpose check if there are any pools whose status is dead
Frequency every 5 minutes
Alert Sends email to Wenjing and Shawn if any fs pools become dead.
Cronjob
head02.aglt2.org
*/5 * * * * cd /root/dcache_adm_script/dCache/check_poolstate;perl report_dead.pl
cleandb
Purpose Clean stale entries from the SRM database. A stale entry left by a failed write would otherwise prevent a user from writing a file with the same name to dCache.
Frequency every 10 minutes
Alert None
Cronjob
head02.aglt2.org
*/10 * * * * cd /root/dcache_adm_script/dCache/clean_sp_db/;/usr/bin/perl cleandb.pl
srm put/get report/statistics
Purpose
compute the success and failure rates of SRM PUT/GET requests within each space token area
classify error messages
rotate the SRM requests db (delete entries older than 4 hours)
Frequency every 4 hours
Alert Sends email to Wenjing, Shawn and Bob if there are any unusual (fatal) failures.
Cronjob
head02.aglt2.org
0 0-23/4 * * * cd /root/dcache_adm_script/dCache/srm_err_report; perl report.pl
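The per-token statistics described above amount to a grouped tally. The following Python sketch shows the idea on in-memory records; the (space token, operation, succeeded) tuple layout is a hypothetical stand-in, since the real SRM database schema and the logic of report.pl are not shown on this page.

```python
from collections import defaultdict


def success_rates(requests):
    """Tally SRM requests per (space_token, operation).

    requests: iterable of (space_token, operation, succeeded) tuples.
    Returns {(space_token, operation): {"ok", "failed", "success_rate"}}.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [succeeded, failed]
    for token, operation, succeeded in requests:
        counts[(token, operation)][0 if succeeded else 1] += 1

    report = {}
    for key, (ok, failed) in counts.items():
        # total is never zero: a key exists only after at least one request.
        total = ok + failed
        report[key] = {"ok": ok, "failed": failed, "success_rate": ok / total}
    return report
```

For example, two PUT requests against the same token, one successful and one failed, yield a success_rate of 0.5 for that (token, "PUT") pair.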
stat_fileno
Purpose Compare the number of files reported by each pool cell with the number registered in the PNFS DB, to see whether any pools failed to register with PNFS
Frequency every day
Alert Displays the file counts from each pool cell and from the PNFS DB on this monitor page: FileNO_Stat
Cronjob
head02.aglt2.org
0 8 * * * cd /root/dcache_adm_script/dCache/stat_fileno_inpool; perl stat_fileno.pl poollist
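Once both sets of counts are in hand, the comparison itself is simple. Here is a minimal Python sketch, assuming the counts have already been collected into two dictionaries keyed by pool name. How stat_fileno.pl actually queries the pool cells and the PNFS DB is not shown here, and the pool names in the usage note are hypothetical.

```python
def find_mismatches(pool_counts, pnfs_counts):
    """Return {pool: (files_in_pool_cell, files_in_pnfs_db)} for pools whose counts differ.

    A pool missing from either side is treated as having zero files there.
    """
    mismatches = {}
    for pool in sorted(set(pool_counts) | set(pnfs_counts)):
        in_pool = pool_counts.get(pool, 0)
        in_pnfs = pnfs_counts.get(pool, 0)
        if in_pool != in_pnfs:
            mismatches[pool] = (in_pool, in_pnfs)
    return mismatches
```

With pool_counts = {"pool_a": 10, "pool_b": 7} and pnfs_counts = {"pool_a": 10, "pool_b": 5}, only pool_b is reported, as (7, 5).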
Stat Usage of Typical Pools
Purpose
for each space token, list all affiliated pools and their usage
for each fs node, list all of its pool groups and their usage
Frequency every 8 hours
Alert Displays the statistics on this web page: Typical Pool Usage
Cronjob
head01.aglt2.org
0 */8 * * * cd /root/dcache_admin_script/stat_dcache_pools;perl stat_poollist.pl
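The hour fields in the cron entries on this page (`0-23/4`, `*/8`, `8`, `5`) are easy to misread. The Python sketch below expands one hour field into the hours it matches. It handles only the forms used here (`*`, a single number, a range, each with an optional `/step`) and is not a full crontab parser.

```python
def expand_hour_field(field):
    """Expand a cron hour field into the sorted list of hours (0-23) it matches.

    Supported forms: '*', 'N', 'A-B', and '*/S' or 'A-B/S'. Comma lists are not handled.
    """
    step = 1
    if "/" in field:
        field, step_text = field.split("/", 1)
        step = int(step_text)

    if field == "*":
        start, end = 0, 23
    elif "-" in field:
        start_text, end_text = field.split("-", 1)
        start, end = int(start_text), int(end_text)
    else:
        # A single number matches exactly that hour.
        start = end = int(field)

    return list(range(start, end + 1, step))
```

So `0-23/4` runs at hours 0, 4, 8, 12, 16 and 20, `*/8` runs at hours 0, 8 and 16, and a plain `8` runs once a day at 08:00.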
-- WenjingWu - 11 Dec 2008