Using "Monit" for Monitoring and Repairing AGLT2 Services
NOTE: THIS PAGE IS NOW MOSTLY OBSOLETE, WITH MONIT INSTALLED VIA CFENGINE
The monit application monitors and "repairs" host and service problems. It is easy to deploy and configure. If you have the DAG repo set up in yum, simply run:
yum install monit
Monit has many built-in options for testing resources and services. See man monit once it is installed for an overview.
"Monit" Configuration
The default setup installs 'monit' on the current system and creates a 'monit' service (which is chkconfig'ed off). It is more robust to remove this service and run monit from inittab instead, so that init respawns it if it exits:
chkconfig --del monit
Then edit /etc/inittab and append something like:
#+SPM monit daemon 09Apr2009
mo:2345:respawn:/usr/bin/monit -Ic /etc/monit.conf
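As a minimal sketch, the inittab edit can be made repeatable so that running it twice does not add a duplicate entry. The helper name add_monit_inittab is hypothetical, and the file path is a parameter so the function can be tried on a copy of /etc/inittab first.

```shell
# Sketch only: append the monit respawn entry to an inittab file if it is missing.
# add_monit_inittab is a hypothetical helper, not part of monit.
add_monit_inittab() {
    inittab="${1:-/etc/inittab}"
    entry='mo:2345:respawn:/usr/bin/monit -Ic /etc/monit.conf'
    # -F fixed string, -x whole-line match, -q quiet: true only if the exact line exists
    if ! grep -Fxq -- "$entry" "$inittab"; then
        printf '%s\n' "$entry" >> "$inittab"
    fi
}
```

Run as root, for example: add_monit_inittab /etc/inittab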
Before "starting" this we need to fix the default configuration. The config file is /etc/monit.conf; make sure the following lines are present (and suitably customized for your install):
set daemon 60
set logfile syslog facility log_daemon
set mailserver 10.10.1.3, umopt1.aglt2.org, umopt1.grid.umich.edu, localhost
set eventqueue basedir /var/monit slots 100
set alert aglt2-hardware@umich.edu
set httpd port 2812 and use address linat02.grid.umich.edu
    allow admin:<pw_removed>
    allow 10.10.0.0/22
    allow 192.41.230.0/23
    allow 141.211.0.0/16
    ssl enable
    pemfile /etc/grid-security/monit.pem
check system linat02.grid.umich.edu
    if loadavg (5min) > 4 then alert
include /etc/monit.d/*
Some quick comments on the options in the monit.conf file above. First, set mailserver <Your_smtp_server> must name a mail server that is appropriate for and reachable from this host. As you can see, you are allowed to provide a list of servers. The set alert <email_address> line should be configured with an appropriate email destination.
The set httpd line needs to be set up for this host. Put in your own password for the 'admin' user. NOTE: protect this file so only 'root' can read it! You can also specify the hosts/subnets which are allowed to connect. To enable "ssl", add ssl enable, but NOTE that this requires a pemfile line (as shown). If you already have host certificates you can create the 'monit.pem' file as follows:
- Copy the hostkey.pem: cp /etc/grid-security/hostkey.pem /etc/grid-security/monit.pem NOTE: doing this gives the monit.pem file the right protection.
- Append the hostcert.pem: cat /etc/grid-security/hostcert.pem >> /etc/grid-security/monit.pem
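The two steps above can also be sketched as one helper that writes the key followed by the certificate and then restricts the result to its owner. The name make_monit_pem is hypothetical; the three paths are arguments, so nothing here assumes where your host credentials live.

```shell
# Sketch only: build a combined PEM (private key first, then certificate).
# make_monit_pem is a hypothetical helper name.
make_monit_pem() {
    key="$1"; cert="$2"; out="$3"
    # The subshell keeps the restrictive umask from leaking into the caller,
    # so the new file is never group- or world-readable, even briefly.
    ( umask 077 && cat "$key" "$cert" > "$out" ) || return 1
    chmod 600 "$out"
}
```

Run as root, for example: make_monit_pem /etc/grid-security/hostkey.pem /etc/grid-security/hostcert.pem /etc/grid-security/monit.pem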
The check system line also needs to be customized with the install host name. The last line "includes" whatever other configurations you want to apply to this host. This creates a "plugin" environment in which common service, device or resource configurations can easily be shared between monit configurations.
To start monit via the inittab simply do telinit q
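Because init will keep respawning monit, it can help to check the configuration before signalling init. The following is a sketch under that assumption; start_monit_via_init is a hypothetical helper, while monit -t (syntax check), monit -c (control file) and telinit q are the standard commands already used on this page.

```shell
# Sketch only: protect the control file, check its syntax, then ask init
# to re-read /etc/inittab. start_monit_via_init is a hypothetical helper.
start_monit_via_init() {
    conf="${1:-/etc/monit.conf}"
    # The control file holds the 'admin' password, so keep it owner-only.
    chmod 600 "$conf" || return 1
    if monit -t -c "$conf"; then
        telinit q
    else
        echo "monit configuration check failed; not signalling init" >&2
        return 1
    fi
}
```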
Managing Monit on AGLT2 Nodes
The 'monit' service is very persistent: if you turn off a service it is monitoring, it will quickly restart it (and/or alert on that fact). If you need to change the state of a service "manually", be sure to tell monit to stop monitoring that service first. This can be done via the web interface (see list below) or via the monit command line interface.
Here are some useful monit commands:
monit -t # This tests the current configuration's syntax for validity
monit status # Gives information on monit's status (details of what it is monitoring and their status)
monit unmonitor <x> # Turns off monitoring for <x>
monit reload # Reload the (updated?) configuration
monit -h # Get list of commands possible
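For manual maintenance, the unmonitor step can be paired with the matching monit monitor command so that monitoring is switched back on afterwards. This is a sketch, not an AGLT2 procedure: with_unmonitored is a hypothetical wrapper, and the service name must match the name used in the corresponding check statement.

```shell
# Sketch only: run a maintenance command while monit is not watching a service.
# with_unmonitored is a hypothetical helper name.
with_unmonitored() {
    svc="$1"; shift
    monit unmonitor "$svc" || return 1
    rc=0
    # Run the maintenance command and remember its exit status.
    "$@" || rc=$?
    # Re-enable monitoring whether or not the command succeeded.
    monit monitor "$svc"
    return "$rc"
}
```

For example, assuming a check named ntpd exists: with_unmonitored ntpd /etc/init.d/ntpd restart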
Also, if you update or change services that 'monit' is watching, you may need to update the corresponding configuration in /etc/monit.d/. If you don't AND something about the service configuration is different after your change, 'monit' may complain or fail to handle this service properly until you fix the config.
Current "Monit" Service/Resource Configurations
For AGLT2 we are primarily monitoring the following services and resources:
- MySQL via a mysqld.conf configuration. Needs customization for the PID file, MySQL port and MySQL socket. Will restart the 'mysql' or 'mysqld' service as required if the service fails.
- ntpd via a ntpd.conf configuration. This one is fairly generic and shouldn't require customization. Checks the ntp service directly on udp port 123 as well. Will (re)start ntpd as required if it is not running or fails.
- Root filesystem via filesystem.conf configuration. This is also generic and shouldn't require customization. Monitors the '/' filesystem and alerts if the flags change (e.g. changes to RDONLY) or if the disk usage goes over 98%.
- LFC via lfcdaemon.lfc configuration. Monitors the lfcdaemon process and the lfc log file. Can restart the lfcdaemon if either the CPU usage is > 80% or the log file has not been updated in 60 minutes. Alerts are sent if CPU usage is > 60% or the log file has not changed in 5 minutes.
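To give a feel for what such drop-in files look like, here is a sketch of the two generic ones (ntpd and the root filesystem), written out by a small shell helper. These are NOT the actual AGLT2 files: write_monit_examples is a hypothetical helper, the pidfile and init script paths are assumptions typical of a Red Hat style host, and some older monit releases use the keyword check device in place of check filesystem (see man monit for your version).

```shell
# Sketch only: write illustrative ntpd and root-filesystem checks into a
# monit include directory. Paths inside the configs are assumptions.
write_monit_examples() {
    dir="${1:-/etc/monit.d}"
    cat > "$dir/ntpd.conf" <<'EOF'
# Restart ntpd if the process is gone or UDP port 123 stops answering.
check process ntpd with pidfile /var/run/ntpd.pid
    start program = "/etc/init.d/ntpd start"
    stop program  = "/etc/init.d/ntpd stop"
    if failed host 127.0.0.1 port 123 type udp then restart
EOF
    cat > "$dir/filesystem.conf" <<'EOF'
# Alert if '/' changes mount flags (e.g. goes read-only) or is over 98% full.
check filesystem rootfs with path /
    if changed fsflags then alert
    if space usage > 98% then alert
EOF
}
```

After writing files like these, monit -t checks the syntax and monit reload picks up the change.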
We need to create additional configurations for the following:
- httpd and/or apache
- globus-gatekeeper
- tomcat-55
- dCache services --- there is a large list of possibilities here
List of "Monit" URLs for AGLT2
NOTE: These are only accessible from AGLT2 IPs!
- linat02 monit services (AFS DB server, GUMS server, NIS/KRB5 server)
- linat03 monit services (AFS DB server, GUMS server, NIS/KRB5 server)
- linat04 monit services (AFS DB server, GUMS server, NIS/KRB5 server)
- linat05 monit services (Web Server, CFengine server)
- linat06 monit services (AFS File server)
- linat07 monit services (AFS File server)
- linat08 monit services (AFS File server)
- gate02 monit services (Globus Gatekeeper)
- gate01 monit services (Globus Gatekeeper)
- lfc monit services (AGLT2 LFC server)
- dq2 monit services (AGLT2 DQ2 server)
-- ShawnMcKee - 09 Apr 2009