Fixing OMD Setup for AGLT2 Use-cases
We are hitting a number of issues as we try to set up OMD for AGLT2 use. Below is the current list of issues. As we find solutions we will record the details here as well.
- Some network interfaces get listed as 'CRIT' because they are down/unknown. See http://omd.aglt2.org/atlas/check_mk/view.py?view_name=service&site=&service=Interface%2008&host=umfs09 for an example. In this case 'p1p1' is an interface set up to host one or more VLANs (p1p1.1623 or p1p1.1624). SOLUTION: This interface actually is DOWN and should be removed from UMFS09 (it was for SC13).
- The filesystem checks have default settings of 80/90% for warn/critical. Some /boot areas are managed and run close to full. See http://omd.aglt2.org/atlas/check_mk/view.py?view_name=service&site=&service=fs_/boot&host=umfs09 for an example. SOLUTION: Reconfigure the umfs09.mk file to add a check_parameters configuration:
- check_parameters += [ ( (98, 99), [ "umfs09" ], [ "fs_/boot" ] ), ]
- The power supply on some of our Dell PowerConnect devices is causing an alert like 'CRIT - Condition of PSU "System" is notFunctioning, with source Alternating Current'. This power supply doesn't exist. Instead there is one called "Main" (rather than "System"). SOLUTION: We need to ignore this "found" P/S by using the ignored_services configuration. NOTE: after changing the config file you need to re-inventory using 'cmk -II <host>'
- ignored_services += [ ( [ "switch", "dell" ], [ "sw11" ], [ "Sensor System$" ] ), ]
- Problem with 'WARNING - 1 unchecked services (ipmi:1)' (or similar). The issue is that something went wrong during the first inventory. SOLUTION: Run the inventory again, clearing out the old entries: 'cmk -uII <host>'
- Some services require "local" agents that need to be added on the systems to be monitored. The OMD host has a bunch in /opt/omd/versions/1.00/share/check_mk/agents/plugins/. I have moved these to ~smckee/public/check_mk/agent/plugins. The action item here is to package these in RPMs and only install the ones we need. The check_mk-agent-logwatch src RPM might be a useful template. The destination on the monitored system should be /usr/lib/check_mk_agent/plugins/ with any associated cfg file in /etc/check_mk.
- Add check_dell_bladechassis to the OMD server (see http://flakrat.blogspot.com/2011/11/using-checkdellbladechassis-with.html ). Follow those instructions to install (you will need to copy the check_dell_bladechassis binary and php templates to the right OMD locations). NOTE: Not all of the pnp4nagios graphs show up yet. This needs debugging.
- There doesn't seem to be a template for the APC ARUs, so one needs to be created. I did download powernet409.mib and put it in /usr/share/snmp/mibs on omd.aglt2.org
- Sometimes the nagios.cfg in /omd/sites/atlas/tmp/nagios/nagios.cfg disappears and causes an error in 'cmk -O'. I saved a copy in /opt/omd/sites/atlas/etc/nagios/nagios-tmp.cfg. The file can be recreated by setting Nagios as the monitoring core and restarting.
- In innovation version check_mk-1.2.3i7p2 the apc_powerswitch inventory worked. It fails in check_mk-1.2.4b1 and b3.
- I created a list of rac (Remote Access) nodes on subnet 10.10.0.0/24 for check_mk as follows:
- nmap -v --system-dns -T2 -sn 10.10.0.0/24 | grep -B1 "Host is up" | grep "^Nmap" | awk '{print $5}' | grep rac | awk -F\. '{print $1}' > rac-hosts.txt
- rm -f /tmp/rac*.mk
- for n in `cat rac-hosts.txt`; do cp /root/rac-template.mk /tmp/$n.mk; sed -i "s/HOSTNAME/$n/" /tmp/$n.mk; done;
- Copy these as user 'atlas' to ~/etc/check_mk/conf.d and run 'cmk -I rac'
- After updating the firmware on the Dell UPSes they no longer support SNMP v2c. Revert to using SNMP v1 for these checks by adding the "nobulk" tag to the appropriate UPS entries. I also had to manually re-enable SNMP after the update.
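The per-host file generation in the rac steps above (the cp/sed loop) can also be sketched in Python. This is only an illustration, not the procedure actually used: the function name write_host_configs is hypothetical, and the contents of /root/rac-template.mk are not shown on this page, so the template text is passed in by the caller. Note one difference from the sed command: str.replace substitutes every occurrence of HOSTNAME, while sed without the g flag substitutes only the first one on each line.

```python
from pathlib import Path


def write_host_configs(hostnames, template_text, out_dir, placeholder="HOSTNAME"):
    """Write one <host>.mk file per hostname from a template string.

    Hypothetical helper: every occurrence of `placeholder` in the template
    is replaced by the host name. Blank entries are skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in hostnames:
        name = name.strip()
        if not name:
            continue
        target = out_dir / f"{name}.mk"
        target.write_text(template_text.replace(placeholder, name))
        written.append(target)
    return written
```

A possible use, assuming rac-hosts.txt and the template file exist as described above: write_host_configs(open("rac-hosts.txt"), open("/root/rac-template.mk").read(), "/tmp").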
For configuring check parameters see http://mathias-kettner.com/checkmk_check_parameters.html
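To make the effect of the /boot rule in the list above concrete, here is a rough Python sketch of how a (warn, crit) pair might be selected and applied. It is not Check_MK's own code. The function names are hypothetical, and two behaviors are assumptions: that service patterns are regular expressions matched from the start of the service name, and that the thresholds are inclusive.

```python
import re

# Hypothetical rule list in the same (levels, hosts, service patterns) shape
# as the check_parameters entry for umfs09; not read from a real site.
RULES = [
    ((98, 99), ["umfs09"], ["fs_/boot"]),
]
DEFAULT_LEVELS = (80, 90)  # the default warn/crit percentages noted above


def levels_for(host, service, rules=RULES, default=DEFAULT_LEVELS):
    """Return the (warn, crit) pair of the first rule matching host and service."""
    for levels, hosts, patterns in rules:
        # Assumption: patterns are regexes anchored at the start of the name.
        if host in hosts and any(re.match(p, service) for p in patterns):
            return levels
    return default


def fs_state(used_percent, levels):
    """Map a usage percentage to OK/WARN/CRIT (thresholds assumed inclusive)."""
    warn, crit = levels
    if used_percent >= crit:
        return "CRIT"
    if used_percent >= warn:
        return "WARN"
    return "OK"
```

Under these assumptions a /boot filesystem at 95% on umfs09 evaluates to OK with the (98, 99) rule, while the same usage would be CRIT with the default (80, 90) levels.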
Testing "Innovation" Version of Check_MK in OMD
The default OMD configuration installs the stable version of check_mk. However, the current "innovation" version has support for VMware. To try this out I downloaded the newest tar-ball (http://mathias-kettner.de/download/check_mk-1.2.4b1.tar.gz). This needs to be put someplace the "site" user can see it. I put it in /tmp.
The site user needs to unpack it, cd into the directory and run ./setup.sh --yes. When I did this as user 'atlas' I found I didn't have the needed gcc-c++ compiler and needed to run 'yum install gcc-c++' as 'root'. I followed these instructions:
How to install Check_MK into an existing OMD site
This feature is experimental:
You can install Check_MK into an existing OMD site by unpacking the installation tarball and running ./setup --yes as OMD site user.
This will install Check_MK into the site's local/ filesystem hierarchy.
After that relogin (to make sure the shell really executes the new cmk command) and do a cmk -U in order to re-create your Nagios configuration.
Uninstalling must be done manually. This can be done with the following commands provided that no other files you need are lying around below local! Use at your own risk!
omd stop
cd ~
find local -type f | xargs rm
cd etc/apache/conf.d
ln -sfn ../../check_mk/apache.conf check_mk.conf
cd ~/etc/nagios/nagios.d
ln -sfn ../../mk-livestatus/nagios.cfg mk-livestatus.cfg
rm -f ~/etc/check_mk/apache-local.conf
rm -f ~/etc/mk-livestatus/nagios-local.cfg
(logout and log back in)
cmk -U
omd start
The first issue I hit is that the monitoring core must be set to Nagios (Icinga and Shinken caused problems for 'cmk -O').
Next there is a problem with this version and 'cmk -O'. If you run it, Livestatus fails: the socket that is supposed to exist at ~/tmp/run/live disappears. The fix is to run 'cmk -R' (but this takes quite a bit longer).
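A quick way to see whether the Livestatus socket has gone missing is to test the path directly. The sketch below is only a convenience check run as the site user. The function name is hypothetical, and it only confirms that a UNIX socket exists at the path, not that Livestatus is answering queries.

```python
import os
import stat


def livestatus_socket_present(path):
    """Return True if `path` exists and is a UNIX domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, unreadable directory, and similar errors.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    # ~/tmp/run/live is the socket location mentioned above.
    live = os.path.expanduser("~/tmp/run/live")
    print(live, "present" if livestatus_socket_present(live) else "MISSING")
```

If this reports the socket as missing after a 'cmk -O', running 'cmk -R' as described above should bring it back.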
After that things seem to work, and I was able to install VMware monitoring following http://mathias-kettner.de/checkmk_vsphere.html
--
ShawnMcKee - 09 Dec 2013