
linux-ha at mm
Jul 1, 2008, 8:55 AM
Post #2 of 2
Re: Strange HB Status displayed for root vs. unprivileged users; bug or feature?
On Tue, Jul 01, 2008 at 04:04:54PM +0200, Ralph.Grothe[at]itdz-berlin.de wrote:
> After I had successfully upgraded this cluster to the new OS I was
> wondering, why my Nagios plugin always returned CRITICAL states
> though heartbeat was running on the node at the time.
> Then I discovered that the output of my check command differed
> decisively depending on who executed the check.
>
> e.g. as root I get
>
> # /usr/lib64/nagios/plugins/custom/check_heartbeat.sh
> OK - heartbeat is running on nodeA
>
> or rather what really gets executed in that plugin and whose
> output merely gets parsed is
>
> # /usr/lib64/heartbeat/heartbeat -s
> heartbeat OK [pid 31017 et al] is running on nodeA [nodeA]...
>
> # pgrep -P1 -fl heartbeat
> 31017 heartbeat: master control process
>
> But when run as an unprivileged user, as is the case when the nrpe
> daemon is executing the check, oops, I get this strange result
>
> # /usr/lib64/nagios/plugins/check_nrpe -n -H localhost -c check_heartbeat
> CRITICAL - heartbeat is stopped on nodeA
>
> How come, is this a bug or intended behavior?

I've just had a quick look through the source to see what the -s flag
actually does (I'll need to set up monitoring of heartbeat in Nagios
shortly, as it happens). It reads the PID file, then checks that the
process is running and that the process with that PID is actually
heartbeat (by checking that its /proc/.../exe is a link to the heartbeat
binary).

On my system, even though the process directory and the symlinks therein
appear to be world-readable, they're not:

    $ ls -la /proc/`sed 's/ *//' /var/run/heartbeat.pid`
    ls: cannot read symbolic link /proc/18467/cwd: Permission denied
    ls: cannot read symbolic link /proc/18467/root: Permission denied
    ls: cannot read symbolic link /proc/18467/exe: Permission denied

When heartbeat run as an unprivileged user tries to ascertain that the
process running with that particular PID is actually heartbeat, it cannot
read the exe link, encounters an error, and the check therefore fails.

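To make the failure mode concrete, here is a minimal Python sketch of the
kind of check described above. It is an illustration only, not heartbeat's
actual code: the function name check_pid_is_binary and the proc_root
parameter are hypothetical, and comparing the link target as a plain string
is a simplifying assumption.

```python
import os


def check_pid_is_binary(pid_file, expected_exe, proc_root="/proc"):
    """Return (ok, message) for a PID-file based liveness check.

    Hypothetical helper: reads a PID from pid_file, then verifies that
    <proc_root>/<pid>/exe is a symlink pointing at expected_exe.
    """
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError) as err:
        return False, f"cannot read PID from {pid_file}: {err}"

    exe_link = os.path.join(proc_root, str(pid), "exe")
    try:
        target = os.readlink(exe_link)
    except PermissionError:
        # This is the case an unprivileged user hits: the process may be
        # running, but the link cannot be read, so the check cannot confirm it.
        return False, f"permission denied reading {exe_link}"
    except OSError as err:
        # Typically the process directory does not exist (process not running).
        return False, f"cannot read {exe_link}: {err}"

    if target != expected_exe:
        return False, f"pid {pid} is {target}, not {expected_exe}"
    return True, f"pid {pid} is {expected_exe}"
```

Called as check_pid_is_binary("/var/run/heartbeat.pid",
"/usr/lib64/heartbeat/heartbeat"), a check like this can only confirm the
process when the caller is allowed to read the exe link. A caller that
treats every failure as "stopped" would then report a running heartbeat as
stopped, which matches the CRITICAL result seen through nrpe.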
I'm not sure if this aspect of the proc filesystem's behaviour can be
adjusted, or if it's desirable to adjust it. So, I would suggest one of:

1. Go with your approach of just checking the process listing.
2. Set up sudo or similar so Nagios can do the check.
3. Set up a scheduled job to do a check as root, and write the result
   status code and a line of output to a file somewhere. Then the Nagios
   check command can check that the status file was updated recently, and
   if so use that for its own response.

I'll probably go with option #2 or #3, but I haven't really looked into
how exactly I'm going to ascertain that heartbeat is up and running.
Possibly I'll use crm_mon -1 and check that the expected nodes are both
online, and set a warning status if either is offline (and critical if I
can't work out their status at all).

_______________________________________________
Linux-HA mailing list
Linux-HA[at]lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems