Comments on: Setting up fully redundant failover nagios servers

By: Jorge

Jorge — Mon, 05 Jun 2017 20:23:11 +0000

Worked great!!

Thank you =)

By: bish

bish — Wed, 01 Feb 2017 20:27:48 +0000

‘cat | sed ‘ ?

By: ilie dumitru

ilie dumitru — Mon, 09 Jan 2017 08:36:12 +0000

I did this some years ago, in the following way:

Nagios depends on httpd and mysql, so that is the key to the solution:

A virtual IP address is assigned to the master.
The mysql slave receives data from the master’s bin-log every 2 seconds (mysql replication).
A perl script watches the master to learn the state of its services (IP up, httpd, mysqld, etc.). If there are problems, it can connect to the slave, restart the services, bring up the virtual IP address, change my.cnf, and restart the mysql server as master. The Nagios services are restarted afterwards.
The perl script runs on a third server and connects using a public key declared in .ssh/authorized_keys.

The nrpe conf files of all monitored machines must accept the 2 addresses of the nagios servers (master and slave).

It works fine.
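
The NRPE requirement above can be checked on each monitored machine with a small script. This is only a minimal sketch, assuming the usual `allowed_hosts` directive in nrpe.cfg holds a comma-separated list of addresses; the function name and the example addresses are hypothetical, and network entries (CIDR ranges) are not interpreted.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every given address is listed
# literally in the allowed_hosts line of an nrpe.cfg-style file.
nrpe_allows_all() {
    cfg=$1; shift
    # Take the last allowed_hosts line, strip the key and any blanks.
    hosts=$(sed -n 's/^[[:space:]]*allowed_hosts[[:space:]]*=//p' "$cfg" \
        | tail -n 1 | tr -d ' \t')
    for addr in "$@"; do
        case ",$hosts," in
            *",$addr,"*) ;;                       # address is listed
            *) echo "missing: $addr"; return 1 ;; # address is absent
        esac
    done
    return 0
}

# Example (addresses are placeholders for the master and the slave):
#   nrpe_allows_all /etc/nagios/nrpe.cfg 10.0.0.1 10.0.0.2
```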

By: Praveen Diwakar

Praveen Diwakar — Wed, 24 Aug 2016 09:07:43 +0000

hi
I have been trying to set up an HA solution for quite a while but keep getting the following error.

I have configured the two nagios servers as instructed above. I have been able to sync the retention .dat file. My problem is that I am using it on a private network, meaning private IPs [192.168.x.x], and when I run the nagios.watchdog.sh.erb script it gives me a “host key verification error”.

I have removed the known hosts files but still get the same error. I am able to ssh to both servers without a password.

Please give me a solution for this!

Thanks

By: Pedro Albuquerque

Pedro Albuquerque — Fri, 14 Oct 2011 16:52:35 +0000

Hi,

I think “retain_state_information=1” only retains information before shutdown and does not sync every minute.
In Nagios Core 3.2.3, there is the variable “retention_update_interval”, which determines how often (in minutes) Nagios will automatically save retention data during normal operation.
Has anyone already figured out which variable should be configured to retain information every minute?

Cheers.

By: Michael Edwards

Michael Edwards — Thu, 28 Jul 2011 17:05:49 +0000

As of 3.2.3 there is a separate option called “retention_update_interval” in addition to the retain_state_information option mentioned in this blog posting.
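
To see which values are actually in effect, something like the sketch below can read both options back from nagios.cfg. It assumes plain key=value lines; the function name is hypothetical, and the path simply follows the /usr/local/nagios layout used elsewhere in this comment, so adjust it for your install.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a key=value directive from a
# Nagios main config file (this sketch reports the last occurrence).
nagios_cfg_get() {
    key=$1; cfg=$2
    sed -n "s/^[[:space:]]*$key[[:space:]]*=[[:space:]]*//p" "$cfg" | tail -n 1
}

# Example usage (path is an assumption):
#   nagios_cfg_get retain_state_information  /usr/local/nagios/etc/nagios.cfg
#   nagios_cfg_get retention_update_interval /usr/local/nagios/etc/nagios.cfg
# For a once-a-minute save one would expect the second call to print 1,
# since the interval is given in minutes.
```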

This is what I ended up with after making some changes for greater compatibility with my somewhat default nagios install as well as RHEL3/5 compatibility.

#!/bin/bash

# Executable variables. Useful.
RM="/bin/rm -f"
MV="/bin/mv"
ECHO="/bin/echo -e"
FQDN="/bin/hostname --fqdn"
FIXFILES="/sbin/fixfiles"
MAILER=/usr/sbin/sendmail
SUBJECT="URGENT: nagios master process switch has taken place."
RECIPIENT="isadmin@vtls.com"
SERVICE=/etc/init.d/nagios
RETENTIONFILE=/usr/local/nagios/var/retention.dat

# This is where we point the servers at each-other (configure this properly in your deployment!)
#This should be for the other server of the pair
MASTERHOST=10.0.0.2

# Ensure only one copy can run at a time
PIDFILE=/var/run/nagios-watchdog.pid
if [ -e "${PIDFILE}" ]; then
    exit 1
else
    touch "${PIDFILE}"
fi

# Checks the actual daemon status on the other host
#echo "su - nagios -c \"ssh ${MASTERHOST} '/etc/init.d/nagios status'\""
su - nagios -c "ssh ${MASTERHOST} /etc/init.d/nagios status"
#>/dev/null 2>&1

# Is the other host doing all the work?
if [ $? -eq 0 ]; then
    # Service running on MASTERHOST. Stop my service so there is only one.
    #echo "Nagios running on MASTERHOST"
    #echo " ${SERVICE} stop "
    ${SERVICE} stop >/dev/null 2>&1

    # Copy the retention data from the other nagios process
    #echo "su nagios -c \"scp ${MASTERHOST}:${RETENTIONFILE} /tmp/\""
    su - nagios -c "scp ${MASTERHOST}:${RETENTIONFILE} /tmp/"

    # Verify that we didn't get a missing, empty, or corrupted copy
    if [ -s /tmp/retention.dat ] && [ "$(grep -c "{" /tmp/retention.dat)" -eq "$(grep -c "}" /tmp/retention.dat)" ]; then
        ${MV} /tmp/retention.dat ${RETENTIONFILE}
    else
        ${RM} /tmp/retention.dat
    fi
    #${FIXFILES} restore /var/log/nagios
else
    # echo "Service not running on MASTERHOST"
    ${SERVICE} status >/dev/null 2>&1
    if [ $? -ne 0 ]; then
        # echo "Service not running here either. Sending notification."
        # The blank line (\n\n) separates the mail headers from the body.
        ${ECHO} "From: nagios-watchdog@$(hostname)\nSubject: ${SUBJECT}\nTo: ${RECIPIENT}\n\nNow running on host: $(hostname)" | ${MAILER} ${RECIPIENT}
        # echo "Starting nagios on localhost."
        ${SERVICE} start >/dev/null 2>&1
    fi
fi

${RM} ${PIDFILE}

exit 0;
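
One caveat with the PID-file check in the script above: it tests for the file and then creates it in two separate steps, and a run that is killed before the final rm leaves the file behind, so every later run exits immediately. A possible alternative, shown here only as a sketch, is an atomic `mkdir` lock that is removed by a trap. The lock path and function names are hypothetical, not part of the script above.

```shell
#!/bin/sh
# Hypothetical lock directory; mkdir either creates it or fails, atomically.
LOCKDIR=${LOCKDIR:-/tmp/nagios-watchdog.lock}

acquire_lock() {
    mkdir "$LOCKDIR" 2>/dev/null || return 1
    # Remove the lock whenever the script exits; common signals are
    # routed through exit so the EXIT trap also runs for them.
    trap 'rmdir "$LOCKDIR" 2>/dev/null' EXIT
    trap 'exit 1' HUP INT TERM
    return 0
}

release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
    trap - EXIT
}

# Typical use near the top of a watchdog script:
#   acquire_lock || exit 1
#   ... do the work ...
#   release_lock
```

A `kill -9` still leaves the directory behind, since no trap can run in that case.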

By: brose

brose — Tue, 03 May 2011 13:32:49 +0000

Aaron,
Thanks! Glad it was of use. I ended up going with the curl route as it did not require me to compile any custom SELinux modules. If you do the ps method, you will need to somehow enable httpd access to read the process table. You are correct, however – both ways will work well.

By: Aaron

Aaron — Mon, 02 May 2011 21:57:55 +0000

This guide was very helpful!

One addition I made was to check for running nagios by using exec ps -C nagios3 instead of the curl operation in your script. Both ways seem to work well.
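
For anyone who wants to try the ps approach from a shell first, a rough sketch is below. It assumes the procps `ps` found on Linux, where `-C` selects by command name and the exit status is non-zero when nothing matches; the `nagios3` process name is the one from my setup and may be plain `nagios` on yours.

```shell
#!/bin/sh
# Succeeds if at least one process with this exact command name exists.
# "-o pid=" prints only PIDs with no header; the output is discarded and
# only the exit status of ps is used.
is_running() {
    ps -C "$1" -o pid= >/dev/null 2>&1
}

if is_running nagios3; then
    echo "nagios3 is running"
else
    echo "nagios3 is not running"
fi
```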

By: brose

brose — Thu, 14 Apr 2011 20:50:47 +0000

Jason,

All of my testing and work has been with Nagios core, the free open-source version available from rhel/epel and the like. However, as long as Nagios XI still uses the retention.dat mechanism for restoring its exit state on startup, this should work. In fact, it may be even easier, as it sounds like you can point them both to the same database for config, and don’t need to worry about syncing with puppet or some such. I cannot provide any information on NDOutils or any plugins; it all depends on whether or not they store their retention information in retention.dat. There is no reason why my script cannot be modified to scp multiple files, though. For example, I have extended it in our environment to also copy over the logs/ directory. This syncs historical data, making end-of-year reports reliable.
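
A rough sketch of what copying several files could look like is below. It is not the script from our environment: the function names are hypothetical, it handles single files only (a directory such as logs/ would need `scp -r` or rsync), and it reuses the brace-count sanity check posted elsewhere in these comments for retention.dat.

```shell
#!/bin/sh
# A complete retention.dat should be non-empty and have as many lines
# containing "{" as lines containing "}".
braces_balanced() {
    [ -s "$1" ] || return 1
    open=$(grep -c '{' "$1" || true)
    close=$(grep -c '}' "$1" || true)
    [ "$open" -eq "$close" ]
}

# Fetch each remote path into a scratch directory, then move it into place.
# Only retention.dat gets the brace check; other files are moved as-is.
sync_from_master() {
    master=$1; shift
    scratch=$(mktemp -d) || return 1
    for path in "$@"; do
        tmp="$scratch/$(basename "$path")"
        scp -q "$master:$path" "$tmp" || continue
        case $path in
            */retention.dat)
                braces_balanced "$tmp" || { rm -f "$tmp"; continue; } ;;
        esac
        mv "$tmp" "$path"
    done
    rm -rf "$scratch"
}

# Example (host and paths are placeholders):
#   sync_from_master 10.0.0.2 /usr/local/nagios/var/retention.dat
```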

By: Jason

Jason — Thu, 14 Apr 2011 20:32:37 +0000

Is this just for Core, or will this work for Nagios XI using databases for configs and NDOutils?