At work, we use Icinga (a fork of Nagios) for monitoring. We have a few services that are restarted or otherwise poked by event handlers, but recovery takes a while, so we often get paged for problems that recover on their own within a few minutes. I wrote a small Perl script that greps through the archived log files for a given regex (service and/or host name), calculates the time from each problem to its recovery, and graphs those times.
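To make the idea concrete, here is a minimal Python sketch of the problem-to-recovery calculation. It is not the Perl script itself, just an illustration under a few assumptions: log lines use the standard `[epoch] SERVICE ALERT: host;service;state;...` format, lines arrive in time order, a problem starts at the first non-OK alert, and it ends at the next OK alert for the same host and service. The names `ALERT_RE` and `problem_intervals` are made up for this example.

```python
import re

# Standard Nagios/Icinga log line:
#   [epoch] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([^;]+);")


def problem_intervals(lines, match):
    """Return minutes from the first non-OK alert to the next OK alert.

    Assumption: only lines matching the `match` regex are considered,
    and `lines` is already sorted by timestamp.
    """
    pattern = re.compile(match)
    problem_start = {}  # (host, service) -> epoch of the first non-OK alert
    minutes = []
    for line in lines:
        m = ALERT_RE.match(line)
        if not m or not pattern.search(line):
            continue
        ts = int(m.group(1))
        key = (m.group(2), m.group(3))
        state = m.group(4)
        if state != "OK":
            # Keep the earliest problem timestamp; later alerts for the
            # same outage (e.g. SOFT then HARD) do not reset it.
            problem_start.setdefault(key, ts)
        elif key in problem_start:
            minutes.append((ts - problem_start.pop(key)) / 60)
    return minutes
```

Feeding it the lines of an archived log file with a pattern such as `myhost;HTTP` yields one duration per outage, which is the raw data for the graph.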

The script is called nagios_log_problem_interval.pl and can be downloaded from my GitHub. Below is some sample output, with the number of minutes from problem to recovery along the Y axis and the count along the X axis:

> nagios_log_problem_interval.pl --archivedir=/var/icinga/archive --match=myhost --backtrack=10 myhost;HTTP
Count
1:########(8)
2:##(2)
3:#(1)
4:##(2)
5:#######(7)
6:(0)
7:(0)
8:#(1)
9:(0)
10:(0)
11:#(1)
12:(0)
13:#(1)
14:(0)
15:(0)
16-29:(0)
30-59:(0)
60+:(0)
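The bucket layout in the output above (one row per minute up to 15, then 16-29, 30-59, and 60+) can be reproduced with a short Python sketch. This is an illustration rather than the script's actual code: the function names are hypothetical, and rounding each duration up to a whole minute is an assumption, since the output does not show how fractional minutes are handled.

```python
import math

# Row labels in the same order as the sample output.
LABELS = [str(i) for i in range(1, 16)] + ["16-29", "30-59", "60+"]


def bucket_label(minutes):
    """Map a duration in minutes to its histogram row label."""
    # Assumption: round up, and count anything under a minute as 1.
    n = max(1, math.ceil(minutes))
    if n <= 15:
        return str(n)
    if n < 30:
        return "16-29"
    if n < 60:
        return "30-59"
    return "60+"


def render_histogram(intervals):
    """Return one 'label:###(count)' line per bucket."""
    counts = {label: 0 for label in LABELS}
    for minutes in intervals:
        counts[bucket_label(minutes)] += 1
    return [f"{label}:{'#' * counts[label]}({counts[label]})" for label in LABELS]
```

Printing the result with `print("\n".join(render_histogram(durations)))` gives rows in the same shape as the sample output, where `durations` is any list of problem-to-recovery times in minutes.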

