Store availability data for hosts #1

Open
mohierf opened this issue Jul 9, 2015 · 10 comments

@mohierf
Contributor

mohierf commented Jul 9, 2015

NOTE: some fixes still need to be made. Do not use on production servers!


The module handles host_check_result broks to compute and store availability data for all known hosts on a daily basis.

For each day, a document is stored in the availability collection with the following fields:

  • hostname/service
  • day (YYYY-MM-DD) and day_ts (timestamp representing day at 00:00)
  • first received check state and timestamp
  • last received check state and timestamp
  • period for 0 state (UP)
  • period for 1 state (DOWN)
  • period for 2 state (UNREACHABLE)
  • period for 3 state (UNKNOWN)
  • period for 4 state (UNCHECKED)
  • whether the host has been in downtime: 0/1

The five stored periods always sum to 86400, the number of seconds in a day. The host is considered to be in an UNCHECKED period before the first received check and after the last received check.
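
The accounting described above can be sketched as follows. This is only an illustration, not the module's code: the function name `daily_periods`, the `(timestamp, state)` input format and the rule that each state lasts until the next received check are assumptions made for the example.

```python
DAY_SECONDS = 86400
UNCHECKED = 4  # state codes follow the list above: 0=UP, 1=DOWN, 2=UNREACHABLE, 3=UNKNOWN


def daily_periods(day_ts, checks):
    """Return seconds spent in each state (0..4) for one day.

    day_ts: timestamp of the day at 00:00.
    checks: (timestamp, state) tuples sorted by timestamp, all inside the day.
    Assumption: a state is held from its check until the next received check.
    """
    periods = {state: 0 for state in range(5)}
    if not checks:
        # No check received at all: the whole day is UNCHECKED.
        periods[UNCHECKED] = DAY_SECONDS
        return periods

    # Before the first received check, the host is UNCHECKED.
    periods[UNCHECKED] += checks[0][0] - day_ts

    # Each check's state lasts until the following check.
    for (ts, state), (next_ts, _next_state) in zip(checks, checks[1:]):
        periods[state] += next_ts - ts

    # After the last received check, the host is UNCHECKED again.
    periods[UNCHECKED] += day_ts + DAY_SECONDS - checks[-1][0]
    return periods


example = daily_periods(0, [(100, 0), (400, 1), (1000, 0)])
```

With this rule the five values always add up to 86400, whatever the number of checks received during the day.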

The Shinken WebUI uses this data collection to display availability information for each host (see shinken-monitoring/mod-webui#260).

@mohierf
Contributor Author

mohierf commented Jul 9, 2015

NOTE: some fixes still need to be made. Do not use on production servers!

@maethor

maethor commented Jul 22, 2015

To be sure I understand correctly: this collection is updated every time mongo-logs gets a new log for the hostname/service. So "period for UNCHECKED" is initialized to 86400, and decremented as the other values are incremented? Am I right?

So when we query the availability from the WebUI, we only compute percentages of 86400?

Is it computing availability for all services, or only for hosts?
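
To make the question concrete, here is a rough sketch of that reading: the UNCHECKED period starts at 86400, seconds are moved from it to the other states, and the WebUI side only divides by 86400. The field names and helper functions (`new_day`, `account`, `percentages`) are hypothetical and are not taken from the module.

```python
DAY_SECONDS = 86400
STATES = ("up", "down", "unreachable", "unknown", "unchecked")


def new_day():
    """Start a daily document: the whole day is UNCHECKED."""
    doc = {name: 0 for name in STATES}
    doc["unchecked"] = DAY_SECONDS
    return doc


def account(doc, state_name, seconds):
    """Move `seconds` from the UNCHECKED period to the given state."""
    doc[state_name] += seconds
    doc["unchecked"] -= seconds


def percentages(doc):
    """Express each period as a percentage of the day."""
    return {name: 100.0 * doc[name] / DAY_SECONDS for name in STATES}


doc = new_day()
account(doc, "up", 43200)    # 12 hours UP
account(doc, "down", 21600)  # 6 hours DOWN
pct = percentages(doc)
```

Here `pct` gives 50% UP, 25% DOWN and 25% UNCHECKED, and the stored periods still sum to 86400.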

@mohierf
Contributor Author

mohierf commented Jul 22, 2015

You are right ... it is almost real-time information :-)

At the moment, I have only implemented host checks, but it will be really simple to do the same for all services.

I noticed some problems with this simple strategy:

  • you do not always get 100% of 86400 seconds, because the first and last checks of the day are not received at 00:00 and 24:00 ... so you lose a few seconds every day!

  • you cannot get availability information for periods shorter than a day

I have some ideas to cope with the first problem ... but I am not yet sure which strategy is best ... to be discussed! @maethor

@maethor

maethor commented Jul 22, 2015

I plan to review the entire source code of your plugin (to remove some if len(list) > 0:, for example :D), so in a few hours I will be happy to bring you some suggestions on the strategy :)

Availability for short periods is quite hard. In fact, the best strategy for this kind of thing is the one used by perfdata databases: keep precise information for the last few hours, then aggregate it more and more as time goes by. This is nice because we don't have to set any limit, and we can be sure that the database size will not explode. On the other hand, it can make the implementation a lot more complex.
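
As a rough illustration of that kind of progressive aggregation (not the idea hinted at in this thread, which is not described here), older daily documents could be summed into weekly buckets while recent ones are kept as they are. The document layout (`day_ts`, `periods`), the function name `rollup_weekly` and the 30-day threshold are assumptions; the weekly buckets are aligned on multiples of 7 days since the epoch for simplicity.

```python
from collections import defaultdict

DAY_SECONDS = 86400
WEEK_SECONDS = 7 * DAY_SECONDS


def rollup_weekly(daily_docs, now_ts, keep_days=30):
    """Keep recent daily documents and sum older ones per week.

    daily_docs: dicts like {"day_ts": int, "periods": {state: seconds}}.
    Returns (recent_daily_docs, {week_ts: {state: seconds}}).
    """
    cutoff = now_ts - keep_days * DAY_SECONDS
    recent = []
    weekly = defaultdict(lambda: defaultdict(int))
    for doc in daily_docs:
        if doc["day_ts"] >= cutoff:
            # Recent enough: keep the daily precision.
            recent.append(doc)
            continue
        # Older: add the day's periods to its weekly bucket.
        week_ts = doc["day_ts"] - doc["day_ts"] % WEEK_SECONDS
        for state, seconds in doc["periods"].items():
            weekly[week_ts][state] += seconds
    return recent, {week: dict(periods) for week, periods in weekly.items()}


docs = [
    {"day_ts": 0, "periods": {"up": 86000, "unchecked": 400}},
    {"day_ts": DAY_SECONDS, "periods": {"up": 86000, "unchecked": 400}},
    {"day_ts": 40 * DAY_SECONDS, "periods": {"up": 86400}},
]
recent, weekly = rollup_weekly(docs, now_ts=41 * DAY_SECONDS)
```

The same step could be repeated with larger buckets (months, years) so that the collection size stays bounded.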

But I think I already have an idea to do this… :)

@mohierf
Contributor Author

mohierf commented Jul 22, 2015

Feel free to restart from scratch ... I simply made a mock-up to validate an idea, which was to compute on the fly instead of parsing a big logs table in a database :-)

@maethor

maethor commented Jul 22, 2015

There is no need to restart from scratch. Your proof of concept is great :)

@bittrance

What is the status of this feature? Building from the latest source, I see that there is still no service-based availability in my mongo log. I am somewhat keen on implementing this. @mohierf, @maethor: any ideas/thoughts you want to share?

@mohierf
Contributor Author

mohierf commented Aug 24, 2016

@bittrance: as far as I remember (it's been quite a long time ...), you should have information for both hosts and services.

The module logs some information in brokerd.log on start to report what it will manage. There are also some configuration parameters to include/exclude some services from the recording ... perhaps something to configure in your environment?

I left this issue open because @maethor had an idea for rewriting part of the code.

@bittrance

bittrance commented Aug 25, 2016

Indeed. Explicitly setting a services_filter resolves the issue. The text in the module config file says "default is to consider only the services which business impact is > 4". However, since services_filter is commented out in the default config, https://github.com/shinken-monitoring/mod-mongo-logs/blob/master/module/module.py#L154 will actually leave filter_service_criticality unset, which means https://github.com/shinken-monitoring/mod-mongo-logs/blob/master/module/module.py#L373 will be bypassed. Which is right? Should the default be services_filter = getattr(mod_conf, 'services_filter', 'bi:>=4'), or should the docs in the config file change?

@mohierf
Contributor Author

mohierf commented Aug 25, 2016

Because services_filter is commented out, it takes the default value defined in the source code, which is ... an empty string :(

You are right, we should change the doc in the configuration file!
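
For readers following along, the behaviour discussed above comes down to how getattr falls back to its third argument when the attribute is missing. The sketch below uses a hypothetical `ModConf` class as a stand-in for the parsed module configuration and a hypothetical helper `read_services_filter`; it does not reproduce the module's actual code.

```python
class ModConf:
    """Hypothetical stand-in for the parsed module configuration.

    A parameter that is commented out in the config file is simply
    not present as an attribute.
    """


def read_services_filter(mod_conf, default=""):
    """Return the configured services_filter, or `default` when it is absent."""
    return getattr(mod_conf, "services_filter", default)


conf = ModConf()

# Empty-string default, as described above: no filter is applied.
current = read_services_filter(conf)

# Default proposed earlier in this thread.
proposed = read_services_filter(conf, "bi:>=4")

# An explicitly configured value always wins over the default.
conf.services_filter = "bi:>=2"
explicit = read_services_filter(conf, "bi:>=4")
```

With the empty-string default, `current` is falsy, so any code that only applies the filter when a value is set will skip it, which matches what was observed with the default config.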
