A year of using Grafana

After a year of use, our Grafana + Graphite + Collectd monitoring system has proved to be a useful and flexible solution.

Instead of Grafana alerting by e-mail, we started using a dashboard of Vonage Status Panels that is displayed permanently in our office. We mainly monitor servers used for development and testing, so it is acceptable if we don’t notice an alert immediately. The dashboard is also very good for diagnosing problems on the servers.

One problem is that collectd suspends plugins that fail with an error, and the interval before retrying is doubled every time, up to 24 hours! Custom plugins (Exec, Python…) can handle all possible errors themselves, but this is a real problem for the built-in database plugins configured to retrieve data from a remote database. If the remote database (or its host) is down for a longer time, the plugin keeps failing, and once the database is running again it can take a long time before its metrics start being collected again.
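To get a feel for how quickly the retry gap grows, here is a minimal Python sketch of the doubling rule described above. It is only a model of the behaviour, not collectd's code; the 10-second base interval is an assumption, and the 24-hour ceiling is the one mentioned above.

```python
def retry_intervals(base_interval, max_interval, failures):
    """Return the wait (in seconds) before each retry after consecutive failures."""
    intervals = []
    current = base_interval
    for _ in range(failures):
        # Each failed read doubles the interval, capped at the maximum.
        current = min(current * 2, max_interval)
        intervals.append(current)
    return intervals


# Assumed values: a 10 s base interval and the 24 h ceiling mentioned above.
BASE = 10
CAP = 24 * 60 * 60

schedule = retry_intervals(BASE, CAP, 16)
for attempt, wait in enumerate(schedule, start=1):
    print(f"after failure {attempt:2d}: next retry in {wait} s")

# Number of consecutive failures needed to reach the 24 h ceiling.
failures_to_cap = schedule.index(CAP) + 1
print("ceiling reached after", failures_to_cap, "failures")
```

Under these assumptions a database that is down for a day or two pushes the plugin to the ceiling. If your collectd version provides the global MaxReadInterval option, lowering it should shorten that ceiling; check collectd.conf(5) before relying on it.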

It was also not clear at first how to set up a status panel. It often receives a null value for the most recent data point. To work around this, the panel must be set to aggregate a few more data points with the option Time range > Override relative time, and an aggregation type other than “Last” must be selected in Options > Aggregation for all metrics.
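The effect of the aggregation choice can be illustrated with a small Python sketch. It does not reproduce the panel plugin's code; the function name `aggregate` and the way nulls are skipped are assumptions made for illustration.

```python
def aggregate(values, how):
    """Aggregate one panel window; None stands for a null data point."""
    if how == "last":
        # Takes the newest point as-is, even when it is still null.
        return values[-1] if values else None
    # The other aggregations only look at points that actually have a value.
    present = [v for v in values if v is not None]
    if not present:
        return None
    if how == "max":
        return max(present)
    if how == "min":
        return min(present)
    if how == "avg":
        return sum(present) / len(present)
    raise ValueError(f"unknown aggregation: {how}")


# A short window where the newest point has not been written yet.
window = [0.42, 0.40, 0.45, None]
print("last:", aggregate(window, "last"))
print("max: ", aggregate(window, "max"))
```

With a window of a few points, "max" or "avg" still yields a usable value while "last" returns the null, which is why widening the time range and changing the aggregation go together.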

Status panels

Statistics of our dashboard:

  • 24 servers monitored with Vonage Status Panels
  • 154 metrics in total
  • 6 metrics monitored per server on average
  • A few metrics displayed all the time (load average, uptime…)
  • Other metrics only used for panel alerts

Monitored metrics

Some metrics are monitored on all servers, but others are specific to the server type:

Almost all servers

  • Server is running (metrics are collected)
  • Disk usage
  • 15 min load average
  • Uptime

XiVO

  • Asterisk uptime
  • Active Calls 15min Average
  • Calls/Min 15min Average

These statistics are retrieved from Asterisk by a collectd Exec script.
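As a rough idea of what such a script can look like, here is a hedged Python sketch. It is not our actual script. The sample text imitates the summary lines of the Asterisk CLI command `core show channels`, and the helper names, the `exec-asterisk` plugin instance and the `active_calls` metric name are hypothetical. The `PUTVAL` line format and the `COLLECTD_HOSTNAME` / `COLLECTD_INTERVAL` environment variables come from the collectd Exec plugin.

```python
import os
import re


def parse_active_calls(cli_output):
    """Extract the number from a line such as '2 active calls'."""
    match = re.search(r"^(\d+) active calls?", cli_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def format_putval(host, interval, name, value):
    """Build one PUTVAL line; 'N' tells collectd to use the current time."""
    return f'PUTVAL "{host}/exec-asterisk/gauge-{name}" interval={interval} N:{value}'


# Sample text standing in for the CLI output; a real script would obtain it
# from Asterisk (for example via `asterisk -rx "core show channels"`).
SAMPLE = """\
Channel              Location             State   Application(Data)
3 active channels
2 active calls
157 calls processed
"""

# collectd passes these variables to programs started by the Exec plugin.
host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
interval = os.environ.get("COLLECTD_INTERVAL", "10")

calls = parse_active_calls(SAMPLE)
if calls is not None:
    # The Exec plugin reads values from the script's standard output.
    print(format_putval(host, interval, "active_calls", calls), flush=True)
```

Skipping the output when nothing could be parsed, instead of exiting with an error, is one way for a custom script to handle failures itself.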

XiVO CC

  • Time of XUC server connection to Asterisk
  • Data can be queried from Elasticsearch
  • Calls are replicated to Elasticsearch database (test xivo-db-replication)
  • Real-time logs are replicated (test xivo-db-replication)
  • Periodic stats are calculated (test xivo-full-stats)
  • Specific stats are calculated (test pack-reporting)

The databases are checked by the collectd postgresql and elasticsearch plugins.

Integration and stress tests

Simulations and various tests run on our Demo, Load test and Dev platforms.
Status panels and tables are useful for checking that:

  • Simulations and tests are running
  • Tests are passing

Physical servers

  • Temperature

Other

These panel alerts were added to resolve temporary issues on some servers, but we have kept them:

  • CPU overload
  • Disk I/O time in percent
  • Network traffic

Status panels

Server monitoring

The status panel label and the panel alerts can contain a hyperlink that opens a dashboard with detailed graphs for the server.

This dashboard uses a variable with all server names and a drop-down menu to choose a server. For the selected server it displays these graphs:

  • CPU load
  • Memory usage
  • Disk usage
  • Disk I/O
  • Swap usage
  • Swap I/O
  • Network traffic
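The idea of one dashboard driven by a server variable can be sketched in Python as a set of Graphite target templates filled in with the selected server name. All names here are hypothetical: the `collectd.` prefix, the metric paths, the server name and the Graphite base URL depend on the local configuration. Only the `/render` endpoint with its `target`, `from` and `format` parameters is part of Graphite itself.

```python
from urllib.parse import urlencode

# Hypothetical metric paths; the real ones depend on the collectd and
# Graphite configuration.
PANEL_TARGETS = {
    "CPU load": "collectd.{server}.load.load.*",
    "Memory usage": "collectd.{server}.memory.memory-*",
    "Swap usage": "collectd.{server}.swap.swap-*",
    "Network traffic": "collectd.{server}.interface-*.if_octets.*",
}


def render_url(base_url, server, panel, time_from="-6h"):
    """Build a Graphite render URL for one panel of the selected server."""
    target = PANEL_TARGETS[panel].format(server=server)
    query = urlencode({"target": target, "from": time_from, "format": "json"})
    return f"{base_url}/render?{query}"


# The server name plays the role of the dashboard variable.
for panel in PANEL_TARGETS:
    print(panel, "->", render_url("http://graphite.example", "xivo-dev-1", panel))
```

In Grafana the same substitution is done by the dashboard variable inside each panel's query, so a single dashboard serves every server.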

Specific dashboards

We also have similar dashboards for Asterisk, XUC and test results, which display graphs and tables with metrics specific to the type of server or test. Some alerts on the status panels open these dashboards when clicked:

Asterisk monitoring

XUC monitoring

High-level test results

Gatling test results
