MONITOR NGINX WITH TELEGRAF, INFLUXDB, AND GRAFANA

I assume you already have some idea about Telegraf, InfluxDB & Grafana. In this article, I am only going to explain how to build a graph of request counts per response code, so that you can create an alert when a lot of requests are failing.

Disclaimer: I am not an expert in Telegraf, InfluxDB & Grafana. You can think of this implementation as a small hack to tackle the problem.

For proper monitoring, you can ignore this article, add the following entries to the telegraf conf, and give telegraf the proper permissions for the nginx access log. That’s it. You are good to go.

[[inputs.nginx]]
     urls = ["http://localhost/nginx_status"]
[[inputs.logparser]]
  files = ["/var/log/nginx/access.log"]
  from_beginning = true
  name_override = "nginx_access_log"
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT}"]
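Note that the inputs.nginx plugin expects nginx’s stub_status endpoint to be enabled at the URL configured above. If it isn’t already, a server block roughly like this exposes it (a sketch; the listen port and allowed address depend on your setup):

```
server {
    listen 80;
    # Expose basic connection/request counters for the telegraf nginx input.
    location /nginx_status {
        stub_status;       # on nginx < 1.7.5, use "stub_status on;"
        allow 127.0.0.1;   # only telegraf on this host should read it
        deny all;
    }
}
```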

The problem I faced with the above approach is that telegraf was consuming almost one full CPU core. The application I am currently working on processes 10,000 requests per minute and pushes them to Kafka, and it consumes only half a CPU core. So it doesn’t make sense to me to give that much CPU to telegraf, which is only there for monitoring. But at the same time, I do want monitoring, and I want alerts when requests are failing.

Then I thought: why shouldn’t I compute the counts myself and push them to InfluxDB? Based on those, I can set up some alerts. I used the command below to get request counts per response code for requests processed in the last minute:

tail -30000 /var/log/nginx/access.log | awk -v date=$(date -d '1 minutes ago' +"%d/%b/%Y:%H:%M") '$4 ~ date' | cut -d '"' -f3 | cut -d ' ' -f2 | sort | uniq -c

This command will give output like this:

6540 200
4    301
28   304
11   400
6    404
2    408
62   499
51   504
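To make that pipeline easier to follow, here is the same extraction broken down step by step against a few hypothetical log lines in the default combined format (made-up IPs and paths), minus the date filter:

```shell
#!/bin/sh
# Hypothetical sample lines in nginx "combined" log format.
cat > /tmp/sample_access.log <<'EOF'
127.0.0.1 - - [01/Jan/2024:10:00:01 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.68.0"
127.0.0.1 - - [01/Jan/2024:10:00:02 +0000] "GET /missing HTTP/1.1" 404 162 "-" "curl/7.68.0"
127.0.0.1 - - [01/Jan/2024:10:00:03 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.68.0"
EOF

# Same stages as the command above:
cut -d '"' -f3 /tmp/sample_access.log |  # text between 2nd and 3rd quote: ' 200 612 '
  cut -d ' ' -f2 |                       # second space-separated field: the status code
  sort | uniq -c                         # count occurrences of each distinct code
```

Running this prints a count per distinct status code (here, two 200s and one 404), in the same shape as the real output shown above.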

Now the problem was how to insert/push this into InfluxDB. Then I found the Exec input plugin. But the next problem was how to convert this output into a format accepted by telegraf.
So I extended the above command a little bit and wrote a small shell script:

#!/bin/sh
# Emit a JSON array of {resp_code, count} objects for the last minute of the access log.
echo "["
tail -30000 /var/log/nginx/access.log | awk -v date="$(date -d '1 minute ago' +'%d/%b/%Y:%H:%M')" '$4 ~ date' | cut -d '"' -f3 | cut -d ' ' -f2 | sort | uniq -c | awk 'NR > 1 { printf(",\n") } { printf "{\"resp_code\":%s,\"count\":%s}", $2, $1 }'
printf "\n"
echo "]"

This gives output like this:

[
{"resp_code":200,"count":6540},
{"resp_code":301,"count":4},
{"resp_code":304,"count":28},
{"resp_code":400,"count":11},
{"resp_code":404,"count":6},
{"resp_code":408,"count":2},
{"resp_code":499,"count":62},
{"resp_code":504,"count":51}
]

I made the following entry in telegraf.conf to capture this:

[[inputs.exec]]
  commands = [
   "sh /home/telegraf/nginx_stats.sh"
  ]
  timeout = "5s"
  name_override = "nginx_stats"
  data_format = "json"
  tag_keys = [
    "resp_code"
  ]
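With data_format = "json" and resp_code listed under tag_keys, telegraf turns each element of the JSON array into one point, roughly like this in InfluxDB line protocol (the host tag and timestamp here are illustrative):

```
nginx_stats,host=myhost,resp_code=200 count=6540 1528094460000000000
```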

Done. Now, in InfluxDB, you will get your per-minute counts in the nginx_stats measurement. In this process, I hit an nginx access log permission issue, but that was easily solved by the answer provided in this link. I also found one more interesting piece of software: ngxtop. Check out ngxtop when you get time. It’s a good one.
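Coming back to the original goal of alerting when many requests fail: once the counts are in InfluxDB, a Grafana alert can sit on top of a query roughly like this (a sketch in InfluxQL; the regex matches the 5xx values of the resp_code tag written above):

```
SELECT sum("count") FROM "nginx_stats"
  WHERE "resp_code" =~ /^5\d\d$/ AND time > now() - 5m
  GROUP BY time(1m)
```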
