I assume you have some familiarity with Telegraf, InfluxDB and Grafana. In this article, I explain only how to collect request counts per response code, so that you can graph them and raise an alert when a lot of requests are failing.
Disclaimer: I am not an expert in Telegraf, InfluxDB or Grafana. You can think of this implementation as a small hack to tackle the problem.
For proper monitoring, you can ignore this article: add the following entries to your Telegraf conf and give Telegraf read permission on the nginx access log. That's it. You are good to go.
[[inputs.nginx]]
  urls = ["http://localhost/nginx_status"]
[[inputs.logparser]]
  files = ["/var/log/nginx/access.log"]
  from_beginning = true
  name_override = "nginx_access_log"
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT}"]
The problem I faced with the above approach is that Telegraf consumed almost one full CPU core. The application I am working on right now processes 10,000 requests a minute and pushes them to Kafka, and it consumes only half of one CPU core. So it doesn't make sense for me to give that much CPU to Telegraf, which is there only for monitoring. At the same time, I still want monitoring, and I want alerts when requests are failing.
Then I thought: why not push just the counts to InfluxDB? I can build alerts on those. I used the command below to get the request counts per response code for the previous minute:
tail -30000 /var/log/nginx/access.log | awk -v date=$(date -d '1 minutes ago' +"%d/%b/%Y:%H:%M") '$4 ~ date' | cut -d '"' -f3 | cut -d ' ' -f2 | sort | uniq -c
This command gives output like this:
6540 200
4 301
28 304
11 400
6 404
2 408
62 499
51 504
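The same count can also be sketched as a single awk pass, which avoids the chain of cut, sort and uniq processes. This is only an illustration, not the command used above: the function name count_status_codes is hypothetical, and the minute is passed in as an argument so the function can be tried on sample lines. Like the cut version, it takes the status code from the text that follows the quoted request line.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original article): count response
# codes for one minute of a combined-format access log read from stdin.
# $1 is the minute prefix as it appears in the log, e.g. 10/Oct/2023:13:55
count_status_codes() {
    awk -v minute="$1" '
        # $4 looks like [10/Oct/2023:13:55:36 so the minute starts at position 2
        index($4, minute) == 2 {
            split($0, quoted, "\"")       # quoted[3] is the text after the request, e.g. " 200 612 "
            split(quoted[3], rest, " ")   # rest[1] is the status code
            counts[rest[1]]++
        }
        END { for (code in counts) print counts[code], code }
    ' | sort -k2,2n                       # order by status code, like sort | uniq -c
}

# Usage against the live log (assumes GNU date, as in the command above):
# tail -30000 /var/log/nginx/access.log | count_status_codes "$(date -d '1 minutes ago' +'%d/%b/%Y:%H:%M')"
```

The output has the same "count code" shape as the uniq -c output, without the leading padding.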
Now the problem is how to push this to InfluxDB. I found a way: Telegraf's Exec input plugin. But then the output has to be converted to a format that Telegraf accepts.
So I extended the above command a little bit and wrote a small shell script:
#!/bin/sh
# Print the previous minute's per-status request counts as a JSON array.
echo "["
tail -30000 /var/log/nginx/access.log | awk -v date="$(date -d '1 minutes ago' +"%d/%b/%Y:%H:%M")" '$4 ~ date' | cut -d '"' -f3 | cut -d ' ' -f2 | sort | uniq -c | awk 'NR > 1 { printf(",\n") } {printf "{\"resp_code\":%s,\"count\":%s}", $2,$1}'
printf "\n"
echo "]"
This gives output like this
[
{"resp_code":200,"count":6540},
{"resp_code":301,"count":4},
{"resp_code":304,"count":28},
{"resp_code":400,"count":11},
{"resp_code":404,"count":6},
{"resp_code":408,"count":2},
{"resp_code":499,"count":62},
{"resp_code":504,"count":51}
]
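To try the formatting step without a live access log, the JSON part of the script can be pulled into a function and fed sample "count code" lines. This is a minimal sketch: the function name to_json_array is hypothetical, and the awk program is the same one the script uses.

```shell
#!/bin/sh
# Hypothetical helper: turn "count code" lines on stdin (the shape that
# uniq -c produces) into the JSON array the script prints.
to_json_array() {
    echo "["
    # Print a comma and newline before every object except the first one.
    awk 'NR > 1 { printf(",\n") } { printf "{\"resp_code\":%s,\"count\":%s}", $2, $1 }'
    printf "\n"
    echo "]"
}

# Usage with sample data:
# printf '6540 200\n4 301\n' | to_json_array
```

With no input lines at all, the function still prints an opening and a closing bracket, so a quiet minute yields an empty JSON array.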
I added the following entry to telegraf.conf to capture this:
[[inputs.exec]]
  commands = [ "sh /home/telegraf/nginx_stats.sh" ]
  timeout = "5s"
  name_override = "nginx_stats"
  data_format = "json"
  tag_keys = [ "resp_code" ]
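As an alternative to JSON, the Exec plugin also accepts InfluxDB line protocol when data_format is set to "influx". I did not use that here, so treat the following as a sketch under that assumption: the function name to_line_protocol is hypothetical, and the measurement name nginx_stats is written directly into each line, with resp_code as a tag and count as an integer field.

```shell
#!/bin/sh
# Hypothetical alternative: turn "count code" lines on stdin into InfluxDB
# line protocol, one point per response code. The trailing "i" marks the
# count as an integer field.
to_line_protocol() {
    awk '{ printf "nginx_stats,resp_code=%s count=%si\n", $2, $1 }'
}

# Usage with sample data:
# printf '6540 200\n4 301\n' | to_line_protocol
```

With this output the tag is already part of each line, so the tag_keys setting would not be needed.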
Done. Now the nginx_stats measurement in InfluxDB receives the counts every minute. Along the way I hit a permission issue on the nginx access log, but that was easily solved by the answer provided in this link. I also came across another interesting tool, ngxtop. Check out ngxtop when you get time. Good one.