Maybe my requirement is a little different from others', which is why I didn't get much help from the tools/procedures already available. I added 2 more alerts using the same kind of hack I explained in my previous article.
The application I am working on uses a 3-node Cassandra cluster as its database. Sometimes one of the nodes goes down or drops out of the cluster for some unknown reason, but we don't get to know about it because the other 2 Cassandra nodes keep our application running. We still need to work out why one of the 3 nodes leaves the cluster, but until we find the root cause we need to make sure all 3 nodes stay in the cluster. That's why we need an alert around this.
You might be wondering how we could not know; the application should become slow if one of our nodes goes down or leaves the cluster. Thankfully we never end up in that scenario, thanks to our back-pressure implementation. Our application basically takes a request, does its operations, and pushes the result to Kafka instead of inserting/updating directly in Cassandra. Our Kafka consumers pick these up and insert/update them in Cassandra. Because of this we are safe, but we might run into another problem: there might be a lag in processing these requests. So we need to add alerts for that too.
Basically, I need to add alerts around Cassandra to know if any of our nodes went down or left the cluster, and around Kafka to know whether there is any lag in processing the queue. The other metrics around Cassandra and Kafka are already in our monitoring tool.
As in my previous article, I added 2 more scripts to do the job. For Cassandra data collection I used the script below:
# Emit nodetool status as a JSON array so the Telegraf exec input can parse it.
echo [
nodetool status | grep 'GB' | awk 'NR > 1 { printf(",\n") } {printf "{\"node_status\":\"%s\",\"ip\":\"%s\",\"load\":%s,\"host_id\":\"%s\",\"rack\":\"%s\"}", $1,$2,$3,$7,$8}'
printf "\n"
echo ]
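For reference, this is roughly what a manual run of the script prints on a healthy 3-node cluster. Every value below (IPs, host IDs, loads) is made up purely for illustration; also note that the grep 'GB' filter assumes nodetool reports the load in GB.

# Hypothetical manual run; all values are made up for illustration.
sh /home/telegraf/cassandra_node_stats.sh
# [
# {"node_status":"UN","ip":"10.0.1.11","load":101.52,"host_id":"7f3d-example","rack":"rack1"},
# {"node_status":"UN","ip":"10.0.1.12","load":98.07,"host_id":"9a21-example","rack":"rack1"},
# {"node_status":"UN","ip":"10.0.1.13","load":99.83,"host_id":"c4b8-example","rack":"rack1"}
# ]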
I added the following entry in telegraf.conf to capture the data:
[[inputs.exec]]
  commands = [ "sh /home/telegraf/cassandra_node_stats.sh" ]
  timeout = "30s"
  name_override = "cassandra_node_stats"
  data_format = "json"
  tag_keys = [ "ip", "node_status", "host_id", "rack" ]
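Before adding the alert it is worth checking that Telegraf can actually run the script and parse the JSON. Assuming the default config path (adjust to your setup), a quick sanity check looks like this:

# Run only the exec inputs once and print the parsed metrics to stdout
# instead of writing them to the outputs. The config path is an assumption.
telegraf --config /etc/telegraf/telegraf.conf --input-filter exec --test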
The important thing to observe in the above configuration is the timeout, because most of the time the nodetool status command takes more than 10 seconds to respond. To be on the safer side I added 20 seconds more; maybe I am too desperate to capture these stats. Now it's time to add the alert.
I used 11 as the threshold because sometimes the metrics arrive a little slow, and that was triggering unnecessary alerts. Even if the data comes in a bit late the count still works out to around 12, and if one of the nodes goes down I get only 10 points in 5 minutes, right? To accommodate both cases I used 11 as the threshold.
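The alert itself lives in our monitoring tool, so it isn't reproduced here, but conceptually it just counts the points that arrived in a 5-minute window and fires when the count drops below 11. Assuming the metrics end up in InfluxDB 1.x (the usual destination for Telegraf; the database name telegraf is also an assumption), the query behind such an alert would be something along these lines:

# Count cassandra_node_stats points reporting UN (up/normal) per node over the last 5 minutes;
# the alert should fire when this count drops below the threshold of 11.
influx -database telegraf -execute "SELECT count(load) FROM cassandra_node_stats WHERE node_status = 'UN' AND time > now() - 5m GROUP BY ip"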
For Kafka data collection I used the script below:
# Emit consumer-group lag as a JSON array for the Telegraf exec input.
# The consumer group name is passed in as the first argument.
group_name=$1
echo [
/data/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group $group_name --describe | grep kafka | awk -v groupName="$group_name" 'NR > 1 { printf(",\n") } {printf "{\"topic\":\"%s\",\"group_name\":\"%s\",\"partition\":%s,\"current_offset\":%s,\"log_end_offset\":%s,\"lag\":%s,\"consumer_id\":\"%s\"}", $1,groupName,$2,$3,$4,$5,$6}'
printf "\n"
echo ]
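As with the Cassandra script, here is roughly what a manual run produces. The topic name, offsets and consumer ids are made up for illustration:

# Hypothetical manual run for one group; all values are made up for illustration.
sh /home/telegraf/kafka_lag_stats.sh group1
# [
# {"topic":"kafka_requests","group_name":"group1","partition":0,"current_offset":10452,"log_end_offset":10460,"lag":8,"consumer_id":"consumer-1-example"},
# {"topic":"kafka_requests","group_name":"group1","partition":1,"current_offset":10330,"log_end_offset":10330,"lag":0,"consumer_id":"consumer-2-example"}
# ]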
We have multiple consumer groups, so the group name is taken as an argument. It is also passed into awk as a variable so the group name ends up in the data; otherwise we would never know which group a value belongs to when setting up the alerts. I added the following entry in telegraf.conf to capture the data:
[[inputs.exec]]
  commands = [
    "sh /home/telegraf/kafka_lag_stats.sh group1",
    "sh /home/telegraf/kafka_lag_stats.sh group2",
    "sh /home/telegraf/kafka_lag_stats.sh group3"
  ]
  timeout = "30s"
  name_override = "kafka_lag_stats"
  data_format = "json"
  tag_keys = [ "topic", "group_name", "consumer_id", "partition" ]
The same reasoning about the timeout applies here as well. Now it's time to add the alert.
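Again, the alert is defined in the monitoring tool rather than shown here; the idea is simply to watch the lag field per topic and consumer group. Assuming the same InfluxDB 1.x backend as above, a sketch of the query behind it would look like this:

# Maximum consumer lag per topic and group over the last 5 minutes; the alert fires
# when this goes above whatever lag threshold makes sense for the workload.
influx -database telegraf -execute "SELECT max(lag) FROM kafka_lag_stats WHERE time > now() - 5m GROUP BY topic, group_name"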
You might have already guessed it: yes, one of my consumers has a big lag. We are working on it and will fix it soon. I hope this article helps you.
May the force be with you.