Prometheus Alerts for Missing Metrics and Labels
Prometheus gives good insight when metrics are scraped and measured. It gets tricky when no metrics are being collected, or only part of them are. There are multiple ways to detect missing metrics and alert on them; we will cover five of them below.
1. Scrape Endpoint Down
Prometheus exposes an up metric for every scrape target, which covers most exporters out of the box. If you are writing a custom exporter, make sure an up-style health metric is available for whatever it monitors.
up == 0
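A minimal alerting rule built on this expression could look like the sketch below; the alert name, for duration, and severity label are placeholders, adjust them to your setup.

groups:
  - name: availability
    rules:
      # Fires when a scrape target has been unreachable for 5 minutes.
      - alert: ScrapeEndpointDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"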
2. Metrics Absent
There are cases where the scrape endpoint is accessible and returns metrics, but a specific metric is missing, for example because of a version change or because no data is available for that metric.
absent(job_renew_counter) == 1
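Note that absent() already returns the value 1 when the metric is missing (and no result at all when it is present), so the == 1 comparison is optional. A sketch of the corresponding alerting rule, with a placeholder name, duration, and severity:

groups:
  - name: missing-metrics
    rules:
      # absent() yields a single series with value 1 while job_renew_counter is missing.
      - alert: JobRenewCounterAbsent
        expr: absent(job_renew_counter)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "job_renew_counter has not been seen for 10 minutes"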
3. Partial Metrics Absent
To catch a RabbitMQ queue being removed automatically (and its metric disappearing) without notice, we can name that queue explicitly and monitor it. A more dynamic alternative is described in point 5.
absent(rabbitmq_queue_messages{env="prod",queue="ott.contentId.post.queue"}) == 1
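If several known queues need the same protection, the absent() checks can be chained with or in a single rule. The sketch below uses a hypothetical second queue name (ott.contentId.delete.queue) purely for illustration; alert name, duration, and severity are placeholders.

groups:
  - name: rabbitmq-queues
    rules:
      - alert: RabbitMQQueueMetricMissing
        # Each absent() keeps its equality-matcher labels, so $labels.queue identifies the missing queue.
        expr: >
          absent(rabbitmq_queue_messages{env="prod",queue="ott.contentId.post.queue"})
          or absent(rabbitmq_queue_messages{env="prod",queue="ott.contentId.delete.queue"})
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ metric missing for queue {{ $labels.queue }} in prod"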
4. Partial Metrics Absent: Generic Lag
Prometheus does not provide any lag-related alert when metric data stops being captured; however, we can use a recording rule to generate one.
In the example below, we observed that the Google Stackdriver exporter selectively skips the "topic send request count" metric for some topics. This is a problem because we want an alert whenever a topic's send request count drops below 5, and absent() does not work here: it only triggers when the metric is missing for all topics.
We first record the scrape time for each series of the metric:
groups:
  - name: stackdriver-pubsub   # group name is arbitrary
    rules:
      - record: stackdriver_pubsub:scraptime
        expr: timestamp(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count)
Alert Rule:
Take the maximum of the recorded scrape time over the last 5 hours (a window long enough to cover the period in which data may have gone missing), and also check whether the recent values are below 5. The expression below raises an alert if no data has been received for more than 1 hour, or if fewer than 5 send requests were recorded in the last 30 minutes.
time() - max_over_time(stackdriver_pubsub:scraptime[5h]) > 3600 or sum_over_time(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count[30m]) < 5
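Wrapped into an alerting rule, this could look as follows; the alert name, severity, and annotation text are placeholders, and the topic label exposed by the exporter depends on your configuration.

groups:
  - name: pubsub-freshness
    rules:
      - alert: PubsubSendRequestDataStale
        # Fires when no scrape time was recorded in the last hour,
        # or fewer than 5 send requests were counted in the last 30 minutes.
        expr: >
          time() - max_over_time(stackdriver_pubsub:scraptime[5h]) > 3600
          or sum_over_time(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count[30m]) < 5
        labels:
          severity: warning
        annotations:
          summary: "Pub/Sub topic send request data is stale or below threshold"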
5. Metrics Present but Some Label Values Missing
As seen in "Metrics Absent", we can alert when a metric is not received at all. Sometimes, however, the metric is not completely gone: it still exists for some label values while others are missing, and we need to find out which ones.
For example, suppose you have a Thanos setup that aggregates the up metrics of all your Prometheus instances. You want a single alert that does not need to be tuned or changed as more Prometheus instances are added to monitoring. I used the query below for this.
count by (project) (up{job="prometheus"} offset 1h) unless count by (project) (up{job="prometheus"})
Here, any project label value that was present 1 hour ago but is missing now will trigger an alert.
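A sketch of the full alerting rule; the alert name, for duration, and severity are assumptions.

groups:
  - name: federation-health
    rules:
      - alert: PrometheusProjectMissing
        # A project that had up{job="prometheus"} series 1 hour ago but has none now.
        expr: >
          count by (project) (up{job="prometheus"} offset 1h)
          unless count by (project) (up{job="prometheus"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No Prometheus up series received for project {{ $labels.project }}"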