Prometheus alert rules are the backbone of any monitoring setup. I've seen teams copy rules from example repositories without understanding a single line of PromQL, then wonder why they get paged for garbage. Reading alert rules is a skill—you need to parse the expression, interpret the intent behind the thresholds, and spot the subtle bugs that can cause false positives or, worse, missed alerts.
This isn't a tutorial on writing alert rules. This is about reading them. You're on-call, an alert fires, and you need to understand what the rule author was thinking. Or you're reviewing a PR that adds a rule and you want to catch mistakes before they hit production. Let's get into the actual mechanics.
The Anatomy of an Alert Rule
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # Aggregate both sides by instance so the 5xx series are divided by the
        # instance total rather than matched one-to-one against themselves.
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} for instance {{ $labels.instance }}"
Besides its name, every rule has four components. The expr is the PromQL expression; it evaluates to an instant vector, and each element of that vector becomes a potential alert. The for clause sets how long the condition must hold before the alert fires. Labels are attached to the alert and used for routing and deduplication in Alertmanager. Annotations provide human-readable context: these are what you see in the notification.
The most common mistake I see is overloading labels with information that should be in annotations. Labels are for identity—they determine how alerts group and route. Annotations are for the human. If you put the current value in a label, every value change creates a new alert fingerprint, and you'll get duplicates. Keep labels stable.
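To make the label/annotation split concrete, here is a minimal sketch of a rule that keeps its labels static and puts the changing value in an annotation. The group name and the commented-out `current_value` label are made up for illustration; the metric and threshold are borrowed from the example above.

```yaml
groups:
  - name: labels-vs-annotations   # hypothetical group name
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical      # stable: identical on every evaluation
          team: backend           # stable: used for routing
          # current_value: "{{ $value }}"   # avoid: changes on every evaluation,
          #                                 # so the alert identity keeps changing
        annotations:
          # The changing value belongs here; annotations are not part of the identity.
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
```

When reading a rule, anything templated under `labels:` deserves a second look; anything templated under `annotations:` is normal.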
The expr: Where Most Bugs Live
The expression is where the logic lives, and it's where reading gets hard. Let's pick apart the example above. The expression computes the ratio of 5xx requests to total requests over the last 5 minutes. The numerator counts only 5xx responses; the denominator counts all status codes. If the total rate is zero (no traffic), the division is 0/0, which yields NaN. Any comparison with NaN is false, so the element is dropped and no alert fires. That's correct behavior here, but only because the numerator can never exceed the denominator. In rules where the two sides are different metrics, a non-zero numerator over a zero denominator yields +Inf, which passes any '>' threshold and causes spurious alerts.
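One way authors guard a ratio against near-empty denominators is to add a traffic floor with `and`. The sketch below assumes a floor of 1 request per second, which is an arbitrary illustrative number, and a made-up alert name; it is one possible pattern, not the only way to handle the problem.

```yaml
groups:
  - name: guarded-ratio           # hypothetical group name
    rules:
      - alert: HighErrorRateWithTrafficFloor   # hypothetical alert name
        expr: |
          (
            sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum by (instance) (rate(http_requests_total[5m]))
          ) > 0.05
          and
          # Only consider instances serving more than 1 req/s (assumed floor).
          sum by (instance) (rate(http_requests_total[5m])) > 1
        for: 5m
        labels:
          severity: critical
```

Both sides of the `and` carry only the `instance` label, so they match one-to-one. When you see this shape in a rule, read the second half as "and there is enough traffic for the ratio to mean something".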
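Another trap: thresholding on 'increase()' instead of 'rate()'. 'increase()' is essentially 'rate()' multiplied by the window length, so a threshold tuned for [5m] silently means something different when someone changes the window to [10m]. 'rate()' normalizes per second, which keeps the threshold independent of the window. Neither is portable across load levels on its own, though: 50 errors in 5 minutes may be alarming at low traffic and negligible at high traffic. For that you want a ratio of errors to total requests, as in the example above.
Also watch out for range vectors. The [5m] in the example is the lookback window for the rate. With a 15s scrape interval, 5m covers about 20 samples, enough for a stable rate. With [1m] and a 15s scrape you get only about 4 samples, so the rate jumps around, and a single failed scrape leaves you with 3. 'rate()' needs at least two samples in the window to return anything at all. A common rule of thumb is a range of at least 4x the scrape interval; treat that as the floor, not the target.
Be careful with set operators like 'or' and 'unless' in alert expressions. They match on label sets, so alerts can appear or disappear based purely on whether a series exists. For example, 'metric_a > 0 unless metric_b > 0' fires for every metric_a series above zero that has no metric_b series above zero with the same labels. If metric_b disappears entirely, say because its exporter died, the suppression disappears with it and the alert fires on metric_a alone. Conversely, a metric_b series that is stuck above zero silently suppresses the alert.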
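As an illustration of how 'unless' tends to show up in real rules, here is a sketch that suppresses an error alert while an instance is flagged for maintenance. The `maintenance_mode` metric, the alert name, and the 1 error/s threshold are all hypothetical.

```yaml
groups:
  - name: unless-example          # hypothetical group name
    rules:
      - alert: ErrorsOutsideMaintenance   # hypothetical alert name
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) > 1
          unless on (instance)
          # maintenance_mode is an assumed gauge: 1 while the instance is in maintenance.
          maintenance_mode == 1
        for: 5m
        labels:
          severity: warning
```

The `on (instance)` makes the matching explicit: without it, the two sides would have to agree on every label, and the suppression would probably never match. Note what happens if `maintenance_mode` stops being exported: the right-hand side is empty, nothing is suppressed, and the alert behaves as if the 'unless' were not there.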
The 'for' Clause: Patience or Pitfall?
The 'for' clause is the minimum duration the condition must be continuously true before the alert fires. Until then the alert sits in the pending state. This prevents flapping. However, the condition is only checked at the rule evaluation interval (default 1m). With 'for: 5m', the condition must be true at every evaluation for 5 consecutive minutes; a single false evaluation resets the timer. And if the metric is scraped every 30s but the rule is evaluated every 1m, a 30-second spike can fall between two evaluations and be missed entirely.
I once inherited a rule with 'for: 10m' on a latency alert. The SLO was 99.9%, but the team kept missing brief latency spikes that lasted 2–3 minutes. They'd resolve before the alert ever fired. The fix was to lower 'for' to 1m or use a recording rule that aggregates more granularly. The 'for' clause is not a band-aid for noisy metrics—fix the metric stability instead.
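The interaction between evaluation interval and 'for' is easier to see in a rule file. A rule group can override the global evaluation interval with its own `interval`. The sketch below assumes a latency histogram named `http_request_duration_seconds_bucket` and a 0.5s threshold; both are illustrative, not taken from the incident described above.

```yaml
groups:
  - name: latency                 # hypothetical group name
    interval: 30s                 # evaluate this group every 30s instead of the global default
    rules:
      - alert: HighLatencyP99     # hypothetical alert name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        # With a 30s interval, the condition must hold on every one of the
        # evaluations that fall inside this 2-minute window.
        for: 2m
        labels:
          severity: warning
```

When reading a rule, check the group's `interval` before judging the 'for' value: the same 'for: 2m' means two or three checks at a 1m interval and roughly twice as many at 30s.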
Annotations: Your Lifeline at 3 AM
Annotations are what show up in PagerDuty, Slack, or email. A good annotation tells you what's broken, how bad it is, and where to look. The example above uses Go template syntax: '{{ $labels.instance }}' and '{{ $value }}'. You can use any label from the alert, plus built-in variables like $value, $labels, and $externalURL.
Common annotation fields: summary (one-liner), description (detailed), runbook_url (link to playbook), and dashboard (link to a Grafana panel). I recommend always including the current value and a link to a relevant dashboard. The worst annotation I've seen: 'description: Something is wrong'. That's not helpful. Be specific.
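A sketch of what a reasonably complete annotation block might look like. The runbook and dashboard URLs are placeholders on example.com, and the annotation key `dashboard` is just a convention; Prometheus treats all annotation keys as free-form strings.

```yaml
groups:
  - name: annotated-example       # hypothetical group name
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          # State the current value and the threshold so the responder can judge severity.
          description: "5xx ratio is {{ $value | humanizePercentage }} on {{ $labels.instance }} (threshold 5%)"
          # Placeholder URLs; substitute your own runbook and dashboard locations.
          runbook_url: "https://runbooks.example.com/high-error-rate"
          dashboard: "https://grafana.example.com/d/http-overview?var-instance={{ $labels.instance }}"
```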
The Case of the Silent Pager
- 02:15 PagerDuty alert fires: 'HighErrorRate' with description 'Error rate is 5.2% for instance web-03'
- 02:16 Engineer checks the dashboard; error rate is 5.2%, but it's been climbing for an hour.
- 02:20 Engineer notices the alert rule has 'for: 10m', but the error rate was 4.9% 11 minutes ago, so the alert just fired.
- 02:25 Investigation reveals a gradual memory leak causing intermittent 500s; the alert threshold was 5%, but the rule only fired after 10 consecutive minutes above threshold.
- 02:40 Rollback deployed. Incident resolved.
- 02:45 Postmortem: The 'for' duration masked the slow ramp-up. The team decides to add a separate alert with a lower threshold (3%) and no 'for' for early warning.
Lesson
The 'for' clause can delay detection of gradual degradations. Use a combination of rules: one with a low threshold and no 'for' for early warning, and a higher threshold with 'for' for confirmed issues.
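The two-tier setup from the postmortem could look roughly like this. The thresholds (3% and 5%) and the 10m duration come from the timeline above; the alert name `ErrorRateElevated` and the severity values are assumptions.

```yaml
groups:
  - name: error-rate-tiers        # hypothetical group name
    rules:
      # Early warning: lower threshold, no 'for', fires on the first evaluation above 3%.
      - alert: ErrorRateElevated  # hypothetical alert name
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (instance) (rate(http_requests_total[5m])) > 0.03
        labels:
          severity: warning
      # Confirmed issue: higher threshold, must hold for 10 consecutive minutes.
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
```

The warning tier will be noisier by design, so it usually makes sense to route it somewhere less intrusive than a pager.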
Reading Labels for Routing
Labels in the alert rule are merged with the metric labels. They're used by Alertmanager to route alerts to the right receiver (e.g., team, severity). The most important label is 'severity'—commonly 'critical', 'warning', 'info'. You might also have 'team' or 'service'.
A common antipattern is using dynamic labels like 'instance' in the rule's label section. That's redundant because the metric already has that label. Only add labels that are not already present, or override metric labels for routing purposes. Overriding a metric label is rare and should be done with caution—it changes the alert identity.
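To see why the rule's labels matter, it helps to read them next to the Alertmanager routing tree that consumes them. This is a minimal sketch of an Alertmanager config; the receiver names are hypothetical and their notification settings are left out.

```yaml
route:
  receiver: default               # fallback when no child route matches
  group_by: ['alertname', 'team']
  routes:
    # Critical backend alerts page the backend on-call.
    - matchers:
        - severity="critical"
        - team="backend"
      receiver: backend-pager
    # Warnings from any team go to chat.
    - matchers:
        - severity="warning"
      receiver: team-chat

receivers:                        # hypothetical receivers, integrations omitted
  - name: default
  - name: backend-pager
  - name: team-chat
```

With this tree, the example rule's `severity: critical` and `team: backend` labels decide which receiver gets the alert. A typo in either label value sends the alert to `default` instead, which is an easy mistake to miss in review.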
Templates: The Power and the Pitfall
Go templates in annotations are powerful, but they can also break silently. If a template references a label that doesn't exist on the alert, it renders as an empty string rather than failing, so the notification quietly loses the one detail you needed. A template that fails to parse or execute is replaced by an error string in the annotation. Always test your templates with actual data.
For annotations in rule files, 'promtool test rules' can assert on the rendered text; for Alertmanager notification templates, use the 'amtool' command-line tool or send a test alert to Alertmanager's API. I've seen templates that use '{{ $labels.team }}' when the rule has no 'team' label: the alert fires but that part of the notification is blank. That's frustrating.
of alert rules in a 2023 survey had at least one annotation template error
(That stat is made up, but it feels true based on my experience.) Point is: test your templates.
Putting It All Together: How to Read a Rule
1. Start with the alert name and labels: what is this alert about? Who will get it?
2. Read the expr line by line. Identify the metric, the aggregations, and the threshold. Convert the expression to plain English.
3. Check the for clause. Is it appropriate for the metric's volatility?
4. Look at the annotations. Do they provide enough context for a responder to triage? Is there a runbook URL?
5. Consider the evaluation interval. Rules are evaluated every 1m by default. Does the for duration make sense given the scrape interval?
6. Look for known pitfalls: division by zero, thresholds on raw counters, missing labels in templates, and range vectors that are too short for the scrape interval.
When reviewing a rule in a pull request, simulate it with actual data. Use 'promtool check rules' and 'promtool test rules' to validate syntax and test with sample metrics. Then run the expression in Prometheus's expression browser with a time range that covers past incidents.
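A unit test file for 'promtool test rules' might look like the sketch below. It assumes the rule lives in a file called `alerts.yml` (a made-up name) and that `HighErrorRate` aggregates by `instance`, carries the `severity` and `team` labels, and uses the two annotations from the example at the top of this article. The input series simulate a steady 10% error ratio.

```yaml
rule_files:
  - alerts.yml                    # hypothetical file containing the HighErrorRate rule

evaluation_interval: 1m

tests:
  - interval: 1m                  # spacing of the synthetic samples below
    input_series:
      # 60 errors per minute = 1 error/s
      - series: 'http_requests_total{instance="web-03", status="500"}'
        values: '0+60x20'
      # 540 successes per minute = 9 req/s, so errors are 10% of the total
      - series: 'http_requests_total{instance="web-03", status="200"}'
        values: '0+540x20'
    alert_rule_test:
      - eval_time: 15m            # well past the 5m 'for' duration
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: backend
              instance: web-03
            exp_annotations:
              summary: "High error rate on web-03"
              description: "Error rate is 10% for instance web-03"
```

Because the test compares the rendered annotations as well as the labels, it also catches template mistakes such as a reference to a label the alert doesn't have. If the rule in your repository differs from the assumptions above, the expected labels and annotations need adjusting to match.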
Beyond Basic Reading: Advanced Patterns
Once you're comfortable reading standard rules, you'll encounter patterns like multi-dimensional alerts (using 'by' or 'without' in the expression), alerts based on recording rules, and alerts that use 'absent()' for dead man's switches. The same principles apply: understand the metric, the aggregation, and the intent.
One pattern I like is using a recording rule to pre-process a complex expression, then alert on the recording rule. This separates the computation from the alerting logic and makes both easier to read. For example:
record: 'job:error_rate:5m' with expression 'sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))'
Then the alert rule simply does 'job:error_rate:5m > 0.05'. Clean and testable.
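Written out as a rule file, the pattern could look like this. The group name, 'for' duration, and severity are assumptions added to make the sketch complete.

```yaml
groups:
  - name: error-rate              # hypothetical group name
    rules:
      # Recording rule: the computation, aggregated per job.
      - record: job:error_rate:5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # Alert rule: only the threshold and the alerting metadata.
      - alert: HighErrorRate
        expr: job:error_rate:5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for job {{ $labels.job }}"
```

Rules within a group are evaluated in order, so keeping the recording rule above the alert in the same group means the alert sees the value computed in the same evaluation cycle. When you read an alert like this, remember to open the recording rule too; the real logic lives there.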
Reading Prometheus alert rules is a skill that improves with practice. Next time you're on-call and an alert fires, don't just acknowledge it—read the rule. Understand the expression, the for clause, the annotations. You might find a bug that's been hiding for months. And if you do, file a PR to fix it. Your future self (and your teammates) will thank you.
Frequently asked questions
What does the 'for' clause in a Prometheus alert rule actually do?
The 'for' clause specifies how long a condition must be true before the alert fires. It prevents flapping by waiting for sustained breaches. For example, 'for: 5m' means the expression must be true for 5 consecutive minutes of evaluation (each evaluation interval, usually 1m, checks the condition).
How do I silence an alert based on a label without losing the rule?
You can create a silence in Alertmanager that matches one or more labels. For example, if your rule has a label 'severity: critical', you can silence all alerts with that label for a duration. Silences are stored in Alertmanager, not in the rule file.
Why does my alert fire but then disappear immediately?
This usually means the condition is only briefly true. With no 'for' (or 'for: 0'), the alert fires the first time the condition holds and resolves at the next evaluation where it doesn't, so a short spike produces a fire-and-resolve pair. Check the expression too: 'rate()' and 'increase()' compensate for counter resets, but comparing a raw counter against a threshold does not, and the value drops back to zero whenever the process restarts. A range that is too short for the scrape interval can also make the result flicker in and out.
Can I alert on the absence of a metric?
Yes, use the 'absent()' function. For example, 'absent(up{job="api"})' fires when no 'up' series with that label exists at all. Note that a target that fails its scrape still has 'up' set to 0, so this catches a job that has vanished from service discovery, not one that is merely down. Be careful with 'for': if the metric reappears even briefly, the timer resets. Consider 'absent_over_time()' if you need to wait for sustained absence.
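A sketch of both variants side by side. The alert names, durations, and severities are illustrative assumptions.

```yaml
groups:
  - name: dead-mans-switch        # hypothetical group name
    rules:
      # Fires once the series has been missing at every evaluation for 5 minutes.
      - alert: ApiTargetsMissing  # hypothetical alert name
        expr: absent(up{job="api"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No 'up' series found for job {{ $labels.job }}"
      # Fires only if there was not a single sample in the whole 10-minute window.
      - alert: ApiTargetsMissingSustained   # hypothetical alert name
        expr: absent_over_time(up{job="api"}[10m])
        labels:
          severity: critical
```

The result of 'absent()' carries the labels from the equality matchers in the selector, which is why `{{ $labels.job }}` is available in the annotation here.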