What this usually means
A Prometheus alert that does not fire typically indicates a mismatch between what the rule expects and what the metrics actually deliver. This can stem from PromQL errors, label mismatches due to relabeling, staleness from missing data points, poorly chosen evaluation intervals, or a rule file that Prometheus never loads. Sometimes the alert is defined correctly but the Alertmanager route or receiver is misconfigured, or the alert has been silenced. The key is to systematically isolate whether the issue is in metric availability, rule evaluation, or alert delivery.
The first ten minutes — establish facts before touching code.
1. Check Prometheus target status: curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
2. Inspect the alert rule state in the Prometheus UI at /alerts; note whether the rule is 'INACTIVE', 'PENDING', or 'FIRING'
3. Verify the rule expression in the Prometheus expression browser at /graph; ensure it returns data when the condition is true
4. Check rule evaluation errors: curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.health != "ok")'
5. Confirm the Alertmanager configuration: curl http://alertmanager:9093/api/v2/status | jq '.config.original' | grep -A5 'route'
6. Check if the alert is silenced (the v2 API returns a plain JSON array): curl http://alertmanager:9093/api/v2/silences | jq '.[] | select(.status.state == "active")'
The specific files, logs, configs, and dashboards that usually own this bug.
- /etc/prometheus/rules/*.yml or rules files specified in prometheus.yml
- Prometheus UI /alerts page for rule state and evaluation errors
- Prometheus expression browser (/graph) to test the PromQL query
- Prometheus logs: journalctl -u prometheus --since '5 minutes ago' or /var/log/prometheus/prometheus.log
- Alertmanager UI at /#/alerts to see if alert was received
- Alertmanager logs: journalctl -u alertmanager --since '5 minutes ago'
- Prometheus targets page (/targets) to ensure metric sources are up and scraped
Practical causes, not theory. These are the things you will actually find.
- PromQL expression uses rate() or a similar function with a range too short to contain two samples, so it returns an empty result
- Relabeling in scrape_config drops or renames labels that the alert rule references in its expression or annotations
- Staleness: the series stopped being scraped, so instant queries no longer return it (a series without a staleness marker drops out after the lookback window, 5m by default)
- Alert rule evaluation interval is longer than the metric scrape interval, so short threshold breaches fall between evaluations
- Rule file syntax error prevents the group from loading: run promtool check rules /path/to/rules.yml
- Alertmanager route does not match the alert's labels (e.g., the route matches on a severity label that the alert does not carry)
Concrete fix directions. Pick the one that matches your root cause.
- Adjust the PromQL expression to use correct functions and range: for counters always use rate() or increase() with a range of at least 2x the scrape interval
- Fix the relabeling config to preserve critical labels: narrow the regex of any 'labeldrop' or 'labelkeep' action so the labels the rule references survive
- Ensure metrics are scraped regularly rather than relying on a longer lookback window (--query.lookback-delta); use absent() to detect missing time series
- Align rule evaluation interval with scrape interval: set evaluation_interval to the same value as scrape_interval or a multiple of it
- Validate rule file syntax: promtool check rules /path/to/rules.yml; fix indentation or quoting errors
- Update Alertmanager route matchers to match alert labels exactly; use regex matchers if needed
A fix you cannot prove is a guess. Close the loop.
- After the fix, force the metric to breach the threshold and observe the alert state transitions: INACTIVE -> PENDING -> FIRING
- Use promtool test rules to run unit tests against synthetic input series defined in a test file: promtool test rules test.yml
- Check the Prometheus rules API for rule health: curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | .health'
- Simulate an alert with curl against the Alertmanager API: curl -XPOST -H 'Content-Type: application/json' http://alertmanager:9093/api/v2/alerts -d '...'
- Monitor Alertmanager logs for the received alert (received alerts are logged at debug level, so this may require --log.level=debug): grep -i 'received' /var/log/alertmanager/alertmanager.log
- Verify the alert appears in the Alertmanager UI and the route sends a notification to the configured receiver
Things that make this bug worse or harder to find.
- Don't forget that rate() requires at least two data points in the range vector; use range >= 2*scrape_interval
- Avoid hardcoding instance labels in alert expressions if targets are dynamic; match on stable labels such as job, or use regex matchers
- Don't assume alert rules reload automatically after changing files; send SIGHUP or POST to /-/reload (requires --web.enable-lifecycle)
- Don't ignore staleness: when a scrape fails or a target stops exposing a series, the series is marked stale and instant queries stop returning it; a series that simply stops receiving samples without a staleness marker is still returned for up to 5 minutes (the default lookback delta)
- Don't use 'for: 0s' unless you want instant firing; it can cause flapping alerts on transient spikes
- Don't forget that alert labels must match route matchers exactly; check for trailing spaces or different casing
The Silent Node Exporter Alert
Timeline
- 09:15 On-call receives alert from PagerDuty: 'Node down for production-01' but immediately resolves
- 09:18 Check Prometheus /alerts: rule 'NodeDown' is INACTIVE, but node_exporter on that host is down
- 09:22 Query up{job='node'} in expression browser: returns 0 for production-01, but alert not firing
- 09:25 Inspect rule: 'expr: up == 0', 'for: 1m' – looks correct
- 09:30 Check promtool check rules: no errors. Reload Prometheus config, no change
- 09:35 Look at targets page: node_exporter target is DOWN, but alert is still INACTIVE
- 09:40 Notice that the alert rule has a label 'team: infra' but the metric 'up' does not have that label
- 09:45 Check alertmanager routes: they match on 'team: infra' label which is only in the alert rule, not the metric
- 09:50 Realize the alert rule's 'for' duration is 1m, but the evaluation interval is 1m, causing race condition
- 09:55 Fix: change evaluation_interval to 30s and adjust 'for' to 2m. Alert starts firing correctly
The incident started with a flapping alert that would fire and immediately resolve. I checked the Prometheus /alerts page and saw the NodeDown rule was INACTIVE even though the target was down. I queried up{job='node'} and got 0 for the affected host, so the metric was clearly indicating a problem.
I spent 20 minutes debugging the rule expression, thinking there was a syntax error or a label mismatch. I used promtool to validate the rules, reloaded Prometheus multiple times, and even restarted the service. Nothing worked. I was about to rewrite the entire rule when I noticed the alert rule had a custom label 'team: infra' that was not present on the metric.
That led me to check the alertmanager configuration. The route matcher was looking for 'team: infra' but the alert was not being sent because the route didn't match any alerts without that label. But the deeper issue was that the 'for' duration was exactly equal to the evaluation interval, causing a race condition where the alert would resolve before the next evaluation. I changed the evaluation interval to 30s and set 'for' to 2m. After that, the alert fired reliably.
Root cause
The alert rule's 'for: 1m' matched the evaluation interval exactly, so firing depended on two consecutive evaluations both seeing the condition, and in this incident the alert never reached FIRING. Additionally, the Alertmanager route required a label that wasn't present on the alert.
The fix
Reduced evaluation_interval from 1m to 30s and increased 'for' to 2m. Also added a default route in alertmanager to catch alerts without specific labels.
The lesson
Avoid setting the 'for' duration equal to the evaluation interval; keeping it at least 2x the evaluation interval means the condition is confirmed across several evaluations before the alert fires. Always test the full pipeline from metric to notification.
Prometheus evaluates alert rules at fixed intervals. An instant query only returns a series whose latest sample falls within the lookback delta (5 minutes by default), and a series drops out sooner if it was explicitly marked stale, for example after a failed scrape. The samples stay in the TSDB, but any alert expression relying on that series returns no data, so the rule remains INACTIVE.
Common pitfall: using rate(metric[1m]) when the scrape interval is 1m. The rate function requires at least two data points in the range, and a window that short usually contains only one sample, so rate returns nothing. Always ensure your range is at least 2x the scrape interval (4x is a common choice because it tolerates a missed scrape), and consider using absent() to alert on missing metrics.
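A minimal sketch of both ideas as alerting rules, assuming a 15s scrape interval. The metric name http_requests_total, the job and code labels, the thresholds, and the group name are illustrative assumptions, not taken from a real deployment.

```yaml
groups:
  - name: example-availability          # hypothetical group name
    rules:
      # absent() returns a single series with value 1 when no matching
      # series exists, so this rule fires when the metric goes missing.
      - alert: RequestMetricMissing
        expr: absent(http_requests_total{job="api"})
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "http_requests_total is missing for job api"

      # With a 15s scrape interval, a 1m window normally holds about four
      # samples, so rate() still has two points after one missed scrape.
      - alert: HighErrorRate
        expr: sum by (job) (rate(http_requests_total{job="api", code=~"5.."}[1m])) > 5
        for: 2m
        labels:
          severity: critical
```

Pairing a threshold rule with an absent() rule on the same metric means a vanished series produces its own alert instead of silently leaving the threshold rule INACTIVE.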
Relabeling rules in the scrape_config can drop or modify labels that are critical for alert routing. For example, if you drop the 'instance' label via 'action: labeldrop', your alert rule referencing 'instance' will not match any metrics.
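Best practice: remember that '__meta_' labels come from service discovery and are discarded after target relabeling, so copy anything you need into a regular label and preserve original labels unless you have a reason to drop them. After changing relabeling, run promtool check config against prometheus.yml and compare the resulting target labels on the /targets page or in /api/v1/targets. Additionally, alert rules can define their own labels in the 'labels' block; these are attached to the alert, not to the stored series, and a rule label with the same name as a series label overwrites the series value on the alert.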
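The following scrape_config fragment is a sketch of relabeling that keeps the labels an alert is likely to depend on. The job name, the target address, the host label, and the pod_template_hash label are all hypothetical.

```yaml
scrape_configs:
  - job_name: node                       # hypothetical job
    static_configs:
      - targets: ["production-01:9100"]  # hypothetical target
    relabel_configs:
      # Target relabeling: derive a 'host' label from the scrape address.
      # 'instance' is left untouched, so rules selecting on it keep working.
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: host
        replacement: '$1'
    metric_relabel_configs:
      # Drop one specific noisy label by exact name. Prometheus anchors
      # relabel regexes, so this cannot accidentally match 'instance' or 'job'.
      - action: labeldrop
        regex: 'pod_template_hash'
```

Keeping labeldrop regexes as narrow as possible is the design choice here: a broad pattern such as 'pod.*|inst.*' would remove 'instance' and break any rule that selects or templates on it.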
Even if Prometheus fires an alert, it may not be delivered if Alertmanager's routing is misconfigured. Common issues: route matchers requiring a label that the alert doesn't have, or the alert being caught by a higher-priority route that sends to a dead receiver.
Silences are another hidden cause: an active silence matching the alert's labels will suppress it. Check /api/v2/silences for active silences. Inhibition rules ('inhibit_rules') can also suppress an alert while a matching source alert is firing. The Alertmanager UI can display silenced and inhibited alerts, which shows whether the alert arrived and what is suppressing it.
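A sketch of an Alertmanager configuration with a fallback receiver and one inhibition rule, to show where these mismatches arise. Receiver names and label values are assumptions; the receivers are left without notification integrations for brevity, so they would accept alerts but send nothing.

```yaml
route:
  # Root route: any alert that matches no child route lands here,
  # so alerts missing a 'team' or 'severity' label are not lost.
  receiver: default-notifications
  group_by: [alertname]
  routes:
    # Child routes are tried in order; the first match wins
    # unless 'continue: true' is set on it.
    - matchers:
        - team="infra"
      receiver: infra-pager
    - matchers:
        - severity=~"warning|info"
      receiver: team-chat

inhibit_rules:
  # While a critical alert is firing, suppress warning alerts that
  # share the same alertname and instance.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: [alertname, instance]

receivers:
  - name: default-notifications
  - name: infra-pager
  - name: team-chat
```

If an alert seems to vanish, compare its label set against each matcher in order: an alert carrying severity "Warning" (capitalized) would fall through to the root receiver in this sketch.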
Prometheus requires rules to be in groups. Each group needs a name and a 'rules' list; the group's 'interval' is optional and defaults to the global evaluation_interval. Common errors: YAML indentation, missing 'rules' key, or using tabs instead of spaces. Use 'promtool check rules' to validate syntax.
Also check that the rule file is included in the prometheus.yml under 'rule_files'. After modifying rules, send SIGHUP to Prometheus (kill -HUP <pid>) or use the /-/reload endpoint if --web.enable-lifecycle is enabled. Verify rules loaded via /api/v1/rules.
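For reference, a sketch of the two pieces that have to line up: the rule_files entry in prometheus.yml and a well-formed group in a matching file. The interval values, the group name, and the file name node.yml are assumptions.

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 15s         # assumed value
  evaluation_interval: 30s     # used by groups that set no 'interval'

rule_files:
  - /etc/prometheus/rules/*.yml   # globs are allowed; the file below matches
---
# /etc/prometheus/rules/node.yml (hypothetical file name)
groups:
  - name: node-availability    # every group needs a name
    rules:                     # and a 'rules' list, indented with spaces
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 2m
```

A file that passes promtool check rules but sits outside every rule_files pattern is never evaluated, which is why the group should also be confirmed in /api/v1/rules after a reload.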
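The 'for' duration in an alert rule specifies how long the condition must be true before the alert fires. However, Prometheus only checks the condition at each evaluation interval. The first evaluation that finds the condition true moves the alert to PENDING; it moves to FIRING at the first later evaluation at which the 'for' duration has elapsed. If 'for' is exactly equal to the evaluation interval, the alert therefore fires on the second consecutive evaluation that sees the condition, which confirms it with only two checks. If any evaluation in between finds the condition false or the series missing, the pending state is reset.
Similarly, a nonzero 'for' that is shorter than the evaluation interval is effectively rounded up to one interval, because the alert cannot leave PENDING before the next evaluation. Only 'for: 0s' (or omitting 'for') fires on the first matching evaluation. Best practice: set 'for' to at least 2x the evaluation interval so brief spikes are filtered out, and ensure the evaluation interval is a multiple of the scrape interval.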
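A minimal sketch of a rule group that follows this guidance, assuming a 15s scrape interval. The group name, the job value, and the annotation text are illustrative.

```yaml
groups:
  - name: node-availability
    # Group-level override of the global evaluation_interval.
    # 30s is a multiple of the assumed 15s scrape interval.
    interval: 30s
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        # Four evaluation intervals: the condition has to be seen at
        # several consecutive evaluations before the alert fires.
        for: 2m
        labels:
          team: infra          # attached to the alert, used for routing
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for at least 2 minutes"
```

The trade-off is detection delay: a longer 'for' filters more noise but pushes the first notification back by the same amount.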
Frequently asked questions
Why does my alert rule show as INACTIVE even when the metric shows a problem in the expression browser?
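Check first that you are comparing like with like. The rule evaluates its full expression, including the comparison, at the current instant, while the graph view may be showing older samples or the bare metric without the threshold; run the exact 'expr' from the rule in the table view. If it returns nothing there, the series may be stale or missing (use absent(metric) to detect that), or the selector's label matchers may not match the labels the series has after relabeling. If it does return results, confirm the rule is actually loaded (check /api/v1/rules) and is not simply PENDING while its 'for' duration elapses. Labels in the rule's 'labels' block are attached to the resulting alert and do not affect which series the expression matches.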
How can I test if my alert rule is syntactically correct?
Use 'promtool check rules /path/to/rules.yml'. This validates the YAML and PromQL syntax. For more thorough testing, use 'promtool test rules' with a test file that defines metric input and expected alert states. Example: promtool test rules test.yml where test.yml contains unit test cases.
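As an illustration, a test.yml along these lines could exercise a NodeDown rule. It assumes a file rules.yml next to it that defines alert NodeDown with expr up{job="node"} == 0, for: 2m, a label team: infra, and no annotations; the instance value is hypothetical.

```yaml
# test.yml, run with: promtool test rules test.yml
rule_files:
  - rules.yml                # assumed to contain the NodeDown rule described above

evaluation_interval: 1m

tests:
  - interval: 1m             # spacing of the synthetic samples below
    input_series:
      # Samples at 0m..5m: the target is up twice, then down from 2m onward.
      - series: 'up{job="node", instance="production-01:9100"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # At 3m the condition has held for only 1m, so the alert is still
      # pending and no firing alerts are expected (exp_alerts omitted).
      - eval_time: 3m
        alertname: NodeDown
      # By 5m the condition has held since 2m, longer than 'for: 2m'.
      - eval_time: 5m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              job: node
              instance: production-01:9100
              team: infra
```

The expected labels must list every label on the fired alert apart from alertname, including those inherited from the input series, or the test reports a mismatch.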
My alert fires in Prometheus but I never receive a notification. What should I check?
First, check Prometheus's Alertmanager configuration in prometheus.yml: ensure 'alerting: alertmanagers:' points to the correct Alertmanager address. Then check Alertmanager itself: is it receiving the alert? Look at /api/v2/alerts for the alert. Next, check the route configuration: do the alert's labels match any route? Use 'amtool alert query' to list the alerts Alertmanager holds and 'amtool config routes test' to see which receiver a given label set would reach. Finally, check if the receiver is configured correctly (e.g., email, Slack webhook).
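The first of those checks concerns a block like the following in prometheus.yml. The host and port are assumptions that mirror the alertmanager:9093 address used in the commands earlier in this guide.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"   # host:port only, without scheme or path
```

If this block is missing or points at the wrong host, alerts reach FIRING in Prometheus but Alertmanager never sees them.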
What does 'for: 0s' mean and when should I use it?
'for: 0s' means the alert fires immediately when the condition is true. This is useful for critical alerts that need instant notification, but it can cause flapping if the metric fluctuates. Use it sparingly and consider adding a small 'for' duration to debounce.
How do I debug a relabeling issue that prevents my alert from firing?
Use the /api/v1/targets endpoint (or the /targets page) to see each target's labels before and after relabeling: 'discoveredLabels' holds the set before relabeling and 'labels' the final one. Compare the final labels with those your alert expression selects on. Running Prometheus with '--log.level=debug' can surface scrape errors, though it is not a step-by-step trace of relabeling. A quick test: temporarily add a 'debug: true' label in the alert rule to see if it appears in the alert.