Prometheus Alert Not Firing: Debugging Guide

What this usually means

A Prometheus alert that does not fire typically indicates a mismatch between what the rule expects and what the metrics actually deliver. This can stem from PromQL errors, label mismatches due to relabeling, staleness from missing data points, incorrect evaluation intervals, or a rule file that Prometheus never loads. Sometimes the alert is defined correctly but the Alertmanager route or receiver is misconfigured, or the alert has been silenced. The key is to systematically isolate whether the issue is in metric availability, rule evaluation, or alert delivery.

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

1. Check Prometheus target status: curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
2. Inspect the alert rule state in the Prometheus UI at /alerts; note whether the rule is 'INACTIVE', 'PENDING', or 'FIRING'
3. Verify the rule expression in the Prometheus expression browser at /graph; ensure it returns data when the condition is true
4. Check rule evaluation errors: curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.health != "ok")'
5. Confirm the Alertmanager configuration: curl http://alertmanager:9093/api/v2/status | jq -r '.config.original' | grep -A5 'route'
6. Check whether the alert is silenced: curl http://alertmanager:9093/api/v2/silences | jq '.[] | select(.status.state == "active")'
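
Steps 2 and 4 can also be scripted once the /api/v1/rules response has been saved or fetched. The following is a minimal Python sketch, assuming the response has already been parsed into a dict; the SAMPLE payload, group names, and rule names are hypothetical and only mimic the general shape of that API's output.

```python
def unhealthy_rules(rules_response):
    """Return (group, rule, health, lastError) for every rule whose health is not 'ok'."""
    problems = []
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("health") != "ok":
                problems.append(
                    (group.get("name"), rule.get("name"),
                     rule.get("health"), rule.get("lastError", ""))
                )
    return problems


def alert_states(rules_response):
    """Map each alerting rule name to its current state."""
    states = {}
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting":
                states[rule.get("name")] = rule.get("state")
    return states


# Hypothetical, trimmed-down response used only for illustration.
SAMPLE = {
    "status": "success",
    "data": {"groups": [{
        "name": "node",
        "rules": [
            {"type": "alerting", "name": "NodeDown", "state": "inactive",
             "health": "ok", "lastError": ""},
            {"type": "alerting", "name": "DiskFull", "state": "inactive",
             "health": "err", "lastError": "parse error in expression"},
        ],
    }]},
}

if __name__ == "__main__":
    print(unhealthy_rules(SAMPLE))
    print(alert_states(SAMPLE))
```

A rule that is listed with health "ok" and state "inactive" is loading and evaluating, so the next thing to examine is what its expression returns.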

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- /etc/prometheus/rules/*.yml or the rule files specified in prometheus.yml
- Prometheus UI /alerts page for rule state and evaluation errors
- Prometheus expression browser (/graph) to test the PromQL query
- Prometheus logs: journalctl -u prometheus --since '5 minutes ago' or /var/log/prometheus/prometheus.log
- Alertmanager UI at /#/alerts to see if the alert was received
- Alertmanager logs: journalctl -u alertmanager --since '5 minutes ago'
- Prometheus targets page (/targets) to ensure metric sources are up and scraped

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- PromQL expression uses functions like rate() with a range too short to contain two samples, causing empty results
- Relabeling in scrape_config drops or renames labels that the alert rule references in its expression, labels, or annotations
- Staleness: the series has had no new sample within the lookback delta (5m by default), so instant queries no longer return it
- Alert rule evaluation interval is longer than the metric scrape interval, so short threshold breaches between evaluations are missed
- Rule file syntax error preventing the group from loading: check with promtool check rules /path/to/rules.yml
- Alertmanager route does not match the alert's labels (e.g., the route matches on a severity label that the alert does not carry)

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- Adjust the PromQL expression to use the correct functions and range: for counters, use rate() or increase() with a range of at least 2x the scrape interval (4x is a common safety margin)
- Fix the relabeling config to preserve critical labels: check labeldrop/labelkeep regexes and the order of relabel steps so that every label the rule references survives
- Ensure metrics are being sent regularly, or raise the lookback delta (--query.lookback-delta, which affects all queries); use absent() to detect missing time series
- Align the rule evaluation interval with the scrape interval: set evaluation_interval to the same value as scrape_interval or a multiple of it
- Validate rule file syntax: promtool check rules /path/to/rules.yml; fix indentation or quoting errors
- Update Alertmanager route matchers to match the alert's labels exactly; use regex matchers if needed

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- After the fix, force the metric to breach the threshold and observe the alert state transitions: INACTIVE -> PENDING -> FIRING
- Use promtool test rules to run unit tests against synthetic input series defined in a test file: promtool test rules test.yml
- Check the Prometheus rules API for rule health: curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | .health'
- Simulate an alert with curl against the Alertmanager API: curl -XPOST -H 'Content-Type: application/json' http://alertmanager:9093/api/v2/alerts -d '...'
- Watch the Alertmanager logs for the received alert (per-alert messages are typically logged at debug level, so run with --log.level=debug): grep -i 'received' /var/log/alertmanager/alertmanager.log
- Verify the alert appears in the Alertmanager UI and that the route sends a notification to the configured receiver
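
For the simulated alert, the request body is a JSON array of alert objects. The following is a minimal Python sketch of how such a body could be generated; the alert name, label values, and summary text are hypothetical, and the field set shown (labels, annotations, startsAt, endsAt) is a small subset chosen for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def build_test_alert(labels, summary, duration_minutes=5, now=None):
    """Build a one-element list of alert objects for a synthetic test alert."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": labels,                      # these drive Alertmanager routing
        "annotations": {"summary": summary},
        "startsAt": now.isoformat(),           # RFC 3339 timestamp
        "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
    }]


if __name__ == "__main__":
    # Hypothetical label set; use the labels your real route matches on.
    payload = build_test_alert(
        {"alertname": "PipelineTest", "severity": "warning", "team": "infra"},
        "Synthetic alert to test routing and notification",
    )
    print(json.dumps(payload, indent=2))
```

Writing the output to a file and passing it to curl with --data @file exercises only the Alertmanager half of the pipeline, which helps separate delivery problems from rule evaluation problems.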

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Don't forget that rate() requires at least two data points in the range vector; use a range >= 2*scrape_interval
- Avoid hardcoding instance labels in alert expressions if targets are dynamic; use label selectors
- Don't assume alert rules reload automatically after changing files; send SIGHUP or POST to /-/reload (requires --web.enable-lifecycle)
- Don't ignore staleness: if a metric stops arriving, instant queries keep returning the last value for up to 5 minutes (less if a staleness marker is written), then the series disappears from results
- Don't use 'for: 0s' unless you want instant firing; it can cause flapping alerts on transient spikes
- Don't forget that alert labels must match route matchers exactly; check for trailing spaces or different casing

( 07 ) War story

The Silent Node Exporter Alert

Platform Engineer
Prometheus 2.45, Alertmanager 0.26, Node Exporter 1.6, Kubernetes 1.27

Timeline

09:15 On-call receives alert from PagerDuty: 'Node down for production-01', but it immediately resolves
09:18 Check Prometheus /alerts: rule 'NodeDown' is INACTIVE, but node_exporter on that host is down
09:22 Query up{job='node'} in expression browser: returns 0 for production-01, but alert not firing
09:25 Inspect rule: 'expr: up == 0', 'for: 1m'; looks correct
09:30 Check promtool check rules: no errors. Reload Prometheus config, no change
09:35 Look at targets page: node_exporter target is DOWN, but alert is still INACTIVE
09:40 Notice that the alert rule has a label 'team: infra' but the metric 'up' does not have that label
09:45 Check alertmanager routes: they match on 'team: infra' label which is only in the alert rule, not the metric
09:50 Realize the alert rule's 'for' duration is 1m, but the evaluation interval is 1m, causing race condition
09:55 Fix: change evaluation_interval to 30s and adjust 'for' to 2m. Alert starts firing correctly

The incident started with a flapping alert that would fire and immediately resolve. I checked the Prometheus /alerts page and saw the NodeDown rule was INACTIVE even though the target was down. I queried up{job='node'} and got 0 for the affected host, so the metric was clearly indicating a problem.

I spent 20 minutes debugging the rule expression, thinking there was a syntax error or a label mismatch. I used promtool to validate the rules, reloaded Prometheus multiple times, and even restarted the service. Nothing worked. I was about to rewrite the entire rule when I noticed the alert rule had a custom label 'team: infra' that was not present on the metric.

That led me to check the alertmanager configuration. The route matcher was looking for 'team: infra' but the alert was not being sent because the route didn't match any alerts without that label. But the deeper issue was that the 'for' duration was exactly equal to the evaluation interval, causing a race condition where the alert would resolve before the next evaluation. I changed the evaluation interval to 30s and set 'for' to 2m. After that, the alert fired reliably.

Root cause

The alert rule's 'for: 1m' matched the evaluation interval exactly, so the condition had to be true for two consecutive evaluations before the alert could fire, and a single evaluation that missed it reset the pending state. Additionally, the alertmanager route required a label that wasn't present on the alert.

The fix

Reduced evaluation_interval from 1m to 30s and increased 'for' to 2m. Also added a default route in alertmanager to catch alerts without specific labels.

The lesson

Never set 'for' duration equal to the evaluation interval; it should be at least 2x the evaluation interval to avoid race conditions. Always test the full pipeline from metric to notification.

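( 08 ) Understanding PromQL Evaluation and Staleness

Prometheus evaluates alert rules at fixed intervals. Each evaluation is an instant query, which looks back only over the lookback delta (5 minutes by default) for the most recent sample of each series. If a series has no sample in that window, or Prometheus has written a staleness marker because the target or series disappeared, the series is treated as stale and drops out of query results; the stored samples themselves remain in the TSDB. Any alert expression relying on that series then returns no data, causing the rule to remain INACTIVE.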
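
To make the lookback behaviour concrete, here is a simplified Python model of how an instant query picks a value for one series. It is a sketch under stated assumptions, not Prometheus code: the STALE sentinel stands in for a staleness marker, the function name is hypothetical, and exact boundary handling at the edge of the window is simplified.

```python
LOOKBACK_DELTA_S = 300.0  # default lookback delta of 5 minutes, in seconds
STALE = object()          # stand-in for a staleness marker (illustrative only)


def instant_value(samples, eval_ts, lookback_s=LOOKBACK_DELTA_S):
    """Return the value an instant query at eval_ts would see, or None.

    samples is a list of (timestamp_seconds, value) sorted by timestamp.
    """
    newest = None
    for ts, value in samples:
        if ts > eval_ts:
            break
        newest = (ts, value)
    if newest is None:
        return None              # no sample at or before the evaluation time
    ts, value = newest
    if value is STALE:
        return None              # series was explicitly marked stale
    if eval_ts - ts > lookback_s:
        return None              # newest sample is older than the lookback window
    return value


if __name__ == "__main__":
    scraped = [(0, 1.0), (15, 1.0), (30, 1.0)]   # scrapes stop after t=30
    print(instant_value(scraped, 60))            # still visible
    print(instant_value(scraped, 400))           # outside the window: no data
```

A rule such as `metric > 0` sees nothing in the second case, which is why the alert stays inactive instead of firing.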

Common pitfall: using a range that is too short for the scrape interval, such as rate(metric[1m]) when the scrape interval is 60s. The rate function requires at least two data points in the range. If only one sample falls inside the window, rate returns nothing. Always ensure your range is at least 2x the scrape interval, and consider using absent() to alert on missing metrics.
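
The two-sample requirement can be checked with simple arithmetic. The following Python sketch counts how many scrape timestamps fall inside a range window; the function names are hypothetical and the window boundary handling is simplified.

```python
def samples_in_range(sample_timestamps, eval_ts, range_s):
    """Timestamps that fall inside the window (eval_ts - range_s, eval_ts]."""
    return [t for t in sample_timestamps if eval_ts - range_s < t <= eval_ts]


def rate_has_data(sample_timestamps, eval_ts, range_s):
    """rate() needs at least two samples inside the range to return a result."""
    return len(samples_in_range(sample_timestamps, eval_ts, range_s)) >= 2


if __name__ == "__main__":
    scrapes = list(range(0, 601, 60))          # one scrape every 60s
    print(rate_has_data(scrapes, 630, 60))     # [1m] window holds a single sample
    print(rate_has_data(scrapes, 630, 120))    # [2m] window holds two samples
```

With a 60s scrape interval, a 1m range usually contains one sample, so the expression returns no data and the alert can never become active.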

( 09 ) Relabeling and Label Propagation

Relabeling rules in the scrape_config can drop or modify labels that are critical for alert rules and alert routing. For example, if you drop the 'instance' label via 'action: labeldrop' in metric_relabel_configs, an alert rule that selects on 'instance' will not match any series.

Best practice: use '__meta_' labels from service discovery as inputs to relabeling (they are discarded once relabeling finishes) and preserve original labels unless you explicitly need to change them. After changing relabeling, check the resulting labels on the /targets page or via /api/v1/targets. Additionally, alert rules can define their own labels in the 'labels' block; these are attached to the resulting alert and overwrite any conflicting label that came from the expression result. They do not change the stored series.
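
The effect of a label-dropping step on a rule's selectors can be reasoned about offline. The following is a minimal Python sketch, assuming a labeldrop-style step whose regex must match the whole label name; the label set, the 'env' label, and the function names are hypothetical.

```python
import re


def apply_labeldrop(labels, regex):
    """Remove every label whose name fully matches the regex."""
    pattern = re.compile(regex)
    return {name: value for name, value in labels.items()
            if not pattern.fullmatch(name)}


def missing_labels(series_labels, required):
    """Label names a rule selects on that the series no longer carries."""
    return sorted(set(required) - set(series_labels))


if __name__ == "__main__":
    before = {"__name__": "node_load1", "job": "node", "env": "prod"}
    after = apply_labeldrop(before, "env|region")
    # A rule selecting on {job="node", env="prod"} can no longer match.
    print(after)
    print(missing_labels(after, ["job", "env"]))
```

Comparing the labels a rule selects on with the labels that survive relabeling is usually faster than reading the relabel configuration line by line.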

( 10 ) Alertmanager Routing and Silences

Even if Prometheus fires an alert, it may not be delivered if Alertmanager's routing is misconfigured. Common issues: route matchers requiring a label that the alert doesn't have, or the alert being caught by a higher-priority route that sends to a dead receiver.

Silences are another hidden cause: an active silence matching the alert's labels will suppress it. Check /api/v2/silences for active silences. Also, the 'inhibit_rules' can suppress alerts based on other alerts. Use alertmanager's UI to see the alert's status and why it's not sending.
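
How a label set travels through a routing tree can be modelled in a few lines. The Python sketch below is a simplified illustration that supports only equality matching and the 'continue' flag; the dict layout, receiver names, and function names are hypothetical and do not mirror Alertmanager's configuration schema exactly.

```python
def route_matches(route, labels):
    """True if every equality matcher on the route is satisfied by the labels."""
    return all(labels.get(k) == v for k, v in route.get("match", {}).items())


def receivers_for(route, labels, inherited=None):
    """Walk the tree depth-first and return the receivers that would be notified."""
    receiver = route.get("receiver", inherited)
    hits = []
    for child in route.get("routes", []):
        if route_matches(child, labels):
            hits.extend(receivers_for(child, labels, receiver))
            if not child.get("continue", False):
                break                 # first matching sibling wins
    return hits or [receiver]         # no child matched: this node handles it


# Hypothetical routing tree for illustration.
TREE = {
    "receiver": "default",
    "routes": [
        {"match": {"team": "infra"}, "receiver": "infra-pager"},
        {"match": {"severity": "critical"}, "receiver": "oncall"},
    ],
}

if __name__ == "__main__":
    print(receivers_for(TREE, {"alertname": "NodeDown", "team": "infra",
                               "severity": "critical"}))
    print(receivers_for(TREE, {"alertname": "NodeDown"}))
```

In this model an alert carrying both labels reaches only the first matching sibling, and an alert with neither label falls back to the root receiver, which is worth checking if that receiver is a placeholder.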

( 11 ) Rule File Structure and Validation

Prometheus requires rules to be organized in groups, each with a name and a list of rules; a per-group evaluation interval is optional and defaults to the global evaluation_interval. Common errors: YAML indentation, missing 'rules' key, or using tabs instead of spaces. Use 'promtool check rules' to validate syntax.

Also check that the rule file is included in the prometheus.yml under 'rule_files'. After modifying rules, send SIGHUP to Prometheus (kill -HUP <pid>) or use the /-/reload endpoint if --web.enable-lifecycle is enabled. Verify rules loaded via /api/v1/rules.
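
The structural expectations above can be expressed as a small check over an already-parsed rule file. This Python sketch is illustrative only and is not a substitute for promtool, which also parses the PromQL and may be stricter or looser than this; the function name and sample documents are hypothetical.

```python
def check_rule_groups(doc):
    """Return a list of structural problems found in a parsed rule file (a dict)."""
    groups = doc.get("groups")
    if not isinstance(groups, list):
        return ["top-level 'groups' must be a list"]
    errors = []
    seen = set()
    for gi, group in enumerate(groups):
        name = group.get("name")
        if not name:
            errors.append(f"group #{gi}: missing 'name'")
        elif name in seen:
            errors.append(f"group '{name}': duplicate group name")
        seen.add(name)
        rules = group.get("rules")
        if not isinstance(rules, list):
            errors.append(f"group '{name}': missing 'rules' list")
            continue
        for ri, rule in enumerate(rules):
            # A rule is either an alerting rule or a recording rule, not both.
            if ("alert" in rule) == ("record" in rule):
                errors.append(f"group '{name}' rule #{ri}: need exactly one of 'alert' or 'record'")
            if not rule.get("expr"):
                errors.append(f"group '{name}' rule #{ri}: missing 'expr'")
    return errors


if __name__ == "__main__":
    good = {"groups": [{"name": "node", "rules": [
        {"alert": "NodeDown", "expr": "up == 0", "for": "2m"}]}]}
    bad = {"groups": [{"name": "node", "rules": [{"alert": "NodeDown"}]}]}
    print(check_rule_groups(good))
    print(check_rule_groups(bad))
```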

( 12 ) Race Conditions with 'for' and Evaluation Intervals

The 'for' duration in an alert rule specifies how long the condition must be true before the alert fires. However, Prometheus only checks the condition at each evaluation interval. The first evaluation that sees the condition moves the alert to pending, and it fires at the first later evaluation at which 'for' has elapsed. If 'for' is exactly equal to the evaluation interval, two consecutive evaluations must both see the condition; a breach that clears, or a series that disappears, between them resets the alert to inactive, and it never fires.

Similarly, a non-zero 'for' that is shorter than the evaluation interval does not make the alert fire sooner: it still has to wait for the next evaluation, so it behaves as if 'for' were one full interval. Best practice: set 'for' to at least 2x the evaluation interval, and ensure the evaluation interval is a multiple of the scrape interval.
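
The interaction between 'for' and the evaluation interval is easier to see as a small state machine. The following Python sketch is a simplified model, assuming one boolean per evaluation that says whether the expression returned the series; the function name is hypothetical and features such as keep_firing_for are ignored.

```python
def simulate_alert(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each evaluation.

    condition_by_eval holds one boolean per evaluation, in order.
    """
    states = []
    active_since = None
    for i, condition in enumerate(condition_by_eval):
        ts = i * eval_interval_s
        if not condition:
            active_since = None            # any miss resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts              # first evaluation that sees the breach
        if ts - active_since >= for_s:
            states.append("firing")
        else:
            states.append("pending")
    return states


if __name__ == "__main__":
    # for == evaluation interval: needs two consecutive true evaluations.
    print(simulate_alert([True, True, True], 60, 60))
    # A single missed evaluation resets the alert.
    print(simulate_alert([True, False, True, True], 60, 60))
    # for: 0s fires on the first evaluation that sees the breach.
    print(simulate_alert([True], 60, 0))
```

Running the model with a non-zero 'for' shorter than the interval gives the same sequence as 'for' equal to the interval, which matches the rounding-up behaviour described above.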

Frequently asked questions

Why does my alert rule show as INACTIVE even when the metric shows a problem in the expression browser?

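The most common reason is timing. The graph view in the expression browser shows values over a time range, while the alert rule is an instant query evaluated at the current time. If the metric has stopped arriving, the series drops out of instant-query results once it is stale (after the 5-minute lookback window, or sooner if a staleness marker was written), even though older points still appear on the graph. Use absent(metric) to detect missing metrics. Another cause is a label mismatch: a label matcher in the rule's expression may reference a label or value that the series does not carry, so the selector matches nothing. Labels in the rule's 'labels' block do not filter series; they are only attached to the resulting alert. Also remember that a rule with a 'for' clause shows PENDING, not FIRING, until that duration has elapsed.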

How can I test if my alert rule is syntactically correct?

Use 'promtool check rules /path/to/rules.yml'. This validates the YAML and PromQL syntax. For more thorough testing, use 'promtool test rules' with a test file that defines metric input and expected alert states. Example: promtool test rules test.yml where test.yml contains unit test cases.

My alert fires in Prometheus but I never receive a notification. What should I check?

First, check Prometheus's alertmanager configuration in prometheus.yml: ensure 'alerting: alertmanagers:' points to the correct Alertmanager address. Then check whether Alertmanager is receiving the alert: look at /api/v2/alerts. Next, check the route configuration: do the alert's labels match any route? Use 'amtool alert query' to list alerts and 'amtool config routes test' to see which receiver a given label set resolves to. Finally, check that the receiver itself is configured correctly (e.g., email, Slack webhook).

What does 'for: 0s' mean and when should I use it?

'for: 0s' means the alert fires immediately when the condition is true. This is useful for critical alerts that need instant notification, but it can cause flapping if the metric fluctuates. Use it sparingly and consider adding a small 'for' duration to debounce.

How do I debug a relabeling issue that prevents my alert from firing?

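Use the /api/v1/targets endpoint (or the /targets page) to see the final target labels after relabeling; the response also lists the discovered labels from before relabeling. Compare those with the labels your alert expression selects on. For labels changed by metric_relabel_configs, query the metric in the expression browser and inspect the labels on the returned series. Debug logging ('--log.level=debug') can surface scrape errors, though it may not show individual relabel steps. A quick test that your edited rule is actually loaded: temporarily add a 'debug: true' label to the alert rule and check that it appears on the rule and on any resulting alert.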
