Feature flags are supposed to make rollouts safe. Flip a switch, test with 1% of users, ramp up, done. But when something goes wrong during a rollout, the debugging becomes a nightmare. Unlike a code change that you can revert, a feature flag rollout introduces state that lives outside your deployment. The flag evaluation can be cached, overridden, or silently fail in ways that are hard to trace.
This is the story of the Live-at-5 incident — a feature flag rollout that caused a cascading failure across three services. We'll walk through the debugging process step by step, from the first alert to the root cause, and then discuss the tools and practices we built to prevent it from happening again.
The Incident: Live at 5 PM on a Thursday
We were rolling out a new payment flow — a redesigned checkout page that used a new fraud detection service. The feature flag was named checkout_v2. The rollout plan was simple: 5% on day one, 25% on day two, 50% on day three, then 100%. Day one went fine. Day two, the rollout to 25% was scheduled for 5 PM.
At 5:03 PM, PagerDuty lit up. Error rates on the payments service spiked from 0.1% to 12%. The checkout service was timing out. The fraud service was returning 503s. I was the on-call engineer.
Initial Triage: Is It the Flag or Something Else?
My first instinct was to check recent deployments. None. Then I checked the flag. The rollout percentage was set to 25% as planned. But I noticed something odd: the flag was evaluated in three different services — checkout, payments, and fraud. Each service had its own copy of the flag configuration. That's when I suspected flag drift.
Flag drift: when the same flag key has different configurations across services or environments. In our case, the fraud service appeared to be at 100% rollout while checkout and payments were at 25%. The fraud service was sending all traffic through the new flow, not just the canary.
Tracing the Flag Evaluation
I needed to see what each service was evaluating. Our flag system (LaunchDarkly) had a debug logging mode, but it wasn't enabled. So I had to grep through logs manually. I found the issue: the fraud service's flag evaluation was returning 'true' for every request, regardless of the rollout percentage. Why?
// Pseudocode of what the fraud service was doing.
// Assumes the LaunchDarkly Node.js server-side SDK; the function name is illustrative.
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const flagClient = LaunchDarkly.init(process.env.SDK_KEY);

async function runFraudCheck(userId) {
  const user = { key: userId };
  // Bug: the rollout percentage was never read from the server, so the
  // SDK returned the hardcoded default below for every request.
  const showNewFlow = await flagClient.variation(
    'checkout_v2',
    user,
    true // default value = true, which behaves like a 100% rollout
  );
  // Because the default was true, and the flag evaluation failed silently,
  // every user got the new flow.
  if (showNewFlow) {
    // call new fraud detection
  }
}
The fraud service was running an older version of the LaunchDarkly SDK that had a bug: if the initial connection to the flag server timed out (which happened here because of a network partition), the SDK fell back to the default value without logging an error. The default was true, so the flag was effectively at 100% for the fraud service.
The Fix: No Silent Defaults
The immediate fix was to change the default to false and restart the fraud service. That stopped the cascade. But the real fix was more systemic: never use a default that enables a new feature. The default should always be the old behavior. And we needed to log every flag evaluation failure.
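A minimal sketch of the "default to the old path" rule, under some assumptions: resolveFlag is a hypothetical helper, and the detail object is assumed to have the shape { value, reason: { kind, errorKind } }, modeled on the evaluation detail that flag SDKs such as LaunchDarkly return. The point is that a failed evaluation is reported and never enables the new behavior.

```javascript
// Hypothetical helper; the name and log shape are illustrative.
// `detail` is assumed to look like { value, reason: { kind, errorKind } }.
function resolveFlag(flagKey, detail, log = console.warn) {
  const failed = !detail || !detail.reason || detail.reason.kind === 'ERROR';
  if (failed) {
    // Never trust the value on failure: report it and take the old path.
    log(JSON.stringify({
      event: 'flag_fallback',
      flagKey,
      fallbackReason: detail?.reason?.errorKind ?? 'NO_EVALUATION_DETAIL',
    }));
    return false;
  }
  return detail.value === true;
}

// Usage: a failed evaluation keeps the user on the old checkout flow,
// even though the SDK handed back a value of true.
const useNewFlow = resolveFlag('checkout_v2', {
  value: true,
  reason: { kind: 'ERROR', errorKind: 'CLIENT_NOT_READY' },
});
```

With this shape, the hardcoded default passed to the SDK no longer decides what happens on failure; the wrapper does, and it always chooses the old behavior.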
Live-at-5 Incident Timeline
- 17:00 Rollout to 25% triggered via LaunchDarkly UI
- 17:03 PagerDuty alert: payment service error rate > 5%
- 17:05 On-call engineer acknowledges; checks deployments: none
- 17:08 Checks flag configuration; notices fraud service has 100% rollout
- 17:12 Disables flag for fraud service; error rate drops
- 17:20 Root cause: SDK fallback to default 'true' due to network timeout
- 17:45 Fix deployed: change default to false, add logging
Lesson
Never let a feature flag default to the new behavior. Always default to the old (safe) path. Log every evaluation failure with the flag key and fallback reason.
Building a Better Debugging Toolkit
After the incident, we made three changes to how we debug feature flag rollouts. These are not revolutionary — they're basic hygiene — but they would have caught this bug in staging.
1. Log every flag evaluation decision: flag key, user ID (or anonymous identifier), rollout percentage, evaluated variant, and any fallback reason. Use structured logging (JSON) so you can query by flag key.
2. Add a flag health check endpoint: each service exposes a /health/flag endpoint that returns the current evaluated state of active flags for a test user. Automated tests hit this endpoint after a rollout change to verify consistency across services.
3. Use a canary with gradual ramp-up: automate the rollout in 1% increments with a 10-minute cooldown between steps. If error rates exceed a threshold, auto-rollback.
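The ramp-up rule in the third item can be sketched as a small decision function that runs after each cooldown. This is an illustration rather than our production controller: nextRolloutAction, its option names, and the 5% error threshold are assumptions chosen for the example, and the code that reads error rates and updates the flag is left out.

```javascript
// Cooldown between steps, as described above (10 minutes).
const COOLDOWN_MS = 10 * 60 * 1000;

// Decide what to do once the cooldown for the current step has elapsed.
// errorRate is a fraction (0.05 means 5%).
function nextRolloutAction(currentPercent, errorRate, options) {
  const { targetPercent, stepPercent = 1, maxErrorRate } = options;
  if (errorRate > maxErrorRate) {
    // Auto-rollback: turn the flag off for everyone.
    return { action: 'rollback', percent: 0 };
  }
  if (currentPercent >= targetPercent) {
    return { action: 'done', percent: currentPercent };
  }
  return {
    action: 'advance',
    percent: Math.min(currentPercent + stepPercent, targetPercent),
  };
}

// Usage: ramping from 5% toward 25% with a 5% error budget.
const rampOptions = { targetPercent: 25, maxErrorRate: 0.05 };
console.log(nextRolloutAction(5, 0.001, rampOptions)); // { action: 'advance', percent: 6 }
console.log(nextRolloutAction(6, 0.12, rampOptions));  // { action: 'rollback', percent: 0 }
```

Keeping the decision separate from the scheduling makes it easy to unit test, and a scheduler only has to wait COOLDOWN_MS between calls.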
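// Example of logging a flag evaluation.
// Assumes flagClient is an initialized LaunchDarkly server-side client
// and logger is a structured (JSON) logger.
async function getFeatureFlag(flagKey, user, defaultValue = false) {
  const start = Date.now();
  let value = defaultValue;
  let reason;
  let error;
  try {
    // variationDetail returns the value plus the reason it was chosen,
    // so one call covers both the evaluation and the diagnostics.
    const detail = await flagClient.variationDetail(flagKey, user, defaultValue);
    value = detail.value;
    reason = detail.reason;
  } catch (e) {
    error = e.message;
  }
  // The evaluation reason does not carry the rollout percentage; record
  // that from the flag configuration if you need it in the log.
  logger.info({
    event: 'flag_evaluation',
    flagKey,
    userId: user.key,
    value,
    defaultValue,
    reasonKind: reason?.kind, // 'ERROR' means the SDK fell back to defaultValue
    fallbackReason: reason?.errorKind ?? error,
    durationMs: Date.now() - start
  });
  return value;
}
Flag Drift Detection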
Flag drift is when the same flag has different configurations across environments or services. We now run a daily cron job that queries the flag management API and compares the configurations across all our services and environments. If a flag is set to 100% in staging but 25% in production, we get an alert.
- Check that the rollout percentage is consistent across services that evaluate the same flag.
- Check that the flag's targeting rules are identical in staging and production.
- Check that no service has a stale SDK version that might fall back to a different default.
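The comparison at the heart of these checks can be sketched as a pure function. The input shape is an assumption made for the example: one object per service, keyed by flag, holding the rollout percentage and the fallback default that service uses. Fetching that data from the flag management API is outside the scope of the sketch, and detectDrift is a hypothetical name.

```javascript
// configsByService (assumed shape):
//   { serviceName: { flagKey: { rolloutPercentage, defaultValue } } }
// Returns one entry per flag whose configuration differs between services.
function detectDrift(configsByService) {
  const byFlag = new Map(); // flagKey -> Map(config signature -> services)
  for (const [service, flags] of Object.entries(configsByService)) {
    for (const [flagKey, config] of Object.entries(flags)) {
      const signature = JSON.stringify({
        rolloutPercentage: config.rolloutPercentage,
        defaultValue: config.defaultValue,
      });
      if (!byFlag.has(flagKey)) byFlag.set(flagKey, new Map());
      const groups = byFlag.get(flagKey);
      if (!groups.has(signature)) groups.set(signature, []);
      groups.get(signature).push(service);
    }
  }
  const drifted = [];
  for (const [flagKey, groups] of byFlag) {
    if (groups.size > 1) {
      drifted.push({
        flagKey,
        groups: [...groups].map(([signature, services]) => ({
          config: JSON.parse(signature),
          services,
        })),
      });
    }
  }
  return drifted;
}

// Usage: the state we were in during the incident.
const report = detectDrift({
  checkout: { checkout_v2: { rolloutPercentage: 25, defaultValue: false } },
  payments: { checkout_v2: { rolloutPercentage: 25, defaultValue: false } },
  fraud: { checkout_v2: { rolloutPercentage: 100, defaultValue: true } },
});
console.log(JSON.stringify(report, null, 2));
```

Grouping services by configuration, rather than just returning a boolean, means the alert can say which services disagree and how.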
What I Wish I Knew Before the Incident
A feature flag rollout is not a deployment. You can't revert it by rolling back a commit. You have to trace the flag's state across every service that evaluates it.
The Live-at-5 incident was a painful lesson. But it forced us to treat feature flags as first-class state in our system. Now, every flag rollout includes a pre-flight check that verifies flag consistency, a gradual canary ramp, and comprehensive logging. The next time a flag goes wrong, we'll know exactly where to look.
Frequently asked questions
How do I debug a feature flag that works in staging but fails in production?
Check for environment-specific flag configurations, stale caches, and differences in flag evaluation logic. Use structured logging to capture the flag key, target context, and evaluated result at each service boundary.
What is flag drift and how do I prevent it?
Flag drift occurs when the same flag has different values across environments (e.g., staging vs. production). Prevent it by using a single source of truth for flag definitions, automating environment synchronization, and adding drift detection alerts.
Should I log every feature flag evaluation in production?
Yes, but use sampled logging for high-traffic flags to avoid overwhelming your log system. At minimum, log all evaluations for flags involved in active rollouts. Include the flag key, user ID, rollout percentage, and evaluated variant.
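One way to sample, sketched under assumptions: shouldLogEvaluation and its field names are hypothetical, and the 1% default rate is arbitrary. Hashing the flag key and user ID makes the decision deterministic, so a sampled user is logged consistently across requests, while evaluations for active rollouts and any fallback are always logged.

```javascript
// FNV-1a style 32-bit hash over UTF-16 code units.
function hash32(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Decide whether to log one evaluation.
// evaluation: { flagKey, userId, activeRollout, fellBack }
function shouldLogEvaluation(evaluation, sampleRate = 0.01) {
  // Always keep the evaluations you will need during an incident.
  if (evaluation.activeRollout || evaluation.fellBack) return true;
  const bucket = hash32(`${evaluation.flagKey}:${evaluation.userId}`) / 0x100000000;
  return bucket < sampleRate; // bucket is in [0, 1)
}

// Usage: a flag in an active rollout is logged regardless of the sample rate.
console.log(shouldLogEvaluation({
  flagKey: 'checkout_v2',
  userId: 'user-123',
  activeRollout: true,
  fellBack: false,
})); // true
```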
What's the best way to test feature flag rollouts before going to production?
Run canary rollouts in production with a small percentage of traffic (e.g., 1%). Add automated flag health checks that verify the flag behaves as expected in the canary group. Use feature flag testing in staging with realistic traffic patterns.