I woke up to a cascade of PagerDuty alerts. Our Kubernetes cluster was healthy, but every service was reporting timeouts. The ingress controller was fine, pods were running, but nothing could talk to anything else. It was DNS.
Three months earlier, we had introduced a new service mesh that relied on CoreDNS for service discovery. We had tested it, obviously. But we hadn't tested what happens when a DNS pod is temporarily unreachable. Chaos engineering taught us that lesson the hard way.
The Hidden Assumption: DNS Is Always Fast
DNS queries in Kubernetes are cheap. Inside the cluster they usually complete in a millisecond or two. But under load, or when a node goes down, DNS can become slow or drop packets. Many applications rely on a resolver default that waits 5 seconds per attempt before giving up (the glibc default, for example). In a microservice architecture, a single slow DNS query can cascade into a full-blown timeout storm.
We had configured our sidecar proxies to use CoreDNS with a 1-second timeout. That seemed generous. But we had never tested what happens when CoreDNS is actually down for 30 seconds.

    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: dns-kill-experiment
      namespace: chaos-testing
    spec:
      action: pod-kill
      mode: one
      selector:
        namespaces:
          - kube-system
        labelSelectors:
          k8s-app: kube-dns
      duration: '30s'
      # Note: the inline `scheduler` field is Chaos Mesh 1.x syntax;
      # 2.x replaces it with a separate Schedule resource.
      scheduler:
        cron: '@every 10m'

Start with a single pod kill in a non-production namespace. Monitor application logs and metrics for at least 10 minutes after the experiment ends.
The Experiment That Changed Everything
We ran the above experiment in our staging environment. The first few executions went unnoticed; our services continued to work. But on the fourth run, we saw a spike in HTTP 503 errors from a critical billing service. The logs showed that the service's gRPC client had failed to resolve the payment-gateway endpoint. Each DNS query timed out after 5 seconds (the resolver's default), and the client made three attempts, adding up to a 15-second delay. The upstream service had already timed out by then.
The root cause? The billing service was resolving names with a 5-second timeout, not the 1-second timeout we thought applied everywhere. The code had been written by a contractor who used the standard library's default resolver rather than one with our 1-second setting. We had never noticed because DNS was always fast.
The Silent Cascade: How a 30-Second DNS Kill Caused a 15-Minute Outage
- 00:00 Chaos experiment kills one CoreDNS pod.
- 00:01 CoreDNS pod is rescheduled by Kubernetes, but the new pod takes 20 seconds to become ready.
- 00:02 Billing service makes a DNS query for payment-gateway. The query hits the remaining CoreDNS pod, which is overloaded by a traffic spike from other restarted pods.
- 00:03 DNS query times out after 5 seconds. Billing service retries, causing cascading timeouts in dependent services.
- 00:05 Payment gateway rejects all requests due to repeated failures. Orders start failing.
- 00:10 Engineers manually scale up CoreDNS, but the billing service has already exhausted its retries.
- 00:15 Billing service is restarted, and its DNS cache is cleared. Services recover.
Lesson
A single pod failure exposed a hidden dependency on default DNS timeouts. We now enforce a 1-second timeout across all services and run chaos experiments weekly.
How We Fixed It — And Made It a Practice
After the incident, we implemented three changes. First, we added DNS timeout middleware to every service in the mesh. Second, we introduced a chaos experiment pipeline that runs every night in staging. Third, we wrote a simple Go tool that tests DNS resolution against a timeout at the application level, because not all services run on Kubernetes.
Here's the Go tool we use to test DNS resilience in any environment:

    package main

    import (
    	"context"
    	"fmt"
    	"net"
    	"os"
    	"time"
    )

    func main() {
    	if len(os.Args) < 2 {
    		fmt.Println("Usage: dns-chaos <hostname> [timeout_ms]")
    		os.Exit(1)
    	}
    	hostname := os.Args[1]

    	timeout := 1000 // default: 1 second, in milliseconds
    	if len(os.Args) >= 3 {
    		if _, err := fmt.Sscanf(os.Args[2], "%d", &timeout); err != nil || timeout <= 0 {
    			fmt.Printf("invalid timeout_ms: %q\n", os.Args[2])
    			os.Exit(1)
    		}
    	}
    	limit := time.Duration(timeout) * time.Millisecond

    	// PreferGo selects Go's built-in resolver so the custom Dial below is used.
    	resolver := net.Resolver{
    		PreferGo: true,
    		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
    			d := net.Dialer{Timeout: limit}
    			return d.DialContext(ctx, network, address)
    		},
    	}

    	// The dialer timeout only bounds connection setup, so the context
    	// deadline bounds the lookup as a whole.
    	ctx, cancel := context.WithTimeout(context.Background(), limit)
    	defer cancel()

    	start := time.Now()
    	ips, err := resolver.LookupHost(ctx, hostname)
    	elapsed := time.Since(start)
    	if err != nil {
    		fmt.Printf("FAIL: %v (took %v)\n", err, elapsed)
    		os.Exit(1)
    	}
    	fmt.Printf("OK: %v (took %v)\n", ips, elapsed)
    }

Do not run chaos experiments in production without a rollback plan and alerting. Start with a small blast radius: kill a single pod, not a whole node.
Building a Culture of Resilience
Chaos engineering is not a one-time project. It's a practice. We now have a weekly "Game Day" where engineers volunteer to run experiments. The rule: you must fix at least one issue you find. This has shifted our mindset from "it works" to "we know how it fails."
The DNS outage was a turning point. We learned that assumptions are the real enemy. Chaos engineering forces you to test those assumptions in a controlled way. If you haven't killed a pod today, you don't know if your system can survive.
Reduction in unplanned outages after 6 months of weekly chaos experiments
Getting Started Without Breaking Everything
1. Pick a single service and a single failure mode (e.g., "kill one pod of service X").
2. Set up monitoring dashboards for that service's latency, error rate, and resource usage.
3. Run the experiment in a staging environment first. Observe the behavior.
4. Document what you learned. Share it with the team.
5. Gradually increase the blast radius: kill multiple pods, introduce network latency, partition nodes.
6. Automate the experiments and run them on a schedule.
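Once experiments are automated, each run needs a pass or fail signal. One simple option is to compare the error rate during the experiment with a baseline taken just before it. The sketch below assumes you can pull request and error counts from your monitoring system; the `window` type, `steadyStateHeld`, the counts, and the 1% tolerance are all hypothetical.

```go
package main

import "fmt"

// window holds request counts observed over one period. Hypothetical type.
type window struct {
	Requests int
	Errors   int
}

// errorRate returns the fraction of failed requests, or 0 for an empty window.
func (w window) errorRate() float64 {
	if w.Requests == 0 {
		return 0
	}
	return float64(w.Errors) / float64(w.Requests)
}

// steadyStateHeld reports whether the error rate during the experiment
// stayed within tolerance of the baseline measured before it.
func steadyStateHeld(baseline, during window, tolerance float64) bool {
	return during.errorRate() <= baseline.errorRate()+tolerance
}

func main() {
	// Made-up counts for illustration.
	baseline := window{Requests: 1000, Errors: 2}
	during := window{Requests: 1000, Errors: 40}

	fmt.Println("steady state held:", steadyStateHeld(baseline, during, 0.01))
}
```

A failed check is the interesting outcome: it points at the service to investigate and gives you something concrete to document and share.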
The goal is not to break things. The goal is to understand the system's failure modes before they happen in production.
Chaos engineering gave us the confidence to deploy faster. We know our system can handle a pod failure, a slow DNS, or a network partition. And when something unexpected happens, we have the data to debug it quickly.
Start small. Kill one pod. See what happens. You'll be surprised.
Frequently asked questions
What is chaos engineering?
Chaos engineering is the practice of intentionally introducing failures into a system to test its resilience and uncover weaknesses before they cause real outages.
How is chaos engineering different from traditional testing?
Traditional testing focuses on expected inputs and outputs. Chaos engineering explores unexpected failures — like a node going down or a network partition — and observes how the system behaves as a whole.
What tools are used for chaos engineering?
Popular tools include Chaos Mesh (for Kubernetes), LitmusChaos, Gremlin, and AWS Fault Injection Simulator. Each supports different types of fault injection.
When should you not run chaos experiments?
Avoid running experiments on systems without proper monitoring and rollback plans. Never experiment on a system that cannot tolerate any failure — start in a staging environment.