Engineering process · 12 min read

Chaos Engineering at 3 AM: When Our Kubernetes Cluster Lost Its DNS

We ran a chaos experiment on our Kubernetes cluster and found a subtle DNS misconfiguration that would have caused a cascading failure. Here's how we did it and what we learned.

Tags: chaos engineering, kubernetes, DNS, failure testing, site reliability

I woke up to a cascade of PagerDuty alerts. Our Kubernetes cluster was healthy, but every service was reporting timeouts. The ingress controller was fine, pods were running, but nothing could talk to anything else. It was DNS.

Three months earlier, we had introduced a new service mesh that relied on CoreDNS for service discovery. We had tested it, obviously. But we hadn't tested what happens when a DNS pod is temporarily unreachable. Chaos engineering taught us that lesson the hard way.

The Hidden Assumption: DNS Is Always Fast

DNS queries in Kubernetes are cheap. In a healthy cluster they usually complete in a few milliseconds or less. But under load, or when a node goes down, DNS can become slow or drop packets. Many resolvers, glibc's included, wait 5 seconds per attempt by default before giving up. In a microservice architecture, a single slow DNS query can cascade into a full-blown timeout storm.

We had configured our sidecar proxies to use CoreDNS with a 1-second timeout. That seemed generous. But we had never tested what happens when CoreDNS is actually down for 30 seconds.

Example Chaos Mesh experiment to kill one CoreDNS pod for 30 seconds every 10 minutes.
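# Written against the Chaos Mesh 1.x API, which takes a `scheduler` field.
# Chaos Mesh 2.x moves recurring runs into a separate Schedule resource.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: dns-kill-experiment
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one              # pick a single matching pod
  selector:
    namespaces:
      - kube-system
    labelSelectors:
      k8s-app: kube-dns  # label carried by the CoreDNS pods
  # pod-kill is a one-shot action: how long DNS capacity stays reduced
  # depends on how quickly the replacement pod becomes ready.
  duration: '30s'
  scheduler:
    cron: '@every 10m'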

Tip

Start with a single pod kill in a non-production cluster. Monitor application logs and metrics for at least 10 minutes after the experiment ends.

The Experiment That Changed Everything

We ran the above experiment in our staging environment. The first few executions went unnoticed: our services continued to work. But on the fourth run, we saw a spike in HTTP 503 errors from a critical billing service. The logs showed that the service's gRPC client had failed to resolve the payment-gateway endpoint. Each DNS query timed out after 5 seconds (the resolver's default), and the client made three attempts, which added up to a 15-second delay. The upstream service had already timed out by then.

The root cause? The billing service was not using the 1-second timeout we thought it was. Its code had been written by a contractor who relied on the standard library's default resolver, which waits 5 seconds. We had never noticed because DNS was always fast.

The Silent Cascade: How a 30-Second DNS Kill Caused a 15-Minute Outage

  1. 00:00 Chaos experiment kills one CoreDNS pod.
  2. 00:01 CoreDNS pod is rescheduled by Kubernetes, but the new pod takes 20 seconds to become ready.
  3. 00:02 Billing service makes a DNS query for payment-gateway. The query hits the remaining CoreDNS pod, which is overloaded due to a traffic spike from other restarted pods.
  4. 00:03 DNS query times out after 5 seconds. Billing service retries, causing cascading timeouts in dependent services.
  5. 00:05 Payment gateway rejects all requests due to repeated failures. Orders start failing.
  6. 00:10 Engineers manually scale up CoreDNS, but the billing service has already exhausted retries.
  7. 00:15 Billing service is restarted, and DNS cache is cleared. Services recover.

Lesson

A single pod failure exposed a hidden dependency on default DNS timeouts. We now enforce a 1-second timeout across all services and run chaos experiments weekly.

How We Fixed It — And Made It a Practice

After the incident, we made three changes. First, we added DNS timeout middleware to every service in the mesh. Second, we introduced a chaos experiment pipeline that runs every night in staging. Third, we wrote a simple Go tool that checks DNS resolution against a timeout at the application level, because not all services run on Kubernetes.

Here's the Go tool we use to test DNS resilience in any environment:

A Go utility to test DNS resolution with a configurable timeout. Use it in CI/CD to validate that your services handle slow DNS gracefully.
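package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"strconv"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("Usage: dns-chaos <hostname> [timeout_ms]")
		os.Exit(1)
	}
	hostname := os.Args[1]
	timeoutMs := 1000 // default: 1 second
	if len(os.Args) >= 3 {
		parsed, err := strconv.Atoi(os.Args[2])
		if err != nil || parsed <= 0 {
			fmt.Printf("invalid timeout_ms %q\n", os.Args[2])
			os.Exit(1)
		}
		timeoutMs = parsed
	}
	timeout := time.Duration(timeoutMs) * time.Millisecond

	// PreferGo selects Go's built-in resolver so the custom Dial is used.
	resolver := net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			// This timeout only bounds connecting to the DNS server.
			d := net.Dialer{Timeout: timeout}
			return d.DialContext(ctx, network, address)
		},
	}

	// The context deadline bounds the whole lookup, including resolver retries.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)

	start := time.Now()
	ips, err := resolver.LookupHost(ctx, hostname)
	elapsed := time.Since(start)
	cancel()

	if err != nil {
		fmt.Printf("FAIL: %v (took %v)\n", err, elapsed)
		os.Exit(1)
	}
	fmt.Printf("OK: %v (took %v)\n", ips, elapsed)
}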

Warning

Do not run chaos experiments in production without a rollback plan and alerting. Start with a small blast radius — kill a single pod, not a whole node.

Building a Culture of Resilience

Chaos engineering is not a one-time project. It's a practice. We now have a weekly "Game Day" where engineers volunteer to run experiments. The rule: you must fix at least one issue you find. This has shifted our mindset from "it works" to "we know how it fails."

The DNS outage was a turning point. We learned that assumptions are the real enemy. Chaos engineering forces you to test those assumptions in a controlled way. If you haven't killed a pod today, you don't know if your system can survive.

47% reduction in unplanned outages after 6 months of weekly chaos experiments.

Getting Started Without Breaking Everything

  1. Pick a single service and a single failure mode (e.g., "kill one pod of service X").
  2. Set up monitoring dashboards for that service's latency, error rate, and resource usage.
  3. Run the experiment in a staging environment first. Observe the behavior.
  4. Document what you learned. Share it with the team.
  5. Gradually increase the blast radius: kill multiple pods, introduce network latency, partition nodes.
  6. Automate the experiments and run them on a schedule.
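
Once experiments run unattended, it helps to have an automatic stop condition tied to the metrics from step 2. The sketch below shows one possible guard based on error rate. The `sample` type, the `shouldAbort` function, and the 5-point threshold are all assumptions for illustration, not something Chaos Mesh or any monitoring tool provides.

```go
package main

import "fmt"

// sample is a hypothetical snapshot of one service's request counters.
type sample struct {
	Total  int
	Failed int
}

// errorRate returns the failed fraction, treating an idle window as healthy.
func (s sample) errorRate() float64 {
	if s.Total == 0 {
		return 0
	}
	return float64(s.Failed) / float64(s.Total)
}

// shouldAbort reports whether the error rate has risen more than maxIncrease
// above the baseline measured before the experiment started.
func shouldAbort(baseline, current sample, maxIncrease float64) bool {
	return current.errorRate()-baseline.errorRate() > maxIncrease
}

func main() {
	baseline := sample{Total: 1000, Failed: 2} // measured before the experiment
	during := sample{Total: 1000, Failed: 80}  // measured while the fault is active

	// Assumed threshold: abort if the error rate rises by more than 5 points.
	fmt.Println("abort:", shouldAbort(baseline, during, 0.05))
}
```

In practice the counters would come from whatever metrics system backs the dashboards, and an abort would trigger the rollback plan for the experiment.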

Note

The goal is not to break things. The goal is to understand the system's failure modes before they happen in production.

Chaos engineering gave us the confidence to deploy faster. We know our system can handle a pod failure, slow DNS, or a network partition. And when something unexpected happens, we have the data to debug it quickly.

Start small. Kill one pod. See what happens. You'll be surprised.

Frequently asked questions

What is chaos engineering?

Chaos engineering is the practice of intentionally introducing failures into a system to test its resilience and uncover weaknesses before they cause real outages.

How is chaos engineering different from traditional testing?

Traditional testing focuses on expected inputs and outputs. Chaos engineering explores unexpected failures — like a node going down or a network partition — and observes how the system behaves as a whole.

What tools are used for chaos engineering?

Popular tools include Chaos Mesh (for Kubernetes), LitmusChaos, Gremlin, and AWS Fault Injection Simulator. Each supports different types of fault injection.

When should you not run chaos experiments?

Avoid running experiments on systems without proper monitoring and rollback plans. Never experiment on a system that cannot tolerate any failure — start in a staging environment.