What this usually means
Kubernetes DNS resolution failures typically stem from one of three layers: (1) the CoreDNS deployment itself is unhealthy (crash loop, resource limits, or misconfiguration), (2) the pod's resolv.conf is wrong (due to dnsPolicy or dnsConfig settings), or (3) network connectivity from the pod to the DNS service is blocked (network policies, security groups, or CNI issues). The default cluster DNS (CoreDNS) runs as a Deployment in the `kube-system` namespace, exposed via a ClusterIP service named `kube-dns`. When a pod tries to resolve, it sends a DNS query to the cluster IP of kube-dns (typically 10.96.0.10). If that service endpoint doesn't point to healthy CoreDNS pods, or if traffic on port 53 is dropped, resolution fails. Non-obvious causes include: CoreDNS ConfigMap with broken forwarders, missing custom upstream resolvers, or a corrupt liveness probe that kills CoreDNS prematurely.
The first ten minutes — establish facts before touching code.
- 1Run `kubectl get pods -n kube-system -l k8s-app=kube-dns` to check CoreDNS pod status. If any are not Running/Ready, describe them.
- 2Run `kubectl -n kube-system logs <coredns-pod>` for error messages like `Failed to get startup probes` or dial errors.
- 3From a pod in the same namespace as the failing pod, run `nslookup kubernetes.default.svc.cluster.local <kube-dns-service-ip>` or `nslookup google.com <kube-dns-service-ip>`. Replace service IP with output from `kubectl get svc -n kube-system kube-dns`.
- 4Inspect the failing pod's `/etc/resolv.conf` via `kubectl exec <pod> -- cat /etc/resolv.conf`. Expected: nameserver <kube-dns-cluster-ip>, search <namespace>.svc.cluster.local svc.cluster.local cluster.local, options ndots:5.
- 5Check if a NetworkPolicy exists in the namespace that might block egress to UDP 53 on the DNS service IP: `kubectl get networkpolicies -n <namespace>` and describe each.
The specific files, logs, configs, and dashboards that usually own this bug.
- searchCoreDNS logs: `kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100`
- searchCoreDNS ConfigMap: `kubectl -n kube-system get configmap coredns -o yaml`
- searchkube-dns service endpoints: `kubectl -n kube-system get endpoints kube-dns`
- searchPod's resolv.conf: `kubectl exec <pod> -- cat /etc/resolv.conf`
- searchNetwork policies in the pod's namespace: `kubectl get networkpolicy -n <ns>`
- searchPod's DNS policy: `kubectl get pod <pod> -o json | jq .spec.dnsPolicy` (often `ClusterFirst`)
- searchCluster node sysctl: `sysctl net.netfilter.nf_conntrack_udp_timeout` (default 30s can cause timeouts under load)
Practical causes, not theory. These are the things you will actually find.
- warningCoreDNS pods are OOMKilled or hitting resource limits: check `kubectl describe pod -n kube-system <coredns-pod>` for OOMKilled.
- warningCoreDNS ConfigMap has invalid syntax or broken forward block pointing to a non-existent upstream.
- warningNetworkPolicy in the namespace blocks egress to the kube-dns service IP on UDP 53.
- warningPod's dnsPolicy set to `None` (custom dnsConfig) with missing or wrong nameservers.
- warningkube-dns service endpoints are empty because CoreDNS pods failed to register: often due to pod network issues.
- warningNode has iptables rules that drop DNS traffic, e.g., from a misconfigured firewall or security group.
- warningCoreDNS liveness probe is too aggressive, causing pod restarts before queries are answered.
Concrete fix directions. Pick the one that matches your root cause.
- buildIf CoreDNS is OOMKilled: increase memory request/limit in the CoreDNS deployment: `kubectl edit deployment -n kube-system coredns` and set `spec.template.spec.containers[0].resources.limits.memory` to at least 256Mi.
- buildIf CoreDNS ConfigMap is broken: restore from backup or recreate via `kubectl apply -f coredns-config.yaml` using the default template (see official docs).
- buildIf a NetworkPolicy blocks DNS: add an egress rule allowing UDP/TCP to the kube-dns service IP and port 53, or add a blanket allow-egress-to-kube-dns policy.
- buildIf pod's resolv.conf is wrong: set `dnsPolicy: ClusterFirst` (default) in the pod spec, or provide correct `dnsConfig`.
- buildIf kube-dns endpoints are empty: check CoreDNS pod logs for startup errors, ensure pods have network connectivity, and consider restarting CoreDNS pods.
- buildIf iptables drops: run `iptables -L -n | grep -i dns` on the node; remove offending rules or adjust network plugin configuration.
A fix you cannot prove is a guess. Close the loop.
- verifiedAfter fix, exec into the failing pod and run `nslookup kubernetes.default.svc.cluster.local` — must return the service cluster IP.
- verifiedRun `nslookup google.com` from the pod — must return an external IP (if internet egress is allowed).
- verifiedCheck that CoreDNS pods are Running and Ready: `kubectl get pods -n kube-system -l k8s-app=kube-dns`.
- verifiedVerify the kube-dns service endpoints are populated: `kubectl get endpoints -n kube-system kube-dns` should show IPs of CoreDNS pods.
- verifiedUse `kubectl run -it --rm debug --image=busybox -- sh` in the failing namespace and run `nslookup <service>` to confirm resolution works.
- verifiedMonitor CoreDNS metrics if Prometheus is available: check `coredns_dns_request_count_total` and `coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}`.
Things that make this bug worse or harder to find.
- warningDo not blindly delete CoreDNS pods — this may cause transient DNS failures across the cluster.
- warningDo not modify the kube-dns service type from ClusterIP to NodePort or LoadBalancer without understanding the implications.
- warningDo not set `dnsPolicy: ClusterFirstWithHostNet` unless the pod needs host network — it bypasses kube-dns and uses node's resolv.conf.
- warningDo not assume DNS failure is always CoreDNS: check pod-level dnsConfig first.
- warningDo not ignore network policies: a deny-all egress policy will break DNS. Always allow egress to kube-dns if using restrictive policies.
- warningDo not set ndots above 5 in pod dnsConfig — can cause excessive DNS queries and resolution failures.
CoreDNS OOMKilled After Deploying 100 Microservices
Timeline
- 10:15Team deploys 100 microservices to a new namespace via GitOps pipeline.
- 10:20Alerts: many pods crash with 'Temporary failure in name resolution'.
- 10:22Check CoreDNS pods: 2 of 3 are CrashLoopBackOff.
- 10:25Describe CoreDNS pod: OOMKilled with exit code 137.
- 10:30Check CoreDNS logs: 'Failed to get startup probes: context deadline exceeded' and 'dial tcp: i/o timeout'.
- 10:35Check memory usage: CoreDNS pods had 70Mi limit, new services caused ~200Mi usage.
- 10:40Increase memory limit to 512Mi, scale up to 5 replicas, rollout restart.
- 10:45All CoreDNS pods stable. Verify DNS resolution: nslookup works.
We had just rolled out a GitOps update that added 100 microservices to a new namespace. Almost immediately, our monitoring lit up with pod crash loops. The common error across all pods was 'Temporary failure in name resolution' — classic DNS failure. I checked the CoreDNS pods in kube-system and saw two out of three were CrashLoopBackOff. That's when I knew this was cluster-wide.
I described a failing CoreDNS pod and saw 'OOMKilled' with exit code 137. Memory limit was 70Mi, but actual usage spiked to 200Mi because the new services triggered a flood of DNS queries. The coredns logs showed context deadline exceeded and dial errors because the pod was thrashing. I also noticed kube-dns endpoints had only one IP, so the surviving pod was overloaded.
I edited the CoreDNS deployment to increase memory limit to 512Mi and added two more replicas. After a rollout restart, all five pods came up healthy. I verified by execing into a failing pod and running nslookup — it resolved correctly. The lesson: always monitor CoreDNS resource usage against query load, and set appropriate limits from the start.
Root cause
CoreDNS pods hit memory limit due to increased DNS query volume from new services, causing OOMKill and crash loop.
The fix
Increased CoreDNS memory limit to 512Mi and scaled replicas from 3 to 5.
The lesson
Always set CoreDNS resource requests/limits based on expected cluster load; monitor CoreDNS memory usage during rollouts.
Every pod inherits a DNS resolver configuration from the cluster, typically through the `dnsPolicy: ClusterFirst` default. This tells the kubelet to write `/etc/resolv.conf` with the cluster DNS service IP (e.g., 10.96.0.10) as the nameserver and a search suffix like `<namespace>.svc.cluster.local svc.cluster.local cluster.local`. When a pod queries a short name like `my-service`, it first tries `my-service.<namespace>.svc.cluster.local`, then `my-service.svc.cluster.local`, etc. This is why ndots:5 matters — names with fewer than 5 dots are tried with the search domains first.
The actual DNS server is CoreDNS (or kube-dns in older clusters), which runs as a Deployment in kube-system. It reads a ConfigMap (coredns) with a Corefile that defines how to resolve queries: internal cluster domains are answered from the internal registry, and external domains are forwarded to upstream resolvers (like 8.8.8.8 or node's /etc/resolv.conf). The kube-dns service is a ClusterIP that load-balances across CoreDNS pods. If the service endpoints are missing, or if a CoreDNS pod is unhealthy, queries will time out.
Understanding this flow is critical because a failure can happen at any step: the pod's resolv.conf, the network path to the service IP, the CoreDNS pod itself, or the upstream forwarder.
The CoreDNS ConfigMap is often the culprit. Get it with `kubectl -n kube-system get configmap coredns -o yaml`. The Corefile section should have a valid 'forward' directive. A common mistake is a forward block pointing to an IP that doesn't exist or is unreachable. For example: `forward . 8.8.8.8:53` is fine, but if you put a private IP without proper network access, CoreDNS will return SERVFAIL. Check logs: `kubectl -n kube-system logs -l k8s-app=kube-dns | grep -i forward`.
Another issue is the 'kubernetes' plugin block. It must have the correct cluster domain (default `cluster.local.`). If you see `plugin/kubernetes: Failed to connect to apiserver`, CoreDNS can't reach the API server — check RBAC or network connectivity. Also, the 'health' and 'ready' plugins should be enabled for proper liveness probes. If missing, probes may fail even if CoreDNS is functioning.
A lesser-known problem: the pod's dnsConfig overriding the nameserver. If a pod sets `dnsConfig.nameservers` to a list that doesn't include the cluster DNS, or if `dnsPolicy: None` is used with incomplete configuration, resolution fails. Always verify the pod's /etc/resolv.conf matches expectations.
Network policies are a frequent non-obvious cause. If your namespace has a default deny egress policy, DNS queries to the kube-dns service IP will be blocked unless you explicitly allow it. The kube-dns service IP is typically 10.96.0.10, but it can vary. You need a policy that allows egress to that IP on UDP port 53 (and TCP 53 for large responses). Example: `kubectl get networkpolicy -n default` may show a policy with no egress rules.
To diagnose, run `kubectl exec <pod> -- nslookup kubernetes.default 10.96.0.10`. If it times out but `nslookup 8.8.8.8` works (using external DNS), the problem is likely network policy blocking cluster DNS. Also check if the kube-dns service has endpoints: `kubectl get endpoints -n kube-system kube-dns` should list CoreDNS pod IPs. If empty, CoreDNS isn't registering.
Another network-level cause: nodes have security group rules that block UDP 53 between pods. On AWS EKS, for example, the node security group must allow UDP 53 inbound from other nodes. If using Calico or Cilium with custom host-level rules, check those too.
Under high query load, CoreDNS can be evicted due to memory pressure. The default memory request is 70Mi, which is low for clusters with hundreds of services. Monitor CoreDNS memory usage with `kubectl top pod -n kube-system -l k8s-app=kube-dns`. If usage exceeds requests, consider increasing both requests and limits. Also, ensure CoreDNS pods are scheduled on different nodes for high availability.
Another performance issue is the conntrack table on nodes. UDP DNS queries create conntrack entries that may be dropped if the table is full. Check `sysctl net.netfilter.nf_conntrack_max` and the current count with `conntrack -S`. If near max, increase it or consider using TCP for DNS by setting `force_tcp` in CoreDNS config.
Finally, the `ndots` setting in pod resolv.conf affects query volume. With ndots:5, a query for a name with fewer than 5 dots (e.g., `my-svc`) triggers up to 6 DNS queries (search domains + raw name). If you have many short-named services, increase ndots or use fully qualified names to reduce load.
Frequently asked questions
Why does my pod resolve internal services but not external domains?
This usually means CoreDNS is healthy for cluster domain queries but its forward plugin cannot reach external resolvers. Check the CoreDNS ConfigMap's `forward` block: if it points to a specific upstream (e.g., 8.8.8.8), ensure that IP is reachable from the CoreDNS pod. Also check if a network policy blocks egress to external IPs. Another cause: CoreDNS is using the node's /etc/resolv.conf via `forward . /etc/resolv.conf`, but the node's resolv.conf is misconfigured or points to a resolver that doesn't allow recursion.
How do I test DNS from inside a pod without any tools?
Use `kubectl exec <pod> -- cat /etc/resolv.conf` to see the nameserver. Then use `kubectl exec <pod> -- nslookup <hostname> <nameserver>` if `nslookup` is available. If not, you can use `kubectl exec <pod> -- sh -c 'host <hostname>'` or `getent hosts <hostname>`. For a minimal HTTP-based test, use `wget -O- http://<service-name>.<namespace>.svc.cluster.local:80` — if it fails with a name resolution error, DNS is broken.
What does 'connection timed out; no servers could be reached' mean?
It means the resolver (usually CoreDNS) did not respond within the timeout period. This happens when the kube-dns service IP is unreachable from the pod (network policy, iptables drop, or the service has no endpoints) or when CoreDNS pods are too busy or unhealthy. Check if the pod can reach the kube-dns IP on port 53 using `kubectl exec <pod> -- nc -zv <cluster-dns-ip> 53` (if netcat is available). If connectivity works, the problem is likely CoreDNS itself.
How do I restart CoreDNS without affecting the cluster?
Because CoreDNS is a Deployment, you can do a rolling restart: `kubectl rollout restart -n kube-system deployment/coredns`. This will restart pods one by one, minimizing impact. However, if the root cause is resource exhaustion or a bad ConfigMap, restarting won't fix it permanently. Always fix the underlying issue first.
Can a misconfigured CNI cause DNS failures?
Yes. Some CNI plugins manage DNS resolution via the pod's /etc/resolv.conf or through node-level DNS settings. For example, Calico's `felix` may apply iptables rules that affect DNS. If the CNI plugin is incorrectly configured, it can block UDP 53 or fail to set up the pod network properly, causing DNS timeouts. Check CNI logs and node iptables rules if other debugging steps fail.