What this usually means
An HTTP connection timeout means the client's TCP three-way handshake never completed: the SYN left the client, but no SYN-ACK came back in time. This is distinct from a read timeout (data stalled after the connection was established) and from a connection refused error (an RST was received). The root cause is usually one of four things: (1) the server never saw the SYN (firewall or network drop), (2) the server saw it but could not respond (backlog full, kernel drop), (3) the SYN-ACK was sent but dropped on the return path, or (4) routing is asymmetric, so the SYN-ACK is discarded as invalid or never reaches the client.
In production, the most common culprit is a stateful firewall or security group that silently drops packets, often because of conntrack table exhaustion or rate limiting. Second is a TCP listen backlog that is too small for the connection rate, so SYNs are dropped once the queue fills. Third is a mismatch in TCP keepalive or TIME_WAIT settings that causes resource exhaustion on either side.
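The three failure modes above can be told apart from the client with nothing but the standard library. The following is a minimal sketch, not a definitive tool: the function name `probe`, the default timeouts, and the `HEAD` payload are illustrative assumptions.

```python
import socket

def probe(host, port, connect_timeout=3.0, read_timeout=5.0):
    """Name the phase that failed: connect timeout, refused, or read timeout."""
    try:
        # Blocks until the handshake completes or connect_timeout expires.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except (socket.timeout, TimeoutError):
        return "connect timeout"        # no SYN-ACK arrived in time
    except ConnectionRefusedError:
        return "connection refused"     # the peer answered with an RST
    except OSError as exc:
        return f"connect error: {exc}"  # DNS failure, unreachable host, ...
    with sock:
        sock.settimeout(read_timeout)   # from here on, timeouts are read timeouts
        try:
            sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            data = sock.recv(1)
        except (socket.timeout, TimeoutError):
            return "read timeout"       # connected, but no data came back
        except OSError as exc:
            return f"read error: {exc}"
    return "ok" if data else "closed by peer"
```

Only the "connect timeout" result points at the handshake problems this guide covers; the other results belong to different investigations.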
The first ten minutes — establish facts before touching code.
1. Run 'tcpdump -ni any "host <server-ip> and tcp[tcpflags] & tcp-syn != 0"' on the client during a timeout. The filter matches both SYN and SYN-ACK packets, so you can confirm that the SYN is sent and that no SYN-ACK ever appears.
2. On the server, run 'ss -tlnp | grep :<port>' to verify the service is listening. For a listening socket, Recv-Q is the current accept queue depth and Send-Q is the configured backlog limit. If Recv-Q is near Send-Q, the backlog is full.
3. Check conntrack on the server: run 'conntrack -S' and look for increasing 'insert_failed' or 'drop' counters. If they are climbing, the connection tracking table is full.
4. Test with a direct IP connection that bypasses DNS and proxies: 'curl -v --connect-timeout 10 http://<server-ip>:<port>', then compare with the hostname. If the direct connection works but the hostname fails, suspect DNS or the proxy.
5. Check firewall rules on both client and server for rules that drop or reject NEW state connections. Run 'iptables -L -n -v' (or 'nft list ruleset' on nftables) and watch the packet/byte counters on drop rules.
6. Measure RTT between client and server with 'ping -c 10 <server-ip>'. Packet loss above 0.1% or RTT spikes point to network issues. ICMP can be treated differently from TCP, so a clean ping does not rule out SYN drops.
7. On the server, check kernel TCP parameters: 'sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog net.core.netdev_max_backlog'. If these are low relative to the connection rate, increase them.
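The backlog check in the steps above can be scripted. This is a small sketch that parses text in the shape of 'ss -tln' output; it assumes the usual column order (State, Recv-Q, Send-Q, Local Address:Port), and the function names and the 80% threshold are illustrative choices, not standards.

```python
def listen_queue_usage(ss_output, port):
    """Return (recv_q, send_q) for the LISTEN socket on `port`, or None."""
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a listening socket.
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # The local address looks like 0.0.0.0:8080, [::]:8080 or *:8080.
        if fields[3].rsplit(":", 1)[-1] == str(port):
            return int(fields[1]), int(fields[2])
    return None

def is_saturated(recv_q, send_q, threshold=0.8):
    """True when the accept queue is at or above `threshold` of the backlog."""
    return send_q > 0 and recv_q / send_q >= threshold

SAMPLE = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  511     512     0.0.0.0:8080        0.0.0.0:*
LISTEN  0       128     [::]:22             [::]:*
"""
print(listen_queue_usage(SAMPLE, 8080))  # (511, 512)
```

In practice the input would come from running 'ss -tln' on the server; the sample text here is made up to mirror the incident described later.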
The specific files, logs, configs, and dashboards that usually own this bug.
- /var/log/syslog or /var/log/messages for kernel conntrack or firewall drop messages
- /proc/net/stat/conntrack for conntrack statistics
- /sys/kernel/debug/tracing/trace_pipe if using ftrace to trace TCP drops (requires root)
- Application server logs for connection accept rate (e.g., nginx 'accept_mutex' or gunicorn worker timeouts)
- Cloud provider security group / firewall logs (e.g., AWS VPC Flow Logs, Azure NSG flow logs)
- Load balancer access logs and backend health check logs (look for 'unhealthy' flapping)
- Kernel message buffer: 'dmesg -T | grep -i timeout' for any TCP-related errors
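When scanning the kernel logs listed above, the conntrack "table full" message is the one worth counting. The sketch below does that over any iterable of log lines; the sample lines and the function name are hypothetical. Because the kernel rate-limits this message, treat the count as a lower bound.

```python
import re

# Kernel message emitted when a new entry cannot be inserted.
TABLE_FULL = re.compile(r"nf_conntrack: table full, dropping packet")

def count_table_full(lines):
    """Count log lines that report a full conntrack table."""
    return sum(1 for line in lines if TABLE_FULL.search(line))

# Hypothetical log excerpt for illustration only.
SAMPLE_LOG = [
    "Mar  4 09:14:58 backend1 kernel: nf_conntrack: table full, dropping packet",
    "Mar  4 09:15:03 backend1 kernel: nf_conntrack: table full, dropping packet",
    "Mar  4 09:15:04 backend1 sshd[812]: Accepted publickey for ops",
]
print(count_table_full(SAMPLE_LOG))  # 2
```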
Practical causes, not theory. These are the things you will actually find.
- Stateful firewall connection tracking table exhausted (conntrack_max too low or conntrack entries not aging fast enough)
- Server TCP listen backlog (somaxconn / tcp_max_syn_backlog) too small for the rate of incoming connections, causing the kernel to drop SYN packets
- Network path has a router or firewall that drops packets based on rate limiting or IP reputation (especially cloud NAT gateways)
- Client-side TCP timeout settings too aggressive (connect_timeout set lower than the round-trip time + server processing time)
- Misconfigured proxy or load balancer that terminates connections prematurely or has a connection pool with stale entries
- Server overloaded: the process is stuck handling previous requests and is not calling accept() fast enough, leading to backlog overflow
- Hardware offload issues: TCP segmentation offload (TSO) or generic receive offload (GRO) causing packet drops on certain NICs
Concrete fix directions. Pick the one that matches your root cause.
- Increase the kernel connection tracking table: 'sysctl -w net.netfilter.nf_conntrack_max=262144' and set a lower timeout for established connections: 'sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=600'
- Raise the TCP listen backlog: 'sysctl -w net.core.somaxconn=4096' and 'sysctl -w net.ipv4.tcp_max_syn_backlog=8192', and update the application config to set its own backlog (e.g., nginx 'listen ... backlog=4096')
- If behind a cloud load balancer, enable 'connection draining' and increase health check intervals to reduce the SYN load from health pings.
- Add explicit firewall rules to allow NEW connections from known source IPs to reduce reliance on conntrack.
- On the client side, set connect_timeout to at least 3x the measured RTT (e.g., 'curl --connect-timeout 30') and implement retries with exponential backoff.
- Disable TCP offload on the NIC: 'ethtool -K eth0 tso off gro off' and test whether the timeouts disappear.
- Switch to connection pooling on the client (e.g., keep-alive connections) to reduce the rate of new TCP handshakes.
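For the client-side fix, a retry loop with exponential backoff might look like the sketch below. The function names, default delays, and the injectable `sleep` parameter are assumptions made for illustration; production code would normally add random jitter so that many clients do not retry in lockstep.

```python
import socket
import time

def backoff_delays(base=0.5, factor=2.0, cap=8.0, attempts=5):
    """Delays between attempts: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]

def connect_with_retry(host, port, connect_timeout=3.0, delays=None, sleep=time.sleep):
    """Try to connect once per entry in `delays`, sleeping between failures."""
    delays = backoff_delays() if delays is None else delays
    last_error = None
    for attempt, delay in enumerate(delays, start=1):
        try:
            return socket.create_connection((host, port), timeout=connect_timeout)
        except OSError as exc:          # includes timeouts and refusals
            last_error = exc
            if attempt < len(delays):   # no point sleeping after the last try
                sleep(delay)
    raise ConnectionError(f"gave up after {len(delays)} attempts") from last_error

print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

Retries only hide a persistent drop, so this belongs alongside a root-cause fix rather than in place of one.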
A fix you cannot prove is a guess. Close the loop.
- After the change, reproduce the original load and confirm that 'tcpdump' shows completed handshakes for all connection attempts.
- Monitor conntrack stats: 'while true; do conntrack -S | grep -E "insert_failed|drop"; sleep 1; done' and verify that the counters stop increasing over a 5-minute peak.
- Check the server's Recv-Q on the listen socket: "ss -tlnp 'sport = :<port>'" and ensure Recv-Q stays near 0 or well below the backlog limit.
- Run a distributed test from multiple clients to simulate real traffic and verify that no timeouts occur for a sustained period.
- Check client application logs for any remaining timeout errors; if using retries, verify that the first attempt no longer fails.
- Use synthetic monitoring (e.g., curl + timestamp) from an external node to confirm the endpoint is consistently reachable within the expected timeout.
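The conntrack counters are cumulative and reported per CPU, so verification means comparing two snapshots. The sketch below sums 'conntrack -S' style output and reports the difference. It assumes the common `key=value` line format; the sample snapshots and function names are made up for illustration.

```python
import re

PAIR = re.compile(r"(\w+)=(\d+)")

def sum_counters(conntrack_s_output):
    """Sum each counter across all per-CPU lines."""
    totals = {}
    for line in conntrack_s_output.splitlines():
        for key, value in PAIR.findall(line):
            if key != "cpu":            # cpu=N is a row label, not a counter
                totals[key] = totals.get(key, 0) + int(value)
    return totals

def new_drops(before, after, keys=("insert_failed", "drop", "early_drop")):
    """Counter increase between two snapshots; all zeros means no new drops."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

BEFORE = """cpu=0 found=0 invalid=3 insert=0 insert_failed=10 drop=4 early_drop=0
cpu=1 found=0 invalid=1 insert=0 insert_failed=5 drop=2 early_drop=0"""
AFTER = """cpu=0 found=0 invalid=3 insert=0 insert_failed=12 drop=4 early_drop=0
cpu=1 found=0 invalid=1 insert=0 insert_failed=5 drop=3 early_drop=0"""

print(new_drops(sum_counters(BEFORE), sum_counters(AFTER)))
# {'insert_failed': 2, 'drop': 1, 'early_drop': 0}
```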
Things that make this bug worse or harder to find.
- Blindly increasing timeouts without addressing the root cause. This only masks the problem and leads to cascading failures under load.
- Disabling conntrack entirely (e.g., iptables -t raw -I PREROUTING -j NOTRACK) without understanding the security implications; it can open the firewall to stateful attack bypass.
- Setting 'net.core.somaxconn' to an extremely high value (e.g., 65535) without also raising the application's backlog; the effective backlog is the minimum of both.
- Assuming the timeout is on the server side when it's actually a client-side network issue; always capture traffic on both ends.
- Using ping to diagnose a connection timeout. ICMP is often prioritized differently and may pass even when TCP SYN is dropped.
- Changing TCP parameters on a production server without first testing in a staging environment that mirrors the traffic pattern.
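The backlog pitfall above comes down to a single `min()`. The helper names in this sketch are hypothetical; the /proc path is the standard location of the somaxconn setting on Linux.

```python
def effective_backlog(requested, somaxconn):
    """The kernel caps the backlog passed to listen() at net.core.somaxconn."""
    return min(requested, somaxconn)

def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read the current kernel cap (Linux only)."""
    with open(path) as fh:
        return int(fh.read().strip())

# An application asking for 4096 on a host still capped at 128 gets 128.
print(effective_backlog(4096, 128))    # 128
# Raising the cap alone does not help if the application still asks for 511.
print(effective_backlog(511, 65535))   # 511
```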
30-Second Hang Then 502: A Conntrack Exhaustion Story
Timeline
- 09:15 PagerDuty alert: 502 Bad Gateway rate spiking from 0.1% to 12% across all frontend instances.
- 09:18 Check NGINX access logs on a random host. Many requests show upstream timed out (110: Connection timed out) while connecting to the Flask backend on port 8080.
- 09:22 Run "ss -tlnp 'sport = :8080'" on the backend host. Recv-Q is 511 for the listen socket, right at the somaxconn limit (512).
- 09:25 Check conntrack: 'conntrack -S' shows insert_failed up by 3452 and drop up by 1290 over the last minute. nf_conntrack_max is 65536, and the current count is 65400.
- 09:27 Check network interfaces: 'ethtool -S eth0' shows no drops on the NIC. tcpdump confirms SYN packets arriving but no SYN-ACK sent while conntrack is full.
- 09:30 Temporary fix: set 'sysctl -w net.netfilter.nf_conntrack_max=262144' and lower timeouts. Conntrack count drops to 45000 within 30 seconds. Error rate drops to 0.5%.
- 09:45 Permanent fix: update conntrack parameters in /etc/sysctl.conf and add a raw-table NOTRACK rule so that traffic from the frontend subnet to the backend is no longer tracked.
- 10:00 Monitor for 1 hour: zero timeouts. Incident resolved.
The Tuesday morning traffic spike hit us like a brick wall. Our frontend instances, which had been humming along at 2000 req/s, suddenly started throwing 502s for 12% of requests. The error message in NGINX logs read 'upstream timed out (110: Connection timed out) while connecting to upstream'. That's a connect timeout, not a read timeout—something between NGINX and Flask was breaking the TCP handshake.
I SSH'd into one of the backend hosts and ran ss -tlnp. The Flask listen socket on port 8080 had a Recv-Q of 511 against a backlog of 512 (set by somaxconn). That looked like a red flag: a full accept queue makes the kernel drop incoming SYNs. But why was the queue full? Flask wasn't overloaded; CPU was at 30%. Then I checked conntrack -S and saw insert_failed skyrocketing. The conntrack table was effectively full (65,400 entries against a maximum of 65,536). Every new connection from NGINX hit a conntrack insert failure, and the kernel dropped the SYN before it even reached the listen socket.
The root cause was a classic: we had recently increased the number of frontend instances without adjusting conntrack_max. Each frontend kept many idle keepalive connections to the backend, and with the new frontends, we exhausted the conntrack table. The temporary fix was to increase conntrack_max and lower the timeout for established connections. The permanent fix was to add a NOTRACK rule for traffic between the frontend and backend subnets, since our security groups already filtered by IP. We also raised somaxconn to 4096 for headroom. Lesson learned: conntrack is a hidden resource that scales with the number of concurrent flows, not just request rate.
Root cause
nf_conntrack table exhausted (insert_failed) because the number of concurrent TCP connections from frontend instances exceeded nf_conntrack_max. This caused the kernel to drop new SYN packets before they reached the application's listen backlog.
The fix
Increased nf_conntrack_max to 262144 and reduced the conntrack timeout for established connections to 600 seconds. Added an iptables rule in the raw table to bypass connection tracking for frontend-to-backend traffic: 'iptables -t raw -A PREROUTING -s <frontend-subnet> -d <backend-subnet> -p tcp --dport 8080 -j NOTRACK'. Also increased somaxconn to 4096.
The lesson
Always monitor conntrack utilization alongside CPU and memory. When scaling out client-side connections, conntrack_max must be scaled proportionally. Also, consider using raw table NOTRACK for trusted internal traffic to avoid stateful tracking overhead.
The TCP three-way handshake has three phases: 1) Client sends SYN, enters SYN_SENT state. 2) Server receives SYN, sends SYN-ACK, enters SYN_RCVD state. 3) Client receives SYN-ACK, sends ACK, connection established. A connect timeout occurs when the client stays in SYN_SENT for its configured timeout without receiving a SYN-ACK. Application timeouts commonly range from 20 to 120 seconds; left alone, the Linux kernel gives up after roughly 127 seconds with the default tcp_syn_retries of 6. The failure can happen at any point: the SYN may be dropped before reaching the server (network/firewall), the server may not respond because its backlog is full (kernel drops SYN), or the SYN-ACK may be dropped on the return path (asymmetric routing, firewall state mismatch).
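The time a client spends in SYN_SENT follows from the SYN retransmission schedule. The sketch below works through that arithmetic, assuming Linux behavior: an initial retransmission timeout of 1 second that doubles after each retry, capped at 120 seconds. The function name is illustrative.

```python
def syn_retransmit_schedule(syn_retries=6, initial_rto=1.0, max_rto=120.0):
    """Return (send_times, give_up_time) in seconds since connect() was called."""
    send_times = [0.0]          # the original SYN
    elapsed, rto = 0.0, initial_rto
    for _ in range(syn_retries):
        elapsed += rto          # wait one RTO, then retransmit
        send_times.append(elapsed)
        rto = min(rto * 2, max_rto)
    # After the last retransmission the kernel waits one more RTO, then fails.
    return send_times, elapsed + rto

times, give_up = syn_retransmit_schedule()
print(times)    # [0.0, 1.0, 3.0, 7.0, 15.0, 31.0, 63.0]
print(give_up)  # 127.0
```

This is why a tcpdump of a silently dropped connection shows SYNs at 1, 3, 7, 15 seconds and so on, and why an application timeout shorter than 1 second never gives the kernel a chance to retransmit.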
To isolate which phase fails, run tcpdump on both sides simultaneously. On the client: 'tcpdump -ni any "host <server> and tcp[tcpflags] & tcp-syn != 0"'. On the server: 'tcpdump -ni any "host <client> and tcp[tcpflags] & tcp-syn != 0"'. This filter matches both SYN and SYN-ACK packets. If the client sees the SYN sent but the server sees no SYN, the network or a firewall is dropping it on the forward path. If the server sees the SYN and sends a SYN-ACK but the client sees no SYN-ACK, the SYN-ACK was dropped on the return path. If the SYN-ACK does show up in the client capture and the connection still times out, the client host discarded it (e.g., due to a conntrack invalid state).
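The two-sided capture reduces to a short decision table. The sketch below encodes it; the function name and the returned labels are hypothetical, and the four booleans are what you read off the client and server captures.

```python
def locate_drop(client_sent_syn, server_saw_syn, server_sent_synack, client_saw_synack):
    """Map observations from both captures to the segment that lost the packet."""
    if not client_sent_syn:
        return "client-egress"   # nothing left the client: local firewall, routing, DNS
    if not server_saw_syn:
        return "forward-path"    # SYN lost between client and server
    if not server_sent_synack:
        return "server-drop"     # server got the SYN but stayed silent:
                                 # conntrack full, backlog full, local firewall
    if not client_saw_synack:
        return "return-path"     # SYN-ACK lost on the way back
    return "client-ingress"      # SYN-ACK arrived but the client host discarded it

print(locate_drop(True, True, False, False))  # server-drop
```

The "server-drop" case matches the conntrack incident above: tcpdump captures incoming packets before netfilter processes them, so the SYN is visible on the server even though it is dropped afterwards.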
Linux connection tracking (conntrack) is the stateful firewall component that tracks every flow passing through netfilter. When a new SYN arrives, conntrack creates an entry in its table. If the table is full (nf_conntrack_max), the kernel silently drops the SYN packet. This manifests as a connect timeout on the client because the server never sends a SYN-ACK. The application's listen backlog might be empty, but the SYN never reaches it. Check conntrack statistics: 'conntrack -S' shows per-CPU counters such as insert_failed, drop, and early_drop. The current entry count comes from 'conntrack -C' or /proc/sys/net/netfilter/nf_conntrack_count, and the limit from 'sysctl net.netfilter.nf_conntrack_max'. Also check 'dmesg' for 'nf_conntrack: table full, dropping packet' messages (rate-limited by default).
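The default nf_conntrack_max depends on the kernel version and the amount of RAM; 65536 and 262144 are common values. Each tracked connection consumes about 350 bytes of kernel memory. With many idle keepalive connections, the table fills quickly even if the actual traffic is low. Mitigations: increase nf_conntrack_max, reduce conntrack timeouts (especially for established, TIME_WAIT, and CLOSE_WAIT entries), or use raw table rules to bypass conntrack for trusted internal subnets. Example raw rule: 'iptables -t raw -A PREROUTING -s 10.0.0.0/8 -d 10.0.0.0/8 -p tcp --dport 8080 -j NOTRACK'. Replies leave through the raw OUTPUT chain, so they usually need a matching rule there (with --sport 8080), and untracked packets no longer match ESTABLISHED-state accept rules, so the filter table must allow them explicitly.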
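The sizing trade-off can be checked with quick arithmetic. This sketch uses the figure of about 350 bytes per entry quoted above; the fleet numbers (40 frontends, 2000 connections each, 1.5x headroom) are hypothetical and only illustrate the calculation.

```python
import math

def conntrack_memory_mib(max_entries, bytes_per_entry=350):
    """Approximate kernel memory used by a completely full table, in MiB."""
    return max_entries * bytes_per_entry / (1024 * 1024)

def required_entries(clients, conns_per_client, headroom=1.5):
    """Entries needed for `clients` hosts each holding `conns_per_client` flows."""
    return math.ceil(clients * conns_per_client * headroom)

print(conntrack_memory_mib(65536))    # 21.875
print(conntrack_memory_mib(262144))   # 87.5
print(required_entries(40, 2000))     # 120000
```

Even the larger table costs under 100 MiB when full, so on most servers memory is not the reason to keep nf_conntrack_max small.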
Stateful firewalls (iptables conntrack, AWS security groups, Azure NSGs) track the state of each connection. They must see the three-way handshake to associate a flow. If a SYN-ACK is sent but the firewall missed the original SYN (e.g., asymmetric routing), it may drop the SYN-ACK as 'invalid'. This causes the client to see a timeout. Similarly, if a security group rule only allows inbound traffic from a specific source, but the return traffic goes through a different path, the firewall may not recognize the response. Enable VPC Flow Logs (AWS) or NSG flow logs (Azure) to see if packets are accepted or rejected.
Another common issue: network address translation (NAT) gateways or proxy servers that have connection limits. For example, an AWS NAT Gateway supports up to 55,000 simultaneous connections to each unique destination. Exceeding this causes new connection attempts to fail. Check the NAT Gateway CloudWatch metrics 'ActiveConnectionCount', 'ErrorPortAllocation', and 'PacketsDropCount'. If you see port allocation errors or drops, either scale horizontally (more NAT gateways) or reduce idle connection timeouts on the clients.
Frequently asked questions
What is the difference between a connection timeout and a read timeout in HTTP?
A connection timeout occurs when the client fails to establish a TCP connection to the server—the three-way handshake didn't complete. A read timeout occurs after the connection is established, when the server takes too long to send a response. Connection timeouts typically produce errors like 'Connection timed out' or 'connect() timed out', while read timeouts produce 'Read timed out' or 'upstream timed out while reading response header'. Debugging focuses on network, firewall, and listen backlog for connection timeouts; on server performance and bandwidth for read timeouts.
Why does increasing connect_timeout sometimes fix the issue temporarily?
Increasing connect_timeout gives the network more time to complete the handshake. If the root cause is transient network congestion or a slow server that eventually accepts the connection, a longer timeout masks the symptom. However, if the root cause is a persistent drop (e.g., firewall always dropping SYNs), increasing the timeout just makes the client hang longer before failing. A proper fix identifies why the handshake fails and addresses that, rather than just extending the wait.
How do I check if conntrack is dropping packets?
Run 'conntrack -S' and look at 'insert_failed' and 'drop' counters. If these are increasing, conntrack is dropping new connections because the table is full. Also check kernel messages: 'dmesg -T | grep conntrack' or look in /var/log/kern.log for 'nf_conntrack: table full, dropping packet'. You can also monitor the current conntrack count: 'cat /proc/sys/net/netfilter/nf_conntrack_count' and compare to 'nf_conntrack_max'.
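Comparing the count to the maximum is easy to automate for alerting. A minimal sketch follows, assuming the standard /proc paths named above; the function names and the 80% warning threshold are illustrative choices.

```python
from pathlib import Path

PROC = Path("/proc/sys/net/netfilter")

def read_int(path):
    """Read a single integer from a /proc file."""
    return int(Path(path).read_text().strip())

def utilization(count, limit):
    """Fraction of the conntrack table currently in use."""
    return count / limit

def check(count, limit, warn_at=0.8):
    """Return a short status string suitable for a monitoring check."""
    used = utilization(count, limit)
    status = "WARN" if used >= warn_at else "OK"
    return f"{status} conntrack {count}/{limit} ({used:.1%})"

# Values from the incident timeline: 65400 entries against a limit of 65536.
print(check(65400, 65536))  # WARN conntrack 65400/65536 (99.8%)
```

On a live host the inputs would be `read_int(PROC / "nf_conntrack_count")` and `read_int(PROC / "nf_conntrack_max")`; both files exist only when the conntrack module is loaded.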
Can a reverse proxy or load balancer cause connection timeouts?
Yes. Load balancers often have connection pools and timeouts. If the pool is full or connections are stale, new connections may be queued or dropped. For example, NGINX's proxy module has a 'proxy_connect_timeout' directive that defaults to 60 seconds. If the backend is slow to accept, NGINX may time out. HAProxy has an equivalent 'timeout connect' setting. Additionally, if the load balancer's own connection table is full (like conntrack), it may drop new connections. Always check the load balancer's logs and statistics for connection errors.
What TCP kernel parameters should I tune for high connection rates?
For high connection rates, increase the TCP listen backlog: 'net.core.somaxconn' (default 128 on kernels before 5.4 and 4096 since; raise to 4096 or more) and 'net.ipv4.tcp_max_syn_backlog' (the default depends on kernel version and available memory, commonly between 128 and 4096; raise to 8192). To handle many concurrent connections, increase 'net.ipv4.tcp_mem' and 'net.ipv4.tcp_rmem' / 'tcp_wmem' if needed. If you see timeouts due to connection tracking, increase 'net.netfilter.nf_conntrack_max' and lower 'net.netfilter.nf_conntrack_tcp_timeout_established'. On the client, 'net.ipv4.tcp_tw_reuse' lets new outgoing connections reuse sockets in TIME_WAIT, which reduces port pressure. Always test changes in staging first.
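A quick audit of these parameters can be scripted. In the sketch below, the minimums are simply the values suggested in this answer, not universal recommendations, and the function names are hypothetical. Sysctl names map to files under /proc/sys by replacing dots with slashes.

```python
from pathlib import Path

# Targets taken from the text above; adjust for your own workload.
SUGGESTED_MINIMUMS = {
    "net.core.somaxconn": 4096,
    "net.ipv4.tcp_max_syn_backlog": 8192,
}

def sysctl_path(name, root="/proc/sys"):
    """Map a sysctl name such as net.core.somaxconn to its /proc/sys file."""
    return Path(root, *name.split("."))

def below_minimum(current, minimums=SUGGESTED_MINIMUMS):
    """Return {name: (current, wanted)} for every setting under its target."""
    return {
        name: (current[name], wanted)
        for name, wanted in minimums.items()
        if name in current and current[name] < wanted
    }

print(sysctl_path("net.core.somaxconn").as_posix())  # /proc/sys/net/core/somaxconn
print(below_minimum({"net.core.somaxconn": 128, "net.ipv4.tcp_max_syn_backlog": 8192}))
# {'net.core.somaxconn': (128, 4096)}
```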