What this usually means
Security groups are stateful firewalls: inbound and outbound rules are separate, but return traffic for an allowed connection is permitted automatically. The most common cause is a missing inbound rule for the desired protocol/port on the destination, or a missing outbound rule on the side that initiates the connection. However, security groups are not the only layer: Network ACLs (NACLs) are stateless and can block traffic that a security group allows. Security group rule changes normally apply within a few seconds, so a short delay after an edit is possible. A less obvious cause is that the security group you edited is not the one attached to the ENI (Elastic Network Interface) carrying the traffic, for example because a configuration update replaced the attachment. Finally, connection tracking means a rule change can affect existing sessions differently from new ones.
The first ten minutes — establish facts before touching code.
1. Run `aws ec2 describe-security-groups --group-ids <sg-id>` to dump the exact rules in JSON and verify the rule you think exists actually exists.
2. Use `aws ec2 describe-network-interfaces --filters Name=group-id,Values=<sg-id>` to confirm the security group is attached to the correct ENI.
3. Perform a telnet or netcat test from the source to the destination IP and port: `nc -zv <dst-ip> <port>`. A timeout usually points to a firewall layer dropping packets; "connection refused" means the packet reached the host but nothing is listening on that port.
4. Check outbound rules on the side that initiates the connection: the source's security group must allow outbound traffic to the destination port. The destination does not need an outbound rule for return traffic, because security groups are stateful.
5. Review the NACL associated with the subnet of both source and destination: NACLs are stateless and can block traffic even if security groups allow it.
6. Look at VPC Flow Logs for the specific source/destination IP and port. Do 'ACCEPT' records appear after the security group change? If you still see 'REJECT' and the security group rules look correct, check the NACL.
7. Check whether the security group has a self-referencing rule (e.g., allow all traffic from the same SG). This is often needed for internal communication.
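The first check above (does the rule you expect actually exist?) can be done mechanically against the JSON dump. The following is a minimal sketch, assuming you have loaded one element of the `SecurityGroups` array from `aws ec2 describe-security-groups`. The function name `sg_allows_inbound` and the sample group are illustrative, and the sketch only handles TCP/UDP with IPv4 CIDR or security-group sources (it ignores IPv6 ranges, prefix lists, and ICMP type/code semantics).

```python
import ipaddress


def sg_allows_inbound(group, protocol, port, source_ip=None, source_sg=None):
    """Return True if any inbound rule in `group` matches the given traffic.

    `group` is one element of the "SecurityGroups" array from
    `aws ec2 describe-security-groups`. Security groups only have allow
    rules, so a single matching rule is enough.
    """
    for rule in group.get("IpPermissions", []):
        proto = rule.get("IpProtocol")
        if proto != "-1":  # "-1" means all protocols and all ports
            if proto != protocol:
                continue
            if not rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535):
                continue
        if source_ip is not None:
            addr = ipaddress.ip_address(source_ip)
            for ip_range in rule.get("IpRanges", []):
                if addr in ipaddress.ip_network(ip_range["CidrIp"]):
                    return True
        if source_sg is not None:
            pairs = rule.get("UserIdGroupPairs", [])
            if any(pair.get("GroupId") == source_sg for pair in pairs):
                return True
    return False


# Illustrative data only; the IDs below are made up.
EXAMPLE_GROUP = {
    "GroupId": "sg-0123456789abcdef0",
    "IpPermissions": [
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
            "UserIdGroupPairs": [{"GroupId": "sg-0aaaaaaaaaaaaaaaa"}],
        }
    ],
}
```

For example, `sg_allows_inbound(EXAMPLE_GROUP, "tcp", 22, source_ip="10.0.1.5")` returns `False` because no rule covers port 22, which is exactly the "rule I thought existed" situation.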
The specific files, logs, configs, and dashboards that usually own this bug.
- AWS Console > EC2 > Security Groups > Inbound/Outbound rules
- AWS CLI: `aws ec2 describe-security-groups`
- AWS Console > VPC > Network ACLs
- VPC Flow Logs in CloudWatch Logs (the log group name is chosen when the flow log is created, e.g. vpc-flow-logs-*)
- CloudTrail events for successful/unsuccessful security group changes (event names: AuthorizeSecurityGroupIngress, RevokeSecurityGroupEgress)
- Source and destination EC2 instance's OS-level firewall (iptables -L -n on Linux, Windows Firewall)
- Instance metadata: `curl http://169.254.169.254/latest/meta-data/security-groups` to list the security group names from inside the instance (with IMDSv2 the request must include a session token header)
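Once you have flow log lines in hand, filtering them for the flow you care about is quicker than reading them by eye. Here is a rough sketch, assuming the flow log uses the default (version 2) space-separated format; custom formats have different fields and would need a different `FIELDS` list. The helper names and the sample records are made up for illustration.

```python
from collections import namedtuple

# Field order of the default (version 2) VPC Flow Logs format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()
FlowRecord = namedtuple("FlowRecord", FIELDS)


def parse_flow_record(line):
    """Split one default-format flow log line into named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return FlowRecord(*parts)


def rejected(lines, dstaddr, dstport):
    """Return the REJECT records for one destination IP and port."""
    matches = []
    for line in lines:
        record = parse_flow_record(line)
        if (
            record.action == "REJECT"
            and record.dstaddr == dstaddr
            and record.dstport == str(dstport)
        ):
            matches.append(record)
    return matches


# Illustrative records only; account ID, ENI ID, and addresses are made up.
SAMPLE_LINES = [
    "2 123456789012 eni-0aaa1111bbb22222c 10.0.1.25 10.0.2.40 49152 443 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0aaa1111bbb22222c 10.0.1.25 10.0.2.40 49153 22 6 5 300 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0aaa1111bbb22222c - - - - - - - 1700000000 1700000060 - NODATA",
]
```

The `interface_id` field of each match tells you which ENI saw the rejected packets, which is useful when the instance has more than one.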
Practical causes, not theory. These are the things you will actually find.
- Inbound rule missing for the specific port/protocol (most common).
- Outbound rule missing on the initiating side: the source's security group must allow outbound traffic to the destination port. Return traffic does not need an ephemeral-port rule in a security group; that requirement applies to NACLs.
- Security group attached to the wrong ENI (e.g., after an instance replacement).
- Network ACL (NACL) blocking traffic due to stateless rules (allow inbound but deny outbound return traffic).
- Security group rule change not yet applied (rare, and normally resolved within seconds).
- Self-referencing security group rule missing. Membership in the same SG does not allow traffic by itself; instances need a rule whose source is that SG.
- Connection tracking effects: a rule was changed while a connection was active. Tracked connections may keep working until they end or time out, while untracked connections are cut as soon as no rule allows them.
Concrete fix directions. Pick the one that matches your root cause.
- Add the missing inbound or outbound rule for the required protocol and port (e.g., TCP/3306 for MySQL).
- If the initiating side's outbound rules are restricted, add an outbound rule for the destination protocol and port. Ephemeral-port rules (1024-65535) for return traffic belong in NACLs, not security groups.
- Attach the correct security group to the ENI: `aws ec2 modify-network-interface-attribute --network-interface-id <eni-id> --groups <sg-id>`. Note that `--groups` replaces the ENI's whole list of security groups, so include every group it should keep.
- Update NACL rules: add inbound and outbound allow rules for the required traffic. Remember NACLs are stateless, so both directions must be explicit.
- Wait a few seconds and retest. Rule changes normally apply quickly; if the block persists after about a minute, propagation is unlikely to be the cause.
- Add a self-referencing inbound rule (source = the same security group) to allow traffic between instances in the same SG.
- Restart the application or close existing connections to clear stale state, or use a connection pool that handles retries.
A fix you cannot prove is a guess. Close the loop.
- Run telnet/nc test from source to destination: should connect successfully.
- Check VPC Flow Logs for 'ACCEPT' records for the traffic flow (both directions).
- Verify application health checks pass (e.g., ALB target group shows healthy).
- Use `ss -tlnp` on the destination instance to confirm the service is listening on the expected port.
- Temporarily attach a 'wide-open' security group (allow all traffic) to the destination to isolate the issue to security group rules.
- Monitor CloudWatch metrics for the service (e.g., ALB request count, target response time) to confirm traffic flows.
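The telnet/nc check above can also be scripted so that it distinguishes the two failure modes that matter here. This is a small sketch using only the standard library; the function name `probe_tcp` and the returned labels are my own choices, not part of any AWS tooling.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the result.

    "open"    - the handshake completed
    "refused" - the host answered with a reset (nothing listening)
    "timeout" - no answer at all, typical of packets being dropped
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError as exc:  # e.g. no route to host, name resolution failure
        return f"error: {exc}"
```

A "timeout" result is consistent with a security group or NACL silently dropping packets, whereas "refused" suggests the packet got through and the service is not listening, so the two results send you to different parts of this checklist.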
Things that make this bug worse or harder to find.
- Only checking inbound rules on the destination and forgetting outbound rules on the side that initiates the connection.
- Confusing security group with NACL: security groups are stateful, NACLs are not. Don't apply NACL logic (such as ephemeral-port rules for return traffic) to SG rules.
- Retesting in the same second as a rule change. Changes usually apply within seconds; wait briefly before concluding the fix failed.
- Modifying security group rules while there are active connections: untracked connections can drop immediately, while tracked ones may linger and mask the change.
- Using ICMP ping to test connectivity when the security group doesn't allow ICMP. Use TCP-based tests on the actual application port.
- Overlooking that the security group is attached to the instance's primary ENI but the traffic uses a secondary ENI (e.g., for multi-homed instances).
The Phantom Block: How a Stale ENI Caused a 45-Minute Outage
Timeline
- 14:00 Deployment triggered via CodeDeploy to replace an ALB target group's EC2 instances.
- 14:02 New instances launched and registered with ALB; health checks start failing with 503.
- 14:05 Engineer checks security group inbound rules: port 443 allowed from ALB security group. Looks correct.
- 14:10 Engineer telnets to the instance on port 443 from a jump box: connection refused.
- 14:15 Checked instance-level firewall (iptables): all ACCEPT. No blocks.
- 14:20 Examined NACLs: both inbound and outbound allow all traffic (0.0.0.0/0).
- 14:25 VPC Flow Logs show 'REJECT' for the traffic from ALB to instance on port 443. Source is ALB's private IP.
- 14:30 Discovered that the new instances were launched with a different launch template that used a secondary ENI with a stale security group.
- 14:35 Modified the secondary ENI's security group to the correct one. Health checks passed immediately.
- 14:45 Confirmed all traffic flows; deployment completed.
We were doing a routine blue/green deployment. The new instances came up, registered with the ALB, but health checks immediately failed. I checked security group inbound rules — the ALB security group was listed as a source for port 443. So that should work. I tried telnet from a jump box in the same VPC — connection refused. That ruled out ALB-specific issues.
I spent 20 minutes chasing false leads: iptables was empty, NACLs were wide open. I pulled VPC Flow Logs and saw 'REJECT' records for the traffic from the ALB's private IP to the instance on port 443. The reject was coming from the security group layer, not NACL. But the inbound rules looked correct. Something was off.
Then I noticed the instance had two ENIs. The primary ENI had the correct security group, but the secondary ENI — used for the application traffic — had a security group that did not include the ALB's security group. The launch template had been updated to attach a secondary ENI, but the security group for that ENI was outdated. One modification later, health checks passed. The lesson: always verify which ENI the traffic actually traverses, not just the instance's security groups.
Root cause
Secondary ENI on the new EC2 instances had an incorrect security group that did not allow inbound traffic from the ALB.
The fix
Modified the security group attached to the secondary ENI to include the ALB security group as a source for port 443.
The lesson
When using multiple ENIs, each ENI has its own security group. Traffic flows through the ENI based on routing; always verify the security group on the ENI that handles the application traffic.
Security groups are stateful: if you allow inbound TCP traffic, the outbound return traffic is automatically allowed, regardless of outbound rules. This works because security groups use connection tracking to associate return packets with an existing flow. Changing a rule while connections are active affects them differently depending on whether they are tracked: tracked connections generally continue until they close or their tracking entry times out, while untracked connections are interrupted as soon as no rule allows them. This is why behavior right after a rule change can look inconsistent between existing and new sessions.
Security group rules are not a priority list. Security groups only have allow rules; everything else is implicitly denied. Rules are evaluated as a set: if any rule matches the traffic, it's allowed, and there is no order-based precedence. This means you cannot have a 'deny' rule; you can only selectively allow. If you need to block specific traffic, you must use NACLs.
Network ACLs are stateless and operate at the subnet level. They have explicit allow and deny rules, evaluated in order (lowest number first). A common pitfall is that even if a security group allows traffic, a NACL can block it. For example, if your NACL inbound allows TCP/443 from 0.0.0.0/0 but the outbound NACL blocks return traffic (ephemeral ports), the connection will fail.
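To make the ordered, stateless evaluation concrete, here is a toy model of it. It is a sketch under simplifying assumptions: the entry dictionaries use a made-up shape (not the AWS API response format), only a single port range and IPv4 CIDR are matched, and `nacl_decision` is a hypothetical helper name.

```python
import ipaddress


def nacl_decision(entries, protocol, port, remote_ip):
    """Evaluate NACL-style entries for one packet in one direction.

    Entries are checked in ascending rule number and the first match
    decides. If nothing matches, the implicit final rule denies.
    """
    addr = ipaddress.ip_address(remote_ip)
    for entry in sorted(entries, key=lambda e: e["rule_number"]):
        if entry["protocol"] not in ("-1", protocol):
            continue
        low, high = entry["port_range"]
        if not low <= port <= high:
            continue
        if addr not in ipaddress.ip_network(entry["cidr"]):
            continue
        return entry["action"]
    return "deny"


# Inbound allows HTTPS from anywhere.
INBOUND = [
    {"rule_number": 100, "protocol": "tcp", "port_range": (443, 443),
     "cidr": "0.0.0.0/0", "action": "allow"},
]
# Outbound only allows port 443, so replies to client ephemeral ports are denied.
OUTBOUND_BROKEN = [
    {"rule_number": 100, "protocol": "tcp", "port_range": (443, 443),
     "cidr": "0.0.0.0/0", "action": "allow"},
]
# Adding the ephemeral range lets the reply leave the subnet.
OUTBOUND_FIXED = OUTBOUND_BROKEN + [
    {"rule_number": 110, "protocol": "tcp", "port_range": (1024, 65535),
     "cidr": "0.0.0.0/0", "action": "allow"},
]
```

With these lists, the request to port 443 is allowed inbound, but the reply to a client port such as 50000 is denied by `OUTBOUND_BROKEN` and allowed by `OUTBOUND_FIXED`. Each direction is evaluated on its own because no connection state is kept.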
When debugging blocked traffic, always check both SG and NACL. VPC Flow Logs are invaluable here: they record 'ACCEPT' or 'REJECT' for each flow, but they do not say which layer (SG or NACL) made the decision. The 'log-status' field ('OK', 'NODATA', 'SKIPDATA') describes the logging itself, not the firewall verdict. If the log shows 'REJECT' and the security group appears correct, the NACL is likely the culprit. Because NACLs are stateless, an 'ACCEPT' for the request paired with a 'REJECT' for the response in the opposite direction also points to the NACL.
AWS security groups keep connection tracking state for the flows they allow. When a packet arrives, the SG checks the tracking state first; if a matching flow exists, the packet is allowed regardless of rules. If no flow exists, the SG evaluates the rules. This means that a newly added rule takes effect for new connections almost immediately, but after you remove a rule, tracked flows continue until they close or their tracking entry times out. The idle timeout depends on protocol and state; for established TCP connections the documented default is several days, so do not count on it expiring quickly.
There is no API to flush security group connection tracking. Tools like `conntrack -D` on Linux (requires privileged access) only clear the instance's own netfilter table, not the state AWS keeps for the security group. In practice, if you need to block existing connections quickly, use a NACL deny rule, because NACLs are stateless and do not track connections, or close the connections at the application or OS level.
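The "tracking table first, rules second" order described above can be illustrated with a toy stateful filter. This is a conceptual model only, not how AWS implements security groups, and it deliberately ignores untracked connections and timeouts. The class `StatefulFilter` and its methods are invented for this sketch.

```python
class StatefulFilter:
    """Toy stateful firewall: tracked flows bypass rule evaluation."""

    def __init__(self, allowed_ports):
        self.allowed_ports = set(allowed_ports)  # allow rules only
        self.tracked = set()                     # active flows

    def inbound(self, src, src_port, dst_port):
        flow = (src, src_port, dst_port)
        if flow in self.tracked:
            return "allow"                       # existing flow: rules not consulted
        if dst_port in self.allowed_ports:
            self.tracked.add(flow)               # new flow that matches a rule
            return "allow"
        return "drop"                            # implicit deny

    def revoke(self, dst_port):
        """Remove an allow rule; tracked flows are left untouched."""
        self.allowed_ports.discard(dst_port)

    def close(self, src, src_port, dst_port):
        """Forget a flow, as happens when it ends or times out."""
        self.tracked.discard((src, src_port, dst_port))
```

In this model, a flow opened to port 443 keeps being allowed after `revoke(443)`, while a new flow from a different source port is dropped. That matches the symptom of old sessions working while fresh connection attempts fail.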
Each Elastic Network Interface (ENI) can have up to 5 security groups attached by default (the quota is adjustable). When an EC2 instance has multiple ENIs, inbound traffic arrives on the ENI that owns the destination IP, and outbound traffic leaves through whichever ENI the instance's routing table selects. The security groups of the ENI that handles the packet are the ones evaluated. It's common to attach a secondary ENI for management traffic and a primary for application traffic. If you only update security groups on the primary ENI, but the application traffic uses the secondary, you'll see blocked traffic.
To verify which ENI a packet uses, check the instance's routing table (`ip route show`) for outbound traffic and the VPC route tables for the path between subnets. For inbound traffic, map the destination IP to the ENI and subnet that own it. If the instance has multiple ENIs in different subnets, clients will reach whichever ENI owns the address they connect to. Always attach the correct security group to every ENI that handles traffic.
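Mapping a destination IP to its ENI and security groups can be done directly from the CLI output. The sketch below assumes you pass in the `NetworkInterfaces` array from `aws ec2 describe-network-interfaces`; the helper name `groups_for_ip` and the sample interfaces are illustrative, not real resources.

```python
def groups_for_ip(network_interfaces, private_ip):
    """Find the ENI that owns `private_ip` and return its security group IDs.

    `network_interfaces` is the "NetworkInterfaces" array from
    `aws ec2 describe-network-interfaces`. Returns (eni_id, [group_ids]),
    or (None, []) if no interface owns the address.
    """
    for eni in network_interfaces:
        owned = [p["PrivateIpAddress"] for p in eni.get("PrivateIpAddresses", [])]
        if private_ip in owned:
            group_ids = [g["GroupId"] for g in eni.get("Groups", [])]
            return eni["NetworkInterfaceId"], group_ids
    return None, []


# Illustrative data only; all IDs and addresses are made up.
EXAMPLE_ENIS = [
    {
        "NetworkInterfaceId": "eni-0primary000000000",
        "PrivateIpAddresses": [{"PrivateIpAddress": "10.0.1.10"}],
        "Groups": [{"GroupId": "sg-0correct0000000000"}],
    },
    {
        "NetworkInterfaceId": "eni-0secondary0000000",
        "PrivateIpAddresses": [{"PrivateIpAddress": "10.0.2.10"}],
        "Groups": [{"GroupId": "sg-0stale000000000000"}],
    },
]
```

If the load balancer targets `10.0.2.10`, the lookup returns the secondary ENI and its stale group, even though the primary ENI carries the group you expected to apply.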
Frequently asked questions
Why can I ping an instance within the same security group but not SSH?
Security groups evaluate each protocol and port independently. ICMP (ping) may be allowed by a rule that permits ICMP, while TCP/22 (SSH) is not covered by any rule. Being in the same security group does not by itself allow traffic between instances; only a self-referencing rule does, and that rule may cover ICMP but not TCP/22. Check the inbound rules for both protocols. The most common cause is simply that the inbound rule for SSH is missing.
How long does it take for security group changes to propagate?
Changes normally apply within a few seconds across all associated resources. If a rule still appears not to work after about a minute, look for another cause rather than waiting longer. AWS does not document a way to force a refresh, so retest after a short pause, and design automation and applications to tolerate brief propagation delays.
Can I use a security group to block a specific IP address?
Security groups only support allow rules. You cannot create a deny rule. To block specific IPs, use a Network ACL (NACL) with a deny rule for that IP. Alternatively, leave that IP out of every allow rule, which blocks all of its traffic; this only works if no broader rule (such as 0.0.0.0/0) already covers it. If you need to block only certain traffic from an IP while allowing other traffic from a wider range that includes it, you must use NACLs.
Why does my ALB health check fail even though I allowed the ALB security group in the target instance's inbound rules?
First, verify that the ALB security group ID is correctly referenced as the source, not the ALB's CIDR. Also check that the ALB's own security group allows outbound traffic to the targets on the health check port. The target does not need an outbound ephemeral-port rule for the response, because security groups are stateful; if return traffic is being blocked, look at the subnet NACLs instead. If the instance has multiple ENIs, check that the traffic actually flows through the ENI with the correct security group. Additionally, confirm that the target instance's application is listening on the health check port (e.g., 80 or 443). Use `netstat -tlnp` to verify.
What's the difference between a security group and a network ACL?
Security groups are stateful firewalls attached to ENIs (instances). They support only allow rules, and rules are evaluated as a whole (no priority). Return traffic is automatically allowed. Network ACLs are stateless firewalls attached to subnets. They support both allow and deny rules, evaluated in order (lowest number first). Return traffic must be explicitly allowed. Use security groups for instance-level protection and NACLs for subnet-level additional layers.