AWS API Gateway 502 Bad Gateway Debugging Guide

What this usually means

A 502 from API Gateway means the gateway received your request but did not get a valid response from the integration endpoint. It is not a client-side error; it is a connectivity or response-format failure between API Gateway and your backend. Common causes include network failures or timeouts (especially for VPC-linked backends), Lambda invocation failures (permissions or runtime errors), backend responses that API Gateway cannot parse (e.g., a malformed Lambda proxy response), and a backend server that crashes under load. The tricky part is that direct tests against the backend often succeed because they bypass API Gateway's timeout, network path, and response-format constraints.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

1. Check the API Gateway execution logs in CloudWatch: aws logs filter-log-events --log-group-name 'API-Gateway-Execution-Logs_<api-id>/<stage>' --filter-pattern '502'
2. Test the backend endpoint directly with curl or telnet, from a host that is subject to the same VPC and security group rules as the gateway's traffic
3. Enable detailed CloudWatch metrics, check 4XXError and 5XXError, then drill into the IntegrationLatency metric
4. Enable AWS X-Ray tracing on the API Gateway stage, if it is not already on, to see where the request fails
5. Inspect the integration request/response mapping templates for syntax errors or missing transformations
6. Check the integration timeout setting (29 seconds by default for REST APIs, 30 seconds maximum for HTTP APIs). A gateway-side timeout is normally reported as 504 'Endpoint request timed out'; a Lambda function that hits its own timeout first is what typically surfaces as a 502 on a proxy integration

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- CloudWatch Logs: API-Gateway-Execution-Logs_<api-id>/<stage>; look for 'Execution failed due to integration error'
- CloudWatch Metrics: IntegrationLatency and 5XXError for the API Gateway stage
- API Gateway console: Resources → Method Execution → Integration Request/Response; verify the endpoint URL and mapping templates
- Lambda CloudWatch Logs (if the backend is Lambda): /aws/lambda/<function-name>; check for invocation errors or timeouts
- VPC Flow Logs (if using VPC Link): confirm traffic from API Gateway's VPC endpoint reaches the NLB/ALB
- Network Load Balancer access logs (if using NLB with VPC Link): check for 5xx errors from targets
- API Gateway stage variables and parameter mappings: ensure variable substitution produces the correct endpoints
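
A quick way to check stage variable substitution offline is to resolve the integration URI yourself and flag anything left unresolved. The sketch below assumes the ${stageVariables.<name>} reference syntax; the function name, host, and variable names are hypothetical.

```python
import re

# Matches stage variable references such as ${stageVariables.backendHost}.
STAGE_VAR = re.compile(r"\$\{stageVariables\.([A-Za-z0-9_]+)\}")


def resolve_uri(template: str, stage_variables: dict[str, str]) -> tuple[str, list[str]]:
    """Substitute stage variables into an integration URI template.

    Returns the resolved URI plus the names of variables that were
    referenced in the template but not defined on the stage.
    """
    missing: list[str] = []

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in stage_variables:
            missing.append(name)
            return match.group(0)  # leave the placeholder visible
        return stage_variables[name]

    return STAGE_VAR.sub(replace, template), missing


# Hypothetical template and stage variables for illustration.
uri, missing = resolve_uri(
    "http://${stageVariables.backendHost}/orders",
    {"backendHost": "internal-nlb.example.com"},
)
print(uri, missing)
```

A non-empty list of missing names means the stage would send requests to a URI that still contains a literal placeholder.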

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- VPC Link target group health checks are failing, or the NLB is in different subnets than API Gateway's VPC endpoint
- Lambda function permissions: API Gateway is not allowed to invoke the target function (lambda:InvokeFunction is missing from the function's resource-based policy or from the integration's execution role)
- The backend HTTP/S endpoint returns a non-2xx/3xx status code and API Gateway's integration response is not configured to handle it
- Integration request/response mapping templates produce invalid output (e.g., Lambda returning a string when the API expects JSON)
- Slow backend: a Lambda function that hits its own timeout produces a 502 on a proxy integration; if the backend outlasts the API Gateway integration timeout instead (29 seconds by default for REST APIs, 30 seconds for HTTP APIs), the gateway normally answers 504
- Security group rules block traffic from API Gateway's private IP range (for VPC Link) or from the VPC endpoint
- The backend server returns chunked transfer encoding or other headers that API Gateway cannot parse correctly

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- If the integration timeout was set below its limit, raise it (29000 ms by default for REST APIs, 30000 ms for HTTP APIs) to accommodate slower backends, and keep the Lambda timeout below it
- Add a custom Lambda integration response mapping for error status codes (e.g., map 500 to a formatted error message) so errors are returned deliberately instead of as a 502
- For VPC Link: ensure the NLB's target group health check passes and the NLB's security group allows inbound traffic from the VPC endpoint
- Fix Lambda function permissions: add a resource-based policy statement that allows the apigateway.amazonaws.com principal to invoke the function (for example with aws lambda add-permission). AWSLambdaBasicExecutionRole only lets the function write its own logs; it does not grant API Gateway the right to invoke it
- Validate and fix integration response mapping templates; a missing comma or an incorrect JSONPath often causes empty responses
- Implement retry logic with exponential backoff to handle transient failures that cause 502 under load
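
The retry pattern in the last item can be sketched as follows. This is a minimal illustration, not a drop-in client: send is a hypothetical callable that performs one request and returns the HTTP status code, the default delays are arbitrary, and retries are only safe for idempotent requests.

```python
import random
import time
from typing import Callable


def backoff_delays(attempts: int, base: float = 0.2, cap: float = 5.0) -> list[float]:
    """Upper bound for each wait: base * 2**n, capped at `cap` seconds."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def call_with_retry(
    send: Callable[[], int],
    attempts: int = 4,
    retry_on: frozenset[int] = frozenset({502, 503, 504}),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    status = send()
    for bound in backoff_delays(attempts - 1):
        if status not in retry_on:
            break
        # Full jitter: wait a random time up to the exponential bound.
        sleep(random.uniform(0, bound))
        status = send()
    return status
```

Passing sleep as a parameter keeps the function testable without real waiting; in production the default time.sleep applies.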

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- Send a test request via curl with verbose headers: curl -v -X POST -H 'Content-Type: application/json' 'https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/<resource>' -d '{}'
- Confirm the CloudWatch 5XXError metric drops to zero after deployment
- Check the API Gateway execution logs for 'Successfully completed execution' instead of 'Execution failed'
- Verify that the integration response status code is the expected one (e.g., 200), not 502
- Use the API Gateway test feature in the console to invoke the method and see the detailed response
- Run a load test with artillery or k6 to confirm the fix holds under concurrent requests

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Assuming the backend is fine because it works when tested from a different client (e.g., local curl vs. API Gateway's specific environment)
- Forgetting to update stage variables after changing backend endpoints or Lambda versions
- Not checking API Gateway's permission to invoke the function; it is easy to copy a role or policy that lacks lambda:InvokeFunction for the specific function
- Ignoring the timeout relationship: configure the API Gateway integration timeout (29 seconds by default) and the Lambda timeout consistently, with the Lambda timeout the shorter of the two
- Overlooking VPC endpoint configuration: the VPC endpoint for API Gateway must be present in the same VPC as the NLB
- Disabling CloudWatch logs for cost reasons; this makes debugging a 502 nearly impossible

( 07 ) War story

Midnight Pager: API Gateway 502 After Security Group Update

Senior Backend Engineer
AWS API Gateway (REST) → VPC Link → NLB → ECS Fargate (Node.js)

Timeline

00:15 PagerDuty alert: production API returning 502 for 80% of requests
00:20 Checked CloudWatch metrics: 5XXError spiked from 0 to 150 per minute; IntegrationLatency around 28s
00:25 Saw that a security group change was applied at 00:10 to the NLB's target group (ECS tasks)
00:30 Verified that direct curl to an ECS task from a jump box succeeded (bypassed NLB)
00:35 Checked NLB target health: all targets marked unhealthy; described the target group security group and found the new rule only allowed traffic from a different CIDR
00:40 Reverted security group rule to allow inbound from the VPC endpoint's subnet CIDR; health checks passed
00:45 API recovered; 5XXError dropped to zero

I was on-call when PagerDuty lit up at 12:15 AM. Our main API, which handles order processing, was returning 502 for most requests. The alert was triggered by CloudWatch alarm on 5XXError. My first instinct was to check the API Gateway logs and see the error message: 'Execution failed due to integration error'. The integration type was VPC Link to a Network Load Balancer, which forwards to ECS Fargate tasks running Node.js.

I quickly opened CloudWatch metrics and saw that IntegrationLatency was pegged at around 28 seconds, close to the 29-second timeout. That told me the backend was not responding in time. But a direct curl to an ECS task worked fine. So the problem was between the NLB and the tasks. I checked the NLB target group health – all targets unhealthy. Then I looked at the recent CloudTrail events and saw a security group modification at 00:10. Someone had accidentally updated the target group's security group to only allow traffic from a different VPC CIDR, not the one used by the VPC endpoint.

I reverted the security group rule to allow inbound from the VPC endpoint's subnet CIDR. Within a minute, health checks passed, the NLB marked targets healthy, and API requests started succeeding. The lesson: always document and review security group changes, and use automated checks to validate NLB target health after any network change. A simple mistake caused 30 minutes of downtime.

Root cause

Security group on NLB target group was updated to deny traffic from the VPC endpoint's subnet, causing health checks to fail and all targets to become unhealthy, resulting in 502.

The fix

Reverted security group to allow inbound TCP/80 from the VPC endpoint's subnet CIDR (10.0.0.0/16).

The lesson

Any change to security groups or NACLs affecting the path between API Gateway and backend should trigger automated health check validation before rolling out.

( 08 ) Understanding API Gateway Timeouts and Retries

API Gateway caps how long it waits for an integration: 29 seconds by default for REST APIs, regardless of integration type, and 30 seconds for HTTP APIs. This is the total time from when the gateway sends the request to when it receives the complete response. If your backend takes longer, the execution log shows 'Endpoint request timed out' and the gateway normally returns 504 rather than 502. A 502 appears when the backend fails first, for example a Lambda function that hits its own shorter timeout on a proxy integration. The tricky part is that the timeout counter starts when the gateway sends the request, not when the backend starts processing, so network latency counts.

To diagnose, enable detailed CloudWatch metrics and look at IntegrationLatency. If it is consistently close to 29-30 seconds, you need to either optimize the backend or raise the integration timeout if it was set below the limit (timeoutInMillis on the integration; going beyond 29 seconds on a REST API requires a quota increase, which is available for Regional and private APIs). Also note that API Gateway does not automatically retry on 502; you need to implement retries in the client.
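
The latency-based reasoning above can be written down as a rough triage rule. This is only a sketch: the 90% and one-second thresholds are assumptions chosen for illustration, and timeout_ms should be set to whatever your integration actually uses.

```python
def classify_failure(status: int, integration_latency_ms: float, timeout_ms: int = 29_000) -> str:
    """Rough triage of a failed request from its status and IntegrationLatency."""
    if status < 500:
        return "not a gateway-side failure"
    # Latency near the integration timeout points at a slow or unreachable backend.
    if integration_latency_ms >= 0.9 * timeout_ms:
        return "likely timeout"
    # A fast failure means the backend (or the network) answered quickly but badly.
    if integration_latency_ms < 1_000:
        return "likely fast failure"
    return "inconclusive"


print(classify_failure(502, 28_000))
print(classify_failure(502, 120))
```

Anything classified as inconclusive still needs the execution logs; the metric alone cannot separate a slow backend from an intermittent network fault.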

( 09 ) VPC Link and Network Connectivity Issues

When using VPC Link, API Gateway connects to a Network Load Balancer inside your VPC. The NLB must be in the same VPC as the VPC Link endpoint, and the security group for the NLB's target group must allow inbound traffic from the API Gateway's VPC endpoint. A common mistake is to allow the NLB's subnet CIDR but forget that the traffic originates from the VPC endpoint's elastic network interface, which has its own IP.

Check VPC Flow Logs for traffic from the VPC endpoint's ENI to the NLB. If you see 'REJECT' records, it's a security group or NACL issue. Also verify the NLB's target group health check: it should point to a health endpoint that returns 200. If health checks fail, the NLB will not route traffic, leading to 502.
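
To look for REJECT records without scrolling through raw logs, a small parser is usually enough. The sketch below assumes the default (version 2) flow log format with its 14 space-separated fields; custom formats need a different field order. The ENI ID, account ID, and addresses in the sample are made up.

```python
from typing import Iterable, NamedTuple, Optional


class FlowRecord(NamedTuple):
    interface_id: str
    srcaddr: str
    dstaddr: str
    dstport: int
    action: str


def parse_record(line: str) -> Optional[FlowRecord]:
    """Parse one record in the default (version 2) flow log format."""
    parts = line.split()
    # Default order: version account-id interface-id srcaddr dstaddr srcport
    # dstport protocol packets bytes start end action log-status
    if len(parts) != 14 or parts[0] != "2" or not parts[6].isdigit():
        return None  # header line, custom format, or NODATA/SKIPDATA record
    return FlowRecord(parts[2], parts[3], parts[4], int(parts[6]), parts[12])


def rejected(lines: Iterable[str], dstport: int) -> list[FlowRecord]:
    """Return REJECT records aimed at the given backend port."""
    records = (parse_record(line) for line in lines)
    return [r for r in records if r and r.action == "REJECT" and r.dstport == dstport]


# Hypothetical sample record for illustration.
sample = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.25 10.0.2.40 49152 80 6 3 180 1700000000 1700000060 REJECT OK",
]
for record in rejected(sample, dstport=80):
    print(record.srcaddr, "->", record.dstaddr, record.action)
```

Matching the source address of the rejected records against the addresses your security group rules allow shows which rule is missing.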

( 10 ) Integration Response Mapping and Malformed Backend Responses

API Gateway expects the backend to return a valid HTTP response with a status line, headers, and body. If your backend returns something that cannot be parsed (e.g., a truncated response, or a Lambda function that returns a bare string instead of the expected object), API Gateway may fail and return 502. This is especially common with Lambda proxy integrations, where the function must return a specific JSON structure: an object with a statusCode, optional headers, and a body that is a string.

For non-proxy Lambda integrations, the integration response mapping template must produce a valid HTTP response. An empty mapping template or one that outputs invalid JSON (like a trailing comma) will cause a 502. Enable CloudWatch logging at 'INFO' level, with full request/response data logging turned on, to see the raw response from the backend and the mapping output.
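
For Lambda proxy integrations, the response shape can be checked before deployment. The following is a conservative sketch of such a check; the function name is hypothetical, and API Gateway may tolerate some cases this flags, so treat the output as hints rather than a definitive validator.

```python
import json


def proxy_response_problems(response: object) -> list[str]:
    """List likely problems with a Lambda proxy integration response."""
    if not isinstance(response, dict):
        return ["response must be an object, got " + type(response).__name__]
    problems = []
    status = response.get("statusCode")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(status, bool) or not isinstance(status, int):
        problems.append("statusCode must be an integer")
    body = response.get("body")
    if body is not None and not isinstance(body, str):
        problems.append("body must be a string (serialize objects with json.dumps)")
    if not isinstance(response.get("headers", {}), dict):
        problems.append("headers must be an object")
    if not isinstance(response.get("isBase64Encoded", False), bool):
        problems.append("isBase64Encoded must be a boolean")
    return problems


good = {"statusCode": 200, "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"ok": True})}
bad = {"statusCode": 200, "body": {"ok": True}}  # body left as a dict
print(proxy_response_problems(good))
print(proxy_response_problems(bad))
```

Running the handler locally and passing its return value through a check like this catches the most common mistake, returning a dict as the body, before it turns into a 502.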

( 11 ) Lambda Permissions and Resource Policies

If your backend is a Lambda function, API Gateway needs lambda:InvokeFunction permission on that specific function, granted either through the function's resource-based policy or through an IAM execution role configured on the integration. Without it, the invocation is rejected before your code runs and the request fails with a 5xx error. When the function relies on a resource-based policy, the policy must explicitly allow the API Gateway's source ARN. A missing resource policy is a common cause when using cross-account or cross-region invocations.

To verify, use AWS CLI: aws lambda get-policy --function-name my-function and check the policy statement. For the execution role, check IAM: aws iam simulate-principal-policy --policy-source-arn <role-arn> --action-names lambda:InvokeFunction --resource-arns <function-arn>.
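
Reading the get-policy output by eye is error-prone because the policy document is a JSON string nested inside JSON. The sketch below parses that output and looks for a statement that lets API Gateway invoke the function. It assumes the usual statement shape produced by aws lambda add-permission and uses a simple prefix comparison on the source ARN, which does not handle every wildcard pattern; the function name and ARNs are hypothetical.

```python
import json


def allows_api_gateway(get_policy_output: str, api_arn_prefix: str) -> bool:
    """Check `aws lambda get-policy` output for an API Gateway invoke grant."""
    # The CLI returns the policy document as a JSON string inside JSON.
    policy = json.loads(json.loads(get_policy_output)["Policy"])
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal")
        service = principal.get("Service") if isinstance(principal, dict) else None
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if stmt.get("Effect") != "Allow" or service != "apigateway.amazonaws.com":
            continue
        if "lambda:InvokeFunction" not in actions:
            continue
        arn_like = stmt.get("Condition", {}).get("ArnLike", {})
        sources = [v for k, v in arn_like.items() if k.lower() == "aws:sourcearn"]
        # No source condition means any API may invoke; otherwise compare prefixes.
        if not sources or any(isinstance(s, str) and s.startswith(api_arn_prefix) for s in sources):
            return True
    return False
```

A typical call would pass the raw CLI output and a prefix such as arn:aws:execute-api:<region>:<account-id>:<api-id>; a False result means the statement is missing or scoped to a different API.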

Frequently asked questions

Why does my API Gateway return 502 even though my Lambda function works fine when tested in the console?

The Lambda console test invokes the function directly with a mock event, bypassing API Gateway's integration. A 502 from API Gateway often means the integration response mapping is misconfigured, or the function's response format doesn't match what API Gateway expects (e.g., wrong JSON structure for proxy integrations). Also check the invoke permissions and timeout settings: API Gateway's integration timeout is 29 seconds by default, while a console test is bounded only by the function's own configured timeout.

How do I differentiate between a backend timeout and a network connectivity issue?

Check CloudWatch IntegrationLatency metric: if it's close to 29-30 seconds, it's a timeout (backend is slow). If it's low (under 1 second), the backend likely returned a non-2xx status or malformed response. Also check the NLB target health and VPC Flow Logs – if traffic is rejected, it's a network/security group issue.

Can CORS cause a 502 error in API Gateway?

CORS itself doesn't cause a 502, but if the OPTIONS method is not configured correctly (e.g., missing integration), API Gateway can return a 502 for the preflight request. This is often misdiagnosed as a CORS issue. Ensure the OPTIONS method has a mock integration or a backend that responds with the appropriate CORS headers.

What does 'Execution failed due to integration error' mean in CloudWatch logs?

This generic message means API Gateway could not complete the integration request. The next line usually provides more detail: 'Endpoint request timed out' (timeout), 'Network error' (connectivity), or 'Invalid response' (backend response parsing failed). Always look at the full log stream for the specific error.
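
When a log stream contains many failures, counting them by detail message shows which cause dominates. This sketch matches on the message fragments quoted above plus 'Malformed Lambda proxy response'; treat the exact strings as assumptions and adjust them to what your own log streams contain.

```python
from collections import Counter
from typing import Iterable

# Message fragment -> category. Adjust to the wording in your own logs.
PATTERNS = {
    "Endpoint request timed out": "timeout",
    "Network error": "connectivity",
    "Invalid response": "response parsing",
    "Malformed Lambda proxy response": "response parsing",
}


def triage(lines: Iterable[str]) -> Counter:
    """Count execution-log failure lines by probable cause."""
    counts: Counter = Counter()
    for line in lines:
        for fragment, category in PATTERNS.items():
            if fragment in line:
                counts[category] += 1
                break
        else:
            # A failure line with no recognized detail still gets counted.
            if "Execution failed" in line:
                counts["unclassified"] += 1
    return counts
```

Feeding it the message fields returned by aws logs filter-log-events gives a quick breakdown to decide which section of this guide to follow.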

How can I test API Gateway with VPC Link without hitting production?

Create a separate stage (e.g., 'test') with its own VPC Link endpoint and backend. You can also use the API Gateway 'test' feature in the console, which allows you to invoke the method directly without deploying. However, the test feature does not use VPC Link – it sends traffic from the API Gateway service itself. For VPC Link testing, you must deploy to a stage and send requests via the public endpoint.
