AWS ECS Task Not Starting: Definition Error Debug

What this usually means

The task definition is syntactically valid (registration raised no schema validation error), but the runtime cannot fulfill one of its requirements. The most common hidden causes are: (1) the container image URI points to a non-existent tag or an ECR repo the task execution role can't access, (2) the task execution role lacks `ecr:GetAuthorizationToken` or another ECR pull permission such as `ecr:BatchGetImage`, (3) the requested memory or CPU exceeds the container instance's remaining capacity, or (4) a secret in AWS Secrets Manager is referenced but the execution role can't read it. ECS doesn't surface these during task definition validation; they only appear at runtime, as stop reasons or service events.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

1. Run `aws ecs describe-tasks --cluster <cluster> --tasks <taskArn>` and examine the `stopCode` and `stoppedReason` fields.
2. Check the task definition's `executionRoleArn` and ensure the role includes the `AmazonECSTaskExecutionRolePolicy` managed policy (or an equivalent custom policy).
3. Verify the image URI in the task definition: `aws ecs describe-task-definition --task-definition <family>:<revision> | jq '.taskDefinition.containerDefinitions[0].image'`
4. Attempt to pull the image locally with `docker pull <imageUri>` to confirm the tag exists. A local pull uses your own credentials and network, so it does not prove that the execution role or the task's subnet can reach the image.
5. On the EC2 launch type, review the ECS agent log on the container instance (via SSH or SSM) at `/var/log/ecs/ecs-agent.log.*` for `CannotPullContainerError` details.
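Step 1 can be partly automated. The sketch below is a minimal example, assuming you have already saved the JSON output of `aws ecs describe-tasks` and loaded it into a dict. The function name `summarize_stopped_tasks` and the hint table are illustrative assumptions, not an official or complete mapping of ECS errors.

```python
from __future__ import annotations

# Substrings that commonly appear in ECS stop reasons, mapped to a likely cause.
# Illustrative only: this is not an exhaustive or official list.
HINTS = [
    ("CannotPullContainerError", "image pull failed: check tag, execution role, network"),
    ("ResourceInitializationError", "setup failed: check secrets, env files, network"),
    ("OutOfMemoryError", "container exceeded its memory hard limit"),
]


def summarize_stopped_tasks(describe_tasks_output: dict) -> list[dict]:
    """Collect stop details for each task in a describe-tasks response."""
    summaries = []
    for task in describe_tasks_output.get("tasks", []):
        reasons = [task.get("stoppedReason", "")]
        # Container-level reasons often carry the more specific message.
        reasons += [c.get("reason", "") for c in task.get("containers", [])]
        text = " ".join(r for r in reasons if r)
        hint = next((h for key, h in HINTS if key in text), "no known pattern matched")
        summaries.append({
            "taskArn": task.get("taskArn"),
            "stopCode": task.get("stopCode"),
            "reason": text,
            "hint": hint,
        })
    return summaries
```

To use it, load the saved output with `json.load` and print the returned list; the `hint` field only suggests where to look first.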

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- ECS task stop reason: `aws ecs describe-tasks --cluster ... --tasks ...` output
- Task definition JSON: `aws ecs describe-task-definition --task-definition family:revision`
- Execution role IAM policy: verify `ecr:GetAuthorizationToken` and `ecr:BatchGetImage` are allowed
- Container instance ECS agent logs: `/var/log/ecs/ecs-agent.log*` on the host (EC2 launch type only)
- CloudWatch log group for the task (if defined): an empty group often means the container never started
- ECS service events: the `events` field of `aws ecs describe-services --cluster ... --services ...`
- AWS CloudTrail: look up `RegisterTaskDefinition` events to see who registered the task definition, and `RunTask` or `CreateService` events to see who launched it

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- Task execution role missing ECR pull permissions (most common)
- Container image tag doesn't exist in ECR or Docker Hub
- Task definition memory hard limit is set below what the application actually needs (a small Alpine-based service may fit in roughly 128 MB; a Java app often needs 512 MB or more)
- VPC/subnet configuration prevents the ECS agent from reaching ECR (no NAT gateway or VPC endpoints)
- Secrets Manager reference in the task definition, but the execution role lacks `secretsmanager:GetSecretValue`
- Container port mapping conflicts with an existing host port (bridge networking mode)
- Task definition uses `awsvpc` network mode, but the security group doesn't allow outbound internet access (for public images)

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- Attach `arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy` to the execution role
- Update the task definition image URI to use an existing tag: fetch the current definition with `aws ecs describe-task-definition --task-definition ...`, then register a new revision with the corrected URI
- Increase memory and CPU in the task definition: set `memory` to at least 256 for simple images, or use `memoryReservation` for soft limits
- Add VPC endpoints for ECR API and ECR DKR, plus an S3 gateway endpoint for image layers (or ensure the subnet route table points to a NAT gateway)
- If using Secrets Manager, add `secretsmanager:GetSecretValue` to the execution role, constrained to the ARNs of the secrets the task references
- For bridge networking, set the host port in the container port mapping to 0 (dynamic port) to avoid conflicts
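Registering a new revision with a corrected image URI usually means editing the JSON that `describe-task-definition` returns. The sketch below shows one way to prepare that payload, assuming the `taskDefinition` object has been loaded into a dict. The helper name `with_new_image` is hypothetical; the list of read-only fields reflects fields that `describe-task-definition` returns but `register-task-definition` does not accept, and may need extending for your definitions.

```python
import copy

# Fields present in describe-task-definition output that are not valid
# input to register-task-definition.
READ_ONLY_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
)


def with_new_image(task_definition: dict, container_name: str, image: str) -> dict:
    """Return a registration payload with one container's image replaced."""
    payload = copy.deepcopy(task_definition)  # leave the caller's dict untouched
    for field in READ_ONLY_FIELDS:
        payload.pop(field, None)
    matched = False
    for container in payload.get("containerDefinitions", []):
        if container.get("name") == container_name:
            container["image"] = image
            matched = True
    if not matched:
        raise ValueError(f"no container named {container_name!r}")
    return payload
```

The returned dict can be written to a file with `json.dump` and passed to `aws ecs register-task-definition --cli-input-json file://<path>`.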

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- Run `aws ecs run-task --cluster ... --task-definition ...`, then poll `describe-tasks` until `lastStatus` becomes `RUNNING`
- Check CloudWatch logs for the container's stdout/stderr within 30 seconds of task start
- Verify the task can be reached on its assigned IP/port (if `awsvpc` network mode)
- Reproduce the pull with a `docker pull` that uses credentials from the execution role (run `aws ecr get-login-password`, then `docker login`)
- Monitor ECS service events: the events list from `aws ecs describe-services --cluster ... --services ...` should show no new errors
- As a negative test, reduce the memory limit to 128 MB and confirm the container's stop reason reports `OutOfMemoryError`, then restore the correct value
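The "poll until `RUNNING`" step can be sketched as a small loop. This is a minimal example under stated assumptions: `fetch_task` is a hypothetical callable that you supply and that returns the current task dict (for example, the first element of `tasks` in a `describe-tasks` response); the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_task(fetch_task, timeout_s=120.0, interval_s=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until the task is RUNNING; fail fast if it reaches STOPPED."""
    deadline = clock() + timeout_s
    while True:
        task = fetch_task()
        status = task.get("lastStatus")
        if status == "RUNNING":
            return task
        if status == "STOPPED":
            # Surface the stop reason instead of waiting for the timeout.
            reason = task.get("stoppedReason", "no stoppedReason reported")
            raise RuntimeError(f"task stopped before running: {reason}")
        if clock() >= deadline:
            raise TimeoutError(f"task still {status} after {timeout_s}s")
        sleep(interval_s)
```

With boto3, `fetch_task` could be something like `lambda: ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"][0]`, where `ecs`, `cluster`, and `task_arn` are your own values. The `sleep` and `clock` parameters exist only so the loop can be tested without real delays.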

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Don't assume the task definition is correct because the API accepted it; runtime validation is separate
- Don't forget that the execution role and the task role are different: the execution role pulls images, the task role is for container code
- Don't use a mutable tag like `latest` in production; pin to a specific digest or tag that you've verified exists
- Don't overlook that ECR repositories are region-specific: the region in the image URI must be the region where the image was actually pushed
- Don't waste time on the container definition's `entryPoint` or `command` until you've confirmed the image pulls
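Two of the checks above (the tag and the region) can be done on the URI string alone. The following is a rough sketch, assuming private ECR URIs of the usual `<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>` shape. The function name `check_image_uri` and the warning wording are illustrative; a region mismatch is reported as something to double-check, not as a definite error.

```python
import re

# Pattern for private ECR image URIs; other registries are skipped.
ECR_IMAGE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(?:\.cn)?/"
    r"(?P<repo>[^:@]+)(?::(?P<tag>[^@]+))?(?:@(?P<digest>sha256:[0-9a-f]{64}))?$"
)


def check_image_uri(image: str, task_region: str) -> list:
    """Return human-readable warnings about an image URI (empty list if none)."""
    match = ECR_IMAGE.match(image)
    if match is None:
        return ["not a private ECR URI; region and tag checks skipped"]
    warnings = []
    if match["region"] != task_region:
        warnings.append(
            f"image region {match['region']} differs from task region {task_region}"
        )
    # A missing tag defaults to 'latest', which is just as mutable.
    if match["digest"] is None and match["tag"] in (None, "latest"):
        warnings.append("image uses the mutable 'latest' tag; pin a verified tag or digest")
    return warnings
```

For example, calling it with `123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:v2.1.0` and `us-east-1` returns an empty list.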

( 07 ) War story

The Phantom Image Pull

Platform Engineer
AWS ECS (Fargate), ECR, Python Flask app, GitHub Actions CI

Timeline

09:15 Deploy pipeline pushes new image tag 'v2.1.0' to ECR
09:17 Pipeline triggers ECS service update with new task definition revision
09:20 Service events show 'service unable to consistently start tasks' – 3 tasks stopped
09:22 I run `aws ecs describe-tasks --cluster prod --tasks arn:...` – stopped reason is 'CannotPullContainerError'
09:25 Check image URI in task definition: `123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:v2.1.0`
09:28 Log into ECR and try `docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:v2.1.0` – succeeds locally
09:30 Inspect execution role – it's a custom role with only `ecr:BatchGetImage` and `ecr:GetDownloadUrlForLayer`, and no `ecr:GetAuthorizationToken`
09:32 Attach AmazonECSTaskExecutionRolePolicy to the role
09:34 Redeploy service – new tasks become RUNNING within 30 seconds

I was on-call when the pager went off at 09:20: 'ECS service prod-api has 3 failed tasks.' The pipeline had pushed a new image tag and deployed a fresh task definition. My first instinct was to check the task stop reasons. `describe-tasks` gave me `CannotPullContainerError` – classic image pull failure. I immediately thought the image tag didn't exist in ECR, so I ran `docker pull` with the exact URI from the task definition. It worked fine locally. That ruled out a missing tag or ECR repo issue.

Next I looked at the execution role. The task definition referenced a custom role named `ecsExecutionRoleProd`. I pulled the IAM policy and saw only `ecr:BatchGetImage` and `ecr:GetDownloadUrlForLayer` – but critically, no `ecr:GetAuthorizationToken`. The ECS agent needs that permission to authenticate with ECR before pulling. Without it, the pull request is denied even though the image exists. That was the bug.

The fix was trivial: attach the AWS managed `AmazonECSTaskExecutionRolePolicy` to the role. That policy includes `ecr:GetAuthorizationToken`, `ecr:BatchCheckLayerAvailability`, `ecr:GetDownloadUrlForLayer`, and `ecr:BatchGetImage`. I redeployed the service and within 30 seconds the tasks were running. The lesson: always verify the execution role has the complete set of ECR pull permissions, not just a subset. And never assume a custom role is sufficient without checking the exact API calls the ECS agent makes.

Root cause

Task execution role missing `ecr:GetAuthorizationToken` permission, causing ECS agent to fail authentication when pulling from ECR.

The fix

Attached the managed policy `AmazonECSTaskExecutionRolePolicy` to the execution role.

The lesson

Always verify the execution role includes all required ECR permissions, especially `ecr:GetAuthorizationToken`. Use the managed policy unless you have a specific reason to create a custom one—and if you do, test the pull with the same IAM context.

( 08 ) How ECS Authenticates and Pulls Container Images

When ECS launches a task, the ECS agent on the container instance (or the Fargate agent) needs to pull the container image. For ECR repositories, this requires a multi-step authentication process. First, the agent calls `ecr:GetAuthorizationToken` to get a base64-encoded token. It then uses that token to log in to the ECR registry. After login, it uses `ecr:BatchGetImage` to fetch the image manifest, and `ecr:BatchCheckLayerAvailability` and `ecr:GetDownloadUrlForLayer` to locate and download the layers.

The execution role must have permissions for all these actions. The managed policy `AmazonECSTaskExecutionRolePolicy` includes them all. A custom role that only includes `ecr:BatchGetImage` will fail at the first step, `GetAuthorizationToken`, because the agent cannot obtain credentials to authenticate. The error in `describe-tasks` will be `CannotPullContainerError`, which covers many different failures. On the EC2 launch type, inspect the ECS agent logs on the instance to see the specific HTTP 401 or 403 response from ECR; on Fargate, where the host is not accessible, the text that follows `CannotPullContainerError` in `stoppedReason` is usually the most detail available.
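A quick way to spot an incomplete custom policy is to compare its `Allow` statements against the four ECR pull actions named in the managed policy. The sketch below assumes the policy document has been loaded into a dict (for example, from `aws iam get-policy-version` output). The function name `missing_ecr_actions` is hypothetical, and the check is deliberately simplified: it ignores `Deny` statements, `NotAction`, conditions, and resource scoping.

```python
from fnmatch import fnmatchcase

# The ECR pull actions included in AmazonECSTaskExecutionRolePolicy.
REQUIRED_ECR_ACTIONS = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
]


def _as_list(value):
    """IAM allows a single string where a list is expected; normalize it."""
    return value if isinstance(value, list) else [value]


def missing_ecr_actions(policy_document: dict) -> list:
    """Return required ECR actions not matched by any Allow statement."""
    allowed = []
    for statement in _as_list(policy_document.get("Statement", [])):
        if statement.get("Effect") == "Allow":
            allowed += _as_list(statement.get("Action", []))
    # IAM action names are case-insensitive and may contain * or ? wildcards.
    return [
        required for required in REQUIRED_ECR_ACTIONS
        if not any(fnmatchcase(required.lower(), pattern.lower()) for pattern in allowed)
    ]
```

Run against a policy that allows only `ecr:BatchGetImage` and `ecr:GetDownloadUrlForLayer`, it reports `ecr:GetAuthorizationToken` and `ecr:BatchCheckLayerAvailability` as missing.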

For public images from Docker Hub, no ECR permissions are needed, but the task or instance must have outbound internet access (via a NAT gateway, or an internet gateway plus a public IP). For Fargate tasks in private subnets that pull from ECR, the VPC needs either a NAT gateway or VPC endpoints for ECR API and DKR, plus an S3 gateway endpoint because ECR serves image layers from S3. Missing network connectivity is another common sub-cause that manifests as the same `CannotPullContainerError`.

( 09 ) Memory and CPU Limits: The Silent Task Killer

ECS task definitions have both a `memory` hard limit (the container can't exceed this) and a `memoryReservation` soft limit (the minimum reserved for it). If the hard limit is set too low, the container is killed and its stop reason begins with `OutOfMemoryError`. This often happens with Java applications where the JVM heap alone can exceed 256 MB, but the task definition only allocates 128 MB.

The tricky part is that ECS doesn't validate memory against the image's actual requirements at registration time. The task will start, run for a few seconds, then stop. To debug, look for `OutOfMemoryError` in the container's stop reason (exit code 137 is another sign), or check CloudWatch logs for abrupt termination. `docker inspect` on an image does not report memory usage, so run the container locally and measure peak consumption with `docker stats`. Then set `memory` to at least that peak plus a buffer (e.g., 512 MB for a typical Java microservice).

CPU limits work similarly. If you set `cpu` to 256 (0.25 vCPU) but the image is CPU-bound, the container may become unresponsive but not necessarily stop. The task stays RUNNING but health checks fail. Always ensure the CPU allocation is realistic for your workload.
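The "peak plus a buffer" rule is simple arithmetic and can be checked before registering a revision. The sketch below assumes you measured a peak (in MiB) yourself, for instance with `docker stats`. The function name `check_memory` and the 25% default headroom are arbitrary illustrations, not ECS recommendations.

```python
import math


def check_memory(container_definition: dict, observed_peak_mib: float,
                 headroom: float = 0.25) -> list:
    """Flag memory settings that look too tight for the observed peak."""
    problems = []
    hard = container_definition.get("memory")
    soft = container_definition.get("memoryReservation")
    needed = math.ceil(observed_peak_mib * (1 + headroom))
    # When both are set, ECS requires the hard limit to exceed the soft limit.
    if hard is not None and soft is not None and hard <= soft:
        problems.append("memory must be greater than memoryReservation")
    if hard is not None and hard < needed:
        problems.append(
            f"memory={hard} MiB is below the observed peak plus headroom ({needed} MiB)"
        )
    return problems
```

A container definition with `memory` set to 128 and an observed peak of 300 MiB is flagged, because 300 MiB plus 25% headroom is 375 MiB.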

( 10 ) Secrets and Environment Variables from AWS Secrets Manager

ECS task definitions can reference secrets from AWS Secrets Manager or Parameter Store using the `secrets` array in the container definition. If the execution role lacks `secretsmanager:GetSecretValue` (or `ssm:GetParameters` for Parameter Store), the container will fail to start with `ResourceInitializationError`.

I've seen cases where the task definition references a secret by ARN, but the execution role policy is scoped to a specific secret ARN that doesn't match (e.g., a wildcard in the wrong place). The error message will say something like 'cannot retrieve secret' but may not make clear which secret is at fault. The fix is to review the IAM policy's resource block and ensure it covers all referenced secrets. If you tag secrets with environment names, a `secretsmanager:ResourceTag/<tag-key>` condition can scope access by tag.

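Another gotcha is the region. A Parameter Store parameter can be referenced by name only when it lives in the task's region; a secret or parameter in another region must be referenced by its full ARN, which includes the region. A reference that points at the wrong region is accepted at registration and fails only when ECS tries to fetch the value at task start.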
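Checking whether the policy's resource block covers every referenced secret can be sketched as a pattern match. This example assumes both the container definition and the execution role's policy document are available as dicts. The function name `uncovered_secret_names` is hypothetical, and the check only covers Secrets Manager ARNs and `Allow` statements; it ignores conditions, `Deny` statements, and Parameter Store references.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string where a list is expected; normalize it."""
    return value if isinstance(value, list) else [value]


def uncovered_secret_names(container_definition: dict, policy_document: dict) -> list:
    """Names of `secrets` entries whose ARN no Allow statement appears to cover."""
    patterns = []
    for statement in _as_list(policy_document.get("Statement", [])):
        actions = [a.lower() for a in _as_list(statement.get("Action", []))]
        grants_read = any(fnmatchcase("secretsmanager:getsecretvalue", a) for a in actions)
        if statement.get("Effect") == "Allow" and grants_read:
            patterns += _as_list(statement.get("Resource", []))

    uncovered = []
    for secret in container_definition.get("secrets", []):
        parts = secret["valueFrom"].split(":")
        if len(parts) < 7 or parts[2] != "secretsmanager":
            continue  # not a Secrets Manager ARN; out of scope for this sketch
        # Drop any trailing ":json-key:version-stage:version-id" suffix.
        arn = ":".join(parts[:7])
        if not any(fnmatchcase(arn, pattern) for pattern in patterns):
            uncovered.append(secret["name"])
    return uncovered
```

An empty result does not prove the role can read the secrets, but a non-empty one points at the entries to examine first.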

( 11 ) Network Configuration and VPC Endpoints for Fargate

Fargate tasks use `awsvpc` network mode, meaning each task gets its own elastic network interface (ENI). The task must be able to reach ECR and CloudWatch Logs (if logging is configured). If the task's subnet has no path to those services (no NAT gateway, no VPC endpoints, or a public subnet where the task has no public IP), the task will fail to pull images or send logs. On Fargate the failure appears as a timeout in the task's `stoppedReason`, since the host's agent logs are not accessible.

The proper fix is to either (1) create VPC endpoints for `com.amazonaws.region.ecr.api`, `com.amazonaws.region.ecr.dkr`, and `com.amazonaws.region.logs`, plus a gateway endpoint for `com.amazonaws.region.s3`, in the same VPC, or (2) ensure the subnet routes to a NAT gateway. VPC endpoints keep the traffic off the public internet; whether they are also cheaper than a NAT gateway depends on your traffic volume. Without one of the two, tasks in private subnets cannot pull images.

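A common misconfiguration is creating the endpoints but leaving them unusable from the task's subnet. The ECR and CloudWatch Logs endpoints are interface endpoints: they do not use route tables, so verify that private DNS is enabled and that the endpoint's security group allows inbound HTTPS (port 443) from the tasks. The S3 endpoint is a gateway endpoint: it must be associated with the route table of the task's subnet, otherwise the endpoint appears active but the subnet won't use it.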
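The endpoint service names follow a fixed pattern, so a missing one is easy to detect from a list of what already exists. The sketch below assumes you have collected the `ServiceName` values of the VPC's endpoints (for example, from `aws ec2 describe-vpc-endpoints`). The function names are hypothetical. Note that it also includes the S3 gateway endpoint, which ECR relies on for layer downloads.

```python
def required_endpoint_services(region: str, uses_awslogs: bool = True,
                               uses_secrets_manager: bool = False) -> dict:
    """Map endpoint service names to their endpoint type for a private-subnet task."""
    services = {
        f"com.amazonaws.{region}.ecr.api": "Interface",
        f"com.amazonaws.{region}.ecr.dkr": "Interface",
        f"com.amazonaws.{region}.s3": "Gateway",  # ECR stores image layers in S3
    }
    if uses_awslogs:
        services[f"com.amazonaws.{region}.logs"] = "Interface"
    if uses_secrets_manager:
        services[f"com.amazonaws.{region}.secretsmanager"] = "Interface"
    return services


def missing_endpoints(region: str, existing_service_names, **options) -> list:
    """Return required service names that have no endpoint yet, sorted."""
    required = required_endpoint_services(region, **options)
    return sorted(set(required) - set(existing_service_names))
```

A present endpoint can still be misconfigured, so treat an empty result as a first pass, not a proof of connectivity.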

( 12 ) Task Definition Registration vs. Runtime Validation

AWS ECS validates the task definition schema at registration time—it checks that required fields exist and types are correct. However, it does NOT validate that the image URI exists, that referenced secrets exist, or that the execution role has sufficient permissions. This is by design: ECS doesn't make network calls to external services during registration.

This means a task definition can be registered successfully but fail every time it's run. The only way to catch these issues is by actually running a task and examining the stop reason. This is a critical point: always do a test `run-task` after registering a new task definition revision, especially when changing the image, secrets, or roles.

CI/CD pipelines should include a step that runs a short-lived ECS task (with a command that exits immediately) and waits for it to reach STOPPED with a zero exit code. This will catch most definition errors before they hit production.
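The pass/fail decision for such a smoke test can be sketched as a pure function over the stopped task. This assumes the pipeline has already run the task, waited for it to stop, and loaded one task object from `describe-tasks` into a dict. The function name `evaluate_smoke_task` and the message format are illustrative.

```python
def evaluate_smoke_task(task: dict):
    """Return (passed, message) for a task that was expected to exit cleanly."""
    if task.get("lastStatus") != "STOPPED":
        return False, f"task is still {task.get('lastStatus')}"
    containers = task.get("containers", [])
    if not containers:
        return False, "no containers reported for the task"
    failures = []
    for container in containers:
        code = container.get("exitCode")
        # A container that never started has no exitCode at all.
        if code != 0:
            detail = (container.get("reason") or task.get("stoppedReason")
                      or "no reason reported")
            failures.append(f"{container.get('name')}: exitCode={code} ({detail})")
    if failures:
        return False, "; ".join(failures)
    return True, "all containers exited 0"
```

A pipeline step would fail the build when the first element is `False` and print the message, which carries the stop reason needed for diagnosis.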

Frequently asked questions

What's the difference between task execution role and task role?

The execution role is used by the ECS agent to pull container images, access secrets, and send logs to CloudWatch. The task role is assumed by the container's application code to interact with AWS services (e.g., S3, DynamoDB). They are separate IAM roles. A common mistake is giving the task role ECR permissions but forgetting the execution role.

Why does my task fail with 'CannotPullContainerError' even though the image exists in ECR?

Most often, the execution role is missing `ecr:GetAuthorizationToken`. Verify the role has `AmazonECSTaskExecutionRolePolicy` attached. Also check network connectivity: if the task is in a private subnet without a NAT gateway or VPC endpoints, the agent can't reach ECR at all. Run `docker pull` from an instance in the same VPC to test connectivity.

How do I see the exact error message from the ECS agent?

For EC2 launch type, SSH into the container instance and check `/var/log/ecs/ecs-agent.log.*` for log lines containing 'CannotPullContainerError'. For Fargate, you cannot access the underlying host. Instead, use `aws ecs describe-tasks` and examine the `stoppedReason` field, which usually contains the error message from the agent. You can also enable ECS Exec to get a shell into the container after it starts (if it ever starts).

Can a task definition error be caused by a misconfigured security group?

Indirectly, yes. If the task uses `awsvpc` network mode, the security group attached to the task's ENI controls outbound traffic. If it blocks outbound HTTPS (port 443), the ECS agent cannot pull images from ECR or Docker Hub. Ensure the security group has an outbound rule allowing HTTPS to 0.0.0.0/0 (or to the VPC endpoint prefix list).
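The outbound rule can be checked from the security group's description. This is a rough sketch, assuming one security group object from `aws ec2 describe-security-groups` has been loaded into a dict. The function name `allows_outbound_https` is hypothetical, and the check only confirms that some egress rule covers TCP port 443; it does not verify that the rule's destination actually includes ECR, Docker Hub, or your VPC endpoints.

```python
def allows_outbound_https(security_group: dict) -> bool:
    """True if any egress rule permits TCP 443 to at least one destination."""
    for rule in security_group.get("IpPermissionsEgress", []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            port_ok = rule.get("FromPort", 0) <= 443 <= rule.get("ToPort", 65535)
        else:
            port_ok = False
        has_destination = bool(
            rule.get("IpRanges") or rule.get("Ipv6Ranges")
            or rule.get("PrefixListIds") or rule.get("UserIdGroupPairs")
        )
        if port_ok and has_destination:
            return True
    return False
```

A group with no egress rules at all returns `False`, which matches the symptom described above: the task's ENI cannot reach any registry.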

What does 'ResourceInitializationError' mean?

This error usually occurs when the ECS agent fails to set up the container environment, often due to a missing secret or environment file. Check the task definition's `secrets` array and make sure each secret ARN exists and the execution role has `secretsmanager:GetSecretValue`. If the task definition loads an environment file from S3 (`environmentFiles`), also check that the execution role is allowed to read that object.
