ECS Fargate Production Patterns That Actually Work#

I’ve deployed and managed many containerized services on ECS Fargate. Over time, a set of patterns has emerged that I apply consistently to every new service. This post documents those patterns with Terraform examples, covering everything from Fargate Spot strategies to deployment circuit breakers and ARM64 migration.

The Standard Architecture#

Every service I deploy follows the same high-level architecture:

Internet/VPC -> ALB (HTTPS, TLS 1.3) -> ECS Fargate -> Aurora PostgreSQL Serverless v2
                 |
                WAF (rate limiting + AWS managed rules)

Each component has its own security group, with traffic flowing only from the layer above. The ALB sits in private subnets (no public-facing services), and Route53 private hosted zones handle internal DNS.
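The three-tier security group model can be sketched like this (resource names, the `aws_security_group.alb` reference, and variables are illustrative, not the actual modules):

resource "aws_security_group" "ecs" {
  name   = "${var.service_name}-ecs"
  vpc_id = var.vpc_id

  # ECS tasks accept traffic only from the ALB's security group
  ingress {
    from_port       = var.container_port
    to_port         = var.container_port
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "aurora" {
  name   = "${var.service_name}-aurora"
  vpc_id = var.vpc_id

  # Aurora accepts traffic only from the ECS tasks' security group
  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs.id]
  }
}

Each layer references the security group of the layer above it as the ingress source, so there are no CIDR-based rules between tiers.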

Fargate Spot Strategy#

Fargate Spot can save up to 70% on compute costs, but you need to handle interruptions. The approach: use a weighted capacity provider strategy that balances cost savings with availability.

resource "aws_ecs_cluster" "main" {
  name = "${var.service_name}-${var.environment}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = var.tags
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name

  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = var.ondemand_weight
    base              = 1  # At least 1 task on regular Fargate
  }

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = var.spot_weight
  }
}

The base = 1 on FARGATE ensures you always have at least one task running on On-Demand capacity. This is your safety net during Spot interruptions.

For non-production, I use a 4:1 Spot-to-OnDemand ratio. For production, I flip it to 1:4, prioritizing stability while still getting some Spot savings.
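Concretely, those ratios map onto the weight variables used above (a sketch of per-environment tfvars):

# staging.tfvars: favor Spot 4:1
spot_weight     = 4
ondemand_weight = 1

# production.tfvars: favor On-Demand 4:1
spot_weight     = 1
ondemand_weight = 4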

Deployment Circuit Breakers#

ECS deployment circuit breakers automatically roll back failed deployments. Combined with the right health check configuration, they prevent bad deployments from taking down your service:

resource "aws_ecs_service" "main" {
  name            = var.service_name
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.main.arn
  desired_count   = var.desired_count

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  load_balancer {
    target_group_arn = var.target_group_arn
    container_name   = var.service_name
    container_port   = var.container_port
  }

  network_configuration {
    subnets          = var.subnet_ids
    security_groups  = [var.security_group_id]
    assign_public_ip = false
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = var.ondemand_weight
    base              = 1  # Safety net: one task always On-Demand
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = var.spot_weight
  }

  tags = var.tags
}

Allowing 200% maximum capacity while requiring 100% minimum healthy means ECS spins up new tasks before draining old ones (a rolling deployment). If the new tasks fail health checks, the circuit breaker kicks in and rolls back.

Health Check Configuration#

Getting health checks right is critical. Too aggressive and you get false positives; too lenient and failed deployments take forever to detect:

resource "aws_lb_target_group" "main" {
  name        = "${var.service_name}-${var.environment}"
  port        = var.container_port
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    enabled             = true
    path                = "/health"
    port                = "traffic-port"
    protocol            = "HTTP"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 10
    interval            = 30
    matcher             = "200"
  }

  deregistration_delay = 30

  tags = var.tags
}

A few things to note:

  • deregistration_delay = 30 instead of the default 300 seconds. Most applications can drain in-flight requests in 30 seconds, and the shorter delay means faster deployments.
  • healthy_threshold = 2 means a task needs only 2 successful health checks to be considered healthy (60 seconds with a 30-second interval).
  • unhealthy_threshold = 3 gives tasks 90 seconds of failed health checks before being marked unhealthy.

ALB with Path-Based Routing#

For services with separate frontend and backend containers, path-based routing on a single ALB keeps things simple:

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.frontend.arn
  }

  tags = var.tags
}

resource "aws_lb_listener_rule" "api" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.backend.arn
  }

  condition {
    path_pattern {
      values = ["/api/*"]
    }
  }

  tags = var.tags
}

Note the TLS 1.3 security policy (ELBSecurityPolicy-TLS13-1-2-2021-06 also permits TLS 1.2 for client compatibility). There’s no reason to support TLS 1.1 or older for internal services.

Auto-Scaling#

ECS services should auto-scale on both CPU and memory. I use target tracking policies:

resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = var.max_count
  min_capacity       = var.min_count
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.service_name}-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

resource "aws_appautoscaling_policy" "memory" {
  name               = "${var.service_name}-memory-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
    target_value       = 80.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

The asymmetric cooldowns matter: scale_out_cooldown = 60 means the service reacts quickly to load spikes, while scale_in_cooldown = 300 prevents premature scale-down during bursty traffic.

Migrating to ARM64 (Graviton)#

AWS Graviton (ARM64) compute offers roughly 20% better price-performance than x86 on Fargate. Migrating ECS Fargate tasks to ARM64 is straightforward if your images support it:

# Multi-arch Dockerfile
FROM --platform=$TARGETPLATFORM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and push multi-arch images:

docker buildx create --use
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t $ECR_REPO:latest \
  --push .

Then update the task definition:

resource "aws_ecs_task_definition" "main" {
  family                   = var.service_name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = var.execution_role_arn
  task_role_arn            = var.task_role_arn

  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "ARM64"
  }

  container_definitions = jsonencode([{
    name      = var.service_name
    image     = var.image
    essential = true

    portMappings = [{
      containerPort = var.container_port
      protocol      = "tcp"
    }]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = var.log_group_name
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = var.service_name
      }
    }
  }])

  tags = var.tags
}

The key line is cpu_architecture = "ARM64". That’s it. If your Docker image is multi-arch, Fargate pulls the right architecture automatically.

Secrets Management#

Never bake secrets into container images. Use AWS Secrets Manager with ECS native integration:

resource "aws_secretsmanager_secret" "db_credentials" {
  name = "${var.service_name}/${var.environment}/db-credentials"
  tags = var.tags
}

# In the container definition
container_definitions = jsonencode([{
  name  = var.service_name
  image = var.image

  secrets = [
    {
      name      = "DATABASE_URL"
      valueFrom = aws_secretsmanager_secret.db_credentials.arn
    },
    {
      name      = "DJANGO_SECRET_KEY"
      valueFrom = "${aws_secretsmanager_secret.app_secrets.arn}:django_secret_key::"
    }
  ]
}])

ECS injects the secret values as environment variables at task startup. The execution role needs secretsmanager:GetSecretValue permission on the specific secret ARNs.
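That permission can be granted with an inline policy on the execution role, along these lines (the role variable name is illustrative):

resource "aws_iam_role_policy" "read_secrets" {
  name = "${var.service_name}-read-secrets"
  role = var.execution_role_name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "secretsmanager:GetSecretValue"
      # Scope to the specific secret ARNs, never "*"
      Resource = [
        aws_secretsmanager_secret.db_credentials.arn,
        aws_secretsmanager_secret.app_secrets.arn,
      ]
    }]
  })
}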

CloudWatch Logging with Sane Retention#

Every service gets a log group with a retention policy. Keeping logs forever is expensive and rarely useful:

resource "aws_cloudwatch_log_group" "main" {
  name              = "/ecs/${var.service_name}/${var.environment}"
  retention_in_days = 14

  tags = var.tags
}

14 days is usually enough for debugging. If you need longer retention for compliance, ship logs to S3 or OpenSearch.

Putting It All Together#

Here’s the complete pattern for a new service:

  1. ECR repository with lifecycle policy (keep last 10 images)
  2. ECS cluster with Container Insights enabled
  3. Task definition with ARM64, proper resource limits, secrets injection
  4. ECS service with circuit breaker, Spot strategy, auto-scaling
  5. ALB with HTTPS (TLS 1.3), path-based routing
  6. WAF with rate limiting and AWS managed rules
  7. Aurora Serverless v2 with environment-appropriate scaling
  8. Route53 private hosted zone record
  9. CloudWatch log group with 14-day retention
  10. Security groups with three-tier model (ALB -> ECS -> Aurora)
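For step 1, the ECR lifecycle policy looks roughly like this (repository naming is illustrative):

resource "aws_ecr_repository" "main" {
  name = var.service_name
  tags = var.tags
}

resource "aws_ecr_lifecycle_policy" "main" {
  repository = aws_ecr_repository.main.name

  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep only the last 10 images"
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 10
      }
      action = { type = "expire" }
    }]
  })
}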

Once you have this as a set of Terraform modules, deploying a new service is just composing the modules with service-specific variables. The infrastructure is consistent, secure, and cost-optimized across all environments.
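As a sketch of what that composition looks like (the module source path and variable set are hypothetical):

module "billing_api" {
  source = "./modules/ecs-service"

  service_name    = "billing-api"
  environment     = "production"
  cpu             = 512
  memory          = 1024
  container_port  = 8000
  spot_weight     = 1
  ondemand_weight = 4
  # ...plus networking, certificate, and tag variables
}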
