ECS Fargate Production Patterns That Actually Work
I’ve deployed and managed many containerized services on ECS Fargate. Over time, a set of patterns has emerged that I apply consistently to every new service. This post documents those patterns with Terraform examples, covering everything from Fargate Spot strategies to deployment circuit breakers and ARM64 migration.
The Standard Architecture#
Every service I deploy follows the same high-level architecture:
```
Internet/VPC -> ALB (HTTPS, TLS 1.3) -> ECS Fargate -> Aurora PostgreSQL Serverless v2
                     |
                    WAF (rate limiting + AWS managed rules)
```
Each component has its own security group, with traffic flowing only from the layer above. The ALB sits in private subnets (no public-facing services), and Route53 private hosted zones handle internal DNS.
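A sketch of that three-tier security group chain (resource names here are illustrative): each layer's ingress rule references the security group of the layer above it, so nothing else can reach it.

```hcl
# ECS tasks accept traffic only from the ALB's security group
resource "aws_security_group_rule" "alb_to_ecs" {
  type                     = "ingress"
  security_group_id        = aws_security_group.ecs.id
  source_security_group_id = aws_security_group.alb.id
  from_port                = var.container_port
  to_port                  = var.container_port
  protocol                 = "tcp"
}

# Aurora accepts traffic only from the ECS tasks' security group
resource "aws_security_group_rule" "ecs_to_aurora" {
  type                     = "ingress"
  security_group_id        = aws_security_group.aurora.id
  source_security_group_id = aws_security_group.ecs.id
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
}
```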
Fargate Spot Strategy#
Fargate Spot can save up to 70% on compute costs, but you need to handle interruptions. The approach: use a weighted capacity provider strategy that balances cost savings with availability.
```hcl
resource "aws_ecs_cluster" "main" {
  name = "${var.service_name}-${var.environment}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = var.tags
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = var.spot_weight
  }

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = var.ondemand_weight
    base              = 1 # At least 1 task on regular Fargate
  }
}
```
The base = 1 on FARGATE ensures you always have at least one task running on On-Demand capacity. This is your safety net during Spot interruptions.
For non-production, I use a 4:1 Spot-to-OnDemand ratio. For production, I flip it to 1:4, prioritizing stability while still getting some Spot savings.
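Concretely, those ratios are just a pair of weight variables per environment (hypothetical tfvars files):

```hcl
# staging.tfvars - favor Spot (4:1)
spot_weight     = 4
ondemand_weight = 1

# production.tfvars - favor On-Demand (1:4)
spot_weight     = 1
ondemand_weight = 4
```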
Deployment Circuit Breakers#
ECS deployment circuit breakers automatically roll back failed deployments. Combined with the right health check configuration, they prevent bad deployments from taking down your service:
```hcl
resource "aws_ecs_service" "main" {
  name            = var.service_name
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.main.arn
  desired_count   = var.desired_count

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  load_balancer {
    target_group_arn = var.target_group_arn
    container_name   = var.service_name
    container_port   = var.container_port
  }

  network_configuration {
    subnets          = var.subnet_ids
    security_groups  = [var.security_group_id]
    assign_public_ip = false
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = var.spot_weight
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = var.ondemand_weight
    base              = 1
  }

  tags = var.tags
}
```
A deployment maximum of 200% with a minimum healthy of 100% means ECS spins up new tasks before draining the old ones (a rolling deployment). If the new tasks fail their health checks, the circuit breaker kicks in and rolls back.
Health Check Configuration#
Getting health checks right is critical. Too aggressive and you get false positives; too lenient and failed deployments take forever to detect:
```hcl
resource "aws_lb_target_group" "main" {
  name        = "${var.service_name}-${var.environment}"
  port        = var.container_port
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    enabled             = true
    path                = "/health"
    port                = "traffic-port"
    protocol            = "HTTP"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 10
    interval            = 30
    matcher             = "200"
  }

  deregistration_delay = 30

  tags = var.tags
}
```
A few things to note:

- `deregistration_delay = 30` instead of the default 300 seconds. Most applications can drain in-flight requests in 30 seconds, and the shorter delay means faster deployments.
- `healthy_threshold = 2` means a task needs only 2 successful health checks to be considered healthy (60 seconds with a 30-second interval).
- `unhealthy_threshold = 3` gives tasks 90 seconds of failed health checks before being marked unhealthy.
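I also pair the ALB health check with a container-level health check in the task definition, so ECS restarts a wedged task even if it never makes it into the target group. A sketch, assuming `curl` exists in the image and the app listens on port 8000:

```hcl
# Inside the container definition (container_definitions jsonencode block)
healthCheck = {
  command     = ["CMD-SHELL", "curl -fsS http://localhost:8000/health || exit 1"]
  interval    = 30
  timeout     = 5
  retries     = 3
  startPeriod = 60 # grace period before failures count
}
```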
ALB with Path-Based Routing#
For services with separate frontend and backend containers, path-based routing on a single ALB keeps things simple:
```hcl
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.frontend.arn
  }

  tags = var.tags
}

resource "aws_lb_listener_rule" "api" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.backend.arn
  }

  condition {
    path_pattern {
      values = ["/api/*"]
    }
  }

  tags = var.tags
}
```
Note the TLS policy: ELBSecurityPolicy-TLS13-1-2-2021-06 negotiates TLS 1.3 with a TLS 1.2 fallback. There's no reason to support anything older for internal services.
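Alongside the HTTPS listener, I add a port-80 listener whose only job is to redirect to HTTPS, so nothing ever talks to the service in plaintext:

```hcl
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
```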
Auto-Scaling#
ECS services should auto-scale on both CPU and memory. I use target tracking policies:
```hcl
resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = var.max_count
  min_capacity       = var.min_count
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.service_name}-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }

    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

resource "aws_appautoscaling_policy" "memory" {
  name               = "${var.service_name}-memory-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }

    target_value       = 80.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```
The asymmetric cooldowns matter: scale_out_cooldown = 60 means the service reacts quickly to load spikes, while scale_in_cooldown = 300 prevents premature scale-down during bursty traffic.
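For request-bound services, a third target-tracking policy on ALB request count per target can react faster than CPU, which lags behind traffic. A sketch, assuming the ALB and target group resources from earlier (the target value is a per-service tuning knob, not a recommendation):

```hcl
resource "aws_appautoscaling_policy" "requests" {
  name               = "${var.service_name}-request-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # resource_label joins the ALB and target group ARN suffixes
      resource_label = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.main.arn_suffix}"
    }

    target_value       = 1000 # requests per target per minute
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```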
Migrating to ARM64 (Graviton)#
AWS Graviton (ARM64) offers roughly 20% better price-performance on Fargate than x86_64. Migrating ECS Fargate tasks to ARM64 is straightforward if your images support it:
```dockerfile
# Multi-arch Dockerfile: buildx resolves the right base image for each
# target platform, so FROM needs no --platform flag
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Build and push multi-arch images:
```shell
docker buildx create --use
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t $ECR_REPO:latest \
  --push .
```
Then update the task definition:
```hcl
resource "aws_ecs_task_definition" "main" {
  family                   = var.service_name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = var.execution_role_arn
  task_role_arn            = var.task_role_arn

  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "ARM64"
  }

  container_definitions = jsonencode([{
    name      = var.service_name
    image     = var.image
    essential = true

    portMappings = [{
      containerPort = var.container_port
      protocol      = "tcp"
    }]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = var.log_group_name
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = var.service_name
      }
    }
  }])

  tags = var.tags
}
```
The key line is cpu_architecture = "ARM64". That’s it. If your Docker image is multi-arch, Fargate pulls the right architecture automatically.
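To migrate gradually, I make the architecture a variable so a service can be flipped back to x86_64 with a one-line change if anything misbehaves (the variable name here is illustrative):

```hcl
runtime_platform {
  operating_system_family = "LINUX"
  cpu_architecture        = var.use_arm64 ? "ARM64" : "X86_64"
}
```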
Secrets Management#
Never bake secrets into container images. Use AWS Secrets Manager with ECS native integration:
```hcl
resource "aws_secretsmanager_secret" "db_credentials" {
  name = "${var.service_name}/${var.environment}/db-credentials"
  tags = var.tags
}
```

```hcl
# In the container definition
container_definitions = jsonencode([{
  name  = var.service_name
  image = var.image

  secrets = [
    {
      name      = "DATABASE_URL"
      valueFrom = aws_secretsmanager_secret.db_credentials.arn
    },
    {
      name      = "DJANGO_SECRET_KEY"
      valueFrom = "${aws_secretsmanager_secret.app_secrets.arn}:django_secret_key::"
    }
  ]
}])
```
ECS injects the secret values as environment variables at task startup. The execution role needs secretsmanager:GetSecretValue permission on the specific secret ARNs.
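That grant might look like the following sketch, scoped to the one secret defined above (the role reference is an assumption; wire in your own execution role):

```hcl
resource "aws_iam_role_policy" "execution_secrets" {
  name = "${var.service_name}-secrets-access"
  role = var.execution_role_name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "secretsmanager:GetSecretValue"
      # Only the specific secret ARNs, never a wildcard
      Resource = [
        aws_secretsmanager_secret.db_credentials.arn,
      ]
    }]
  })
}
```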
CloudWatch Logging with Sane Retention#
Every service gets a log group with a retention policy. Keeping logs forever is expensive and rarely useful:
```hcl
resource "aws_cloudwatch_log_group" "main" {
  name              = "/ecs/${var.service_name}/${var.environment}"
  retention_in_days = 14
  tags              = var.tags
}
```
14 days is usually enough for debugging. If you need longer retention for compliance, ship logs to S3 or OpenSearch.
Putting It All Together#
Here’s the complete pattern for a new service:
- ECR repository with lifecycle policy (keep last 10 images)
- ECS cluster with Container Insights enabled
- Task definition with ARM64, proper resource limits, secrets injection
- ECS service with circuit breaker, Spot strategy, auto-scaling
- ALB with HTTPS (TLS 1.3), path-based routing
- WAF with rate limiting and AWS managed rules
- Aurora Serverless v2 with environment-appropriate scaling
- Route53 private hosted zone record
- CloudWatch log group with 14-day retention
- Security groups with three-tier model (ALB -> ECS -> Aurora)
Once you have this as a set of Terraform modules, deploying a new service is just composing the modules with service-specific variables. The infrastructure is consistent, secure, and cost-optimized across all environments.
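As one example from the checklist above, the ECR lifecycle policy that keeps the last 10 images is a small resource on its own (assuming an `aws_ecr_repository.main` defined alongside it):

```hcl
resource "aws_ecr_lifecycle_policy" "main" {
  repository = aws_ecr_repository.main.name

  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep only the last 10 images"
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 10
      }
      action = {
        type = "expire"
      }
    }]
  })
}
```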