Managing Multi-Account AWS Infrastructure with Terraform Workspaces#

When you’re managing infrastructure across dozens of AWS accounts, you need patterns that scale. In this post I’ll share the approach I use to manage multi-account, multi-environment AWS infrastructure using Terraform workspaces, modular code, and a consistent tagging strategy.

The Problem#

Imagine this setup: you have multiple organizational scopes (teams, business units, projects), each with its own AWS accounts for non-production and production. On top of that, the non-production account hosts multiple environments (dev, integration, certification). Multiply this by several countries or regions, and you’re looking at a lot of infrastructure to manage.

The naive approach of copy-pasting Terraform code per environment quickly becomes unmaintainable. You need a strategy that lets you define infrastructure once and deploy it consistently across all environments.

Workspace-Based Environment Separation#

Terraform workspaces are the foundation of this approach. Each workspace maps to an environment tier:

# terraform.tf
terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {}
}

We use partial backend configuration with .hcl files per environment:

# backend-qual.hcl
bucket         = "my-scope-terraform-state-qual"
key            = "my-service/terraform.tfstate"
region         = "eu-central-1"
dynamodb_table = "terraform-lock"
encrypt        = true

# backend-prod.hcl
bucket         = "my-scope-terraform-state-prod"
key            = "my-service/terraform.tfstate"
region         = "eu-central-1"
dynamodb_table = "terraform-lock"
encrypt        = true

Initialize with the appropriate backend. When switching backends in the same working directory, add -reconfigure so Terraform doesn’t try to reuse the previous backend configuration:

terraform init -backend-config=backend-qual.hcl
terraform workspace select qual || terraform workspace new qual

terraform init -reconfigure -backend-config=backend-prod.hcl
terraform workspace select prod || terraform workspace new prod

Environment-Specific Variables with Lookup Maps#

Instead of separate .tfvars files, I use lookup maps keyed by terraform.workspace. This keeps everything in one place and makes differences between environments immediately visible:

# locals.tf
locals {
  environment = terraform.workspace

  # ECS configuration per environment
  ecs_cpu = lookup({
    qual = 512
    prod = 1024
  }, terraform.workspace, 512)

  ecs_memory = lookup({
    qual = 1024
    prod = 2048
  }, terraform.workspace, 1024)

  # Aurora Serverless v2 scaling
  aurora_min_acu = lookup({
    qual = 0.5
    prod = 1
  }, terraform.workspace, 0.5)

  aurora_max_acu = lookup({
    qual = 2
    prod = 4
  }, terraform.workspace, 2)

  # Fargate capacity provider strategy
  fargate_spot_weight = lookup({
    qual = 4
    prod = 1
  }, terraform.workspace, 4)

  fargate_ondemand_weight = lookup({
    qual = 1
    prod = 4
  }, terraform.workspace, 1)

  # Common tags applied to all resources
  common_tags = {
    Environment = local.environment
    Project     = var.project_name
    ManagedBy   = "terraform"
    Team        = "platform"
  }
}

This pattern makes it easy to see at a glance how environments differ. Non-production gets smaller instances and more Spot capacity; production gets larger instances and more On-Demand stability.
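Since every module already receives local.common_tags, an alternative worth considering is the AWS provider's default_tags block, which applies the tags to every resource the provider creates. A sketch (var.aws_region is an assumed variable, not part of the original code):

```hcl
# provider.tf — default_tags applies the common tags automatically,
# so per-resource tags are only needed for overrides.
provider "aws" {
  region = var.aws_region # assumed variable

  default_tags {
    tags = local.common_tags
  }
}
```

With this in place, forgetting a `tags` argument on a resource no longer means an untagged resource.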

Modular Infrastructure#

Each infrastructure concern lives in its own module:

terraform/
  modules/
    alb/
    aurora/
    cloudwatch/
    ecr/
    ecs/
    route53/
    security-groups/
    waf/
  main.tf
  locals.tf
  terraform.tf
  backend-qual.hcl
  backend-prod.hcl

The root module composes them:

# main.tf
module "ecr" {
  source       = "./modules/ecr"
  service_name = var.service_name
  tags         = local.common_tags
}

module "security_groups" {
  source   = "./modules/security-groups"
  vpc_id   = data.aws_vpc.main.id
  vpc_cidr = data.aws_vpc.main.cidr_block
  tags     = local.common_tags
}

module "alb" {
  source            = "./modules/alb"
  service_name      = var.service_name
  vpc_id            = data.aws_vpc.main.id
  subnet_ids        = data.aws_subnets.private.ids
  security_group_id = module.security_groups.alb_sg_id
  certificate_arn   = data.aws_acm_certificate.main.arn
  tags              = local.common_tags
}

module "ecs" {
  source              = "./modules/ecs"
  service_name        = var.service_name
  cpu                 = local.ecs_cpu
  memory              = local.ecs_memory
  image               = "${module.ecr.repository_url}:latest" # consider pinning a digest instead of :latest
  target_group_arn    = module.alb.target_group_arn
  security_group_id   = module.security_groups.ecs_sg_id
  subnet_ids          = data.aws_subnets.private.ids
  spot_weight         = local.fargate_spot_weight
  ondemand_weight     = local.fargate_ondemand_weight
  tags                = local.common_tags
}

module "aurora" {
  source            = "./modules/aurora"
  cluster_name      = "${var.service_name}-${local.environment}"
  engine_version    = "16.6"
  min_acu           = local.aurora_min_acu
  max_acu           = local.aurora_max_acu
  vpc_id            = data.aws_vpc.main.id
  subnet_ids        = data.aws_subnets.database.ids
  security_group_id = module.security_groups.aurora_sg_id
  tags              = local.common_tags
}
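To give a feel for the modules’ shape, here is roughly what modules/ecr might contain — a sketch under assumptions, not the actual module:

```hcl
# modules/ecr/main.tf — minimal sketch
resource "aws_ecr_repository" "main" {
  name = var.service_name

  # MUTABLE because the root module deploys the :latest tag;
  # immutable tags would be preferable with pinned versions.
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }

  tags = var.tags
}

# The root module consumes this as module.ecr.repository_url
output "repository_url" {
  value = aws_ecr_repository.main.repository_url
}
```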

The Three-Tier Security Group Pattern#

Every service follows the same layered security model:

# modules/security-groups/main.tf

resource "aws_security_group" "alb" {
  name_prefix = "${var.service_name}-alb-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = var.tags
}

resource "aws_security_group" "ecs" {
  name_prefix = "${var.service_name}-ecs-"
  vpc_id      = var.vpc_id

  ingress {
    description     = "Traffic from ALB"
    from_port       = var.container_port
    to_port         = var.container_port
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = var.tags
}

resource "aws_security_group" "aurora" {
  name_prefix = "${var.service_name}-aurora-"
  vpc_id      = var.vpc_id

  ingress {
    description     = "PostgreSQL from ECS"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs.id]
  }

  tags = var.tags
}

The key principle: each layer only accepts traffic from the layer above it. The ALB accepts HTTPS from the VPC, ECS accepts traffic only from the ALB security group, and Aurora accepts connections only from the ECS security group. No hardcoded CIDRs between tiers.
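With AWS provider 5.x, the same ALB-to-ECS edge can also be expressed with the standalone rule resources, which let you add or remove individual rules without replacing the whole group. A sketch of the equivalent ingress:

```hcl
# Equivalent to the inline ingress block on aws_security_group.ecs
resource "aws_vpc_security_group_ingress_rule" "ecs_from_alb" {
  security_group_id            = aws_security_group.ecs.id
  description                  = "Traffic from ALB"
  ip_protocol                  = "tcp"
  from_port                    = var.container_port
  to_port                      = var.container_port
  referenced_security_group_id = aws_security_group.alb.id
}
```

If you adopt this style, drop the corresponding inline ingress block — mixing inline and standalone rules on the same group leads to the two fighting over the same rule.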

IAM with Permissions Boundaries#

In an enterprise multi-account setup, you typically have a governance layer that constrains what each scope can do. Permissions boundaries are the mechanism:

resource "aws_iam_role" "ecs_task" {
  name = "${var.service_name}-ecs-task-${local.environment}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ecs-tasks.amazonaws.com"
      }
    }]
  })

  permissions_boundary = data.aws_iam_policy.scope_boundary.arn
  tags                 = local.common_tags
}

Every IAM role gets the scope’s permissions boundary attached. This ensures that even if a role policy is overly permissive, it can’t exceed what the organizational scope allows. The boundary is managed by a central governance team, not by individual project teams.
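The role above references data.aws_iam_policy.scope_boundary; a plausible definition looks like this (var.permissions_boundary_name is an assumed variable, set to whatever name the governance team publishes):

```hcl
# Look up the centrally managed boundary policy by name.
data "aws_iam_policy" "scope_boundary" {
  name = var.permissions_boundary_name # assumed variable
}
```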

WAF for Rate Limiting#

Every ALB gets a WAF with at least rate limiting and the AWS managed common rule set:

# modules/waf/main.tf

resource "aws_wafv2_web_acl" "main" {
  name  = "${var.service_name}-waf-${var.environment}"
  scope = "REGIONAL"

  default_action {
    allow {}
  }

  rule {
    name     = "rate-limit"
    priority = 1

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = var.rate_limit
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "${var.service_name}-rate-limit"
      sampled_requests_enabled   = true
    }
  }

  rule {
    name     = "aws-common-rules"
    priority = 2

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "${var.service_name}-common-rules"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "${var.service_name}-waf"
    sampled_requests_enabled   = true
  }

  tags = var.tags
}

resource "aws_wafv2_web_acl_association" "main" {
  resource_arn = var.alb_arn
  web_acl_arn  = aws_wafv2_web_acl.main.arn
}

CI/CD Integration#

The GitLab CI pipeline follows a promotion flow: a commit to master triggers a plan, while apply jobs only run on the release branch and are gated behind manual approval:

stages:
  - init
  - plan
  - apply

variables:
  TF_PLUGIN_CACHE_DIR: "$CI_PROJECT_DIR/.terraform-plugins"

cache:
  key: terraform-plugins
  paths:
    - .terraform-plugins/

.terraform_base:
  image: hashicorp/terraform:1.5
  before_script:
    - terraform init -backend-config=backend-${ENVIRONMENT}.hcl
    - terraform workspace select ${ENVIRONMENT} || terraform workspace new ${ENVIRONMENT}

plan:qual:
  extends: .terraform_base
  stage: plan
  variables:
    ENVIRONMENT: qual
  script:
    - terraform plan -out=plan.tfplan
  artifacts:
    paths:
      - plan.tfplan
  rules:
    # Must also run on release: apply:qual consumes this job's plan
    # artifact, and dependencies only resolve within the same pipeline.
    - if: $CI_COMMIT_BRANCH == "master"
    - if: $CI_COMMIT_BRANCH == "release"

apply:qual:
  extends: .terraform_base
  stage: apply
  variables:
    ENVIRONMENT: qual
  script:
    - terraform apply plan.tfplan
  dependencies:
    - plan:qual
  rules:
    - if: $CI_COMMIT_BRANCH == "release"
      when: manual

plan:prod:
  extends: .terraform_base
  stage: plan
  variables:
    ENVIRONMENT: prod
  script:
    - terraform plan -out=plan.tfplan
  artifacts:
    paths:
      - plan.tfplan
  rules:
    - if: $CI_COMMIT_BRANCH == "release"

apply:prod:
  extends: .terraform_base
  stage: apply
  variables:
    ENVIRONMENT: prod
  script:
    - terraform apply plan.tfplan
  dependencies:
    - plan:prod
  rules:
    - if: $CI_COMMIT_BRANCH == "release"
      when: manual

Caching the Terraform plugin directory significantly speeds up pipeline runs when you have large provider downloads.
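A cheap addition in the same pattern is an early validation job in the otherwise-unused init stage, so formatting and syntax errors fail fast before any plan runs. A sketch reusing the existing .terraform_base:

```yaml
validate:
  extends: .terraform_base
  stage: init
  variables:
    ENVIRONMENT: qual
  script:
    - terraform fmt -check -recursive
    - terraform validate
```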

Lessons Learned#

After managing this pattern across many scopes and projects, here’s what I’ve found works well:

  1. Workspaces over directories. Having separate directories per environment leads to drift. Workspaces with lookup maps keep a single source of truth.

  2. Modules with opinions. Each module should embed best practices (deployment circuit breakers, Container Insights, log retention policies) rather than exposing every knob. If 90% of services need the same config, make it the default.

  3. Tag everything. Consistent tagging across all resources is what makes cost allocation, compliance reporting, and automated cleanup possible at scale.

  4. Permissions boundaries are non-negotiable. In a multi-team enterprise, you need guardrails. Permissions boundaries let teams self-serve within safe limits.

  5. Plan before apply, always. Even in non-production. A Terraform plan that shows 47 resources being destroyed is a lot cheaper to review than to recover from.
