
I Built a Production Infrastructure Template With Claude Code in One Session

In a single session, I used Claude Code to build, deploy, break, fix, security-test, and iterate on a production-grade infrastructure template that generates fully configured AWS Fargate and Lambda projects. This is not a story about AI writing boilerplate. It is about using AI as an engineering partner across 30+ iterations of real infrastructure hitting real AWS accounts.

I run a platform engineering team. Every new delivery project starts the same way: someone copies an existing repo, spends a day ripping out the old application code, renaming things, updating Terraform locals, misconfiguring IAM permissions, forgetting to seed secrets, and finally getting a container that starts. Then they spend another day on CI/CD.

I wanted a template that generates all of this automatically. One command, answer a few questions, get a deployable project. The template needed to support Fargate and Lambda, optional ALB exposure, optional Aurora PostgreSQL, optional S3 buckets, EntraID SSO, and auto-discovery of AWS networking resources from the CLI profile.

I built the whole thing with Claude Code in one extended session. Here is how it actually went.

How I Used Claude Code

The session lasted several hours and covered roughly 30 commits across two repositories. The interaction pattern was not “write me a template.” It was iterative, production-driven, and often adversarial. I would deploy, hit an error, paste it back, and expect a fix.

The Investigation Phase

I started by pointing Claude Code at two existing delivery repositories in our GitLab instance. I asked it to analyze the patterns: CI/CD structure, Docker build approach, Terraform organization, authentication flow, secret management.

Claude Code fetched every relevant file via the GitLab API, analyzed them in parallel using subagents, and produced a structured summary of patterns across both repos. This took about two minutes. Doing it manually would have taken an hour of reading.

What I asked
# Paraphrasing the actual prompt:
"Analyze these two reference repos. Fetch .gitlab-ci.yml, Dockerfile,
Makefile, terraform files, start.sh, CLAUDE.md. Extract patterns for
CI/CD, Docker, auth, secrets, infrastructure."

The AI launched two parallel research agents, each fetching 10+ files via API, and returned a consolidated analysis. This set the foundation for everything that followed.

The Build-Deploy-Break-Fix Loop

The most effective pattern was not planning. It was deploying to a real AWS account and letting reality tell us what was wrong.

sequenceDiagram
    participant Me
    participant Claude
    participant AWS
    Me->>Claude: generate project from template
    Claude-->>Me: project created
    Me->>AWS: make tf-apply
    AWS-->>Me: error: duplicate tag keys
    Me->>Claude: paste error
    Claude-->>Me: fix: provider default_tags overlap with resource tags
    Me->>AWS: make tf-apply
    AWS-->>Me: error: IAM CreateRole AccessDenied
    Me->>Claude: paste error, check existing role names
    Claude->>AWS: aws iam list-roles --profile ...
    Claude-->>Me: fix: IAM roles need uppercase scope prefix
    Me->>AWS: make tf-apply
    AWS-->>Me: deployed, but ECS task fails
    Me->>Claude: check logs
    Claude->>AWS: aws logs tail ...
    Claude-->>Me: secret has no value, seeding placeholder
    Me->>AWS: tasks running

This loop happened about 15 times. Each cycle took 2-5 minutes. Claude Code would read the error, understand the context (because it had written the Terraform), propose and apply the fix, then I would redeploy.

What I Directed vs What Claude Decided

I gave high-level directions. Claude made implementation decisions.

I said: “Fargate SPOT should be the default.” Claude decided: Set FARGATE_SPOT weight=100 base=0, FARGATE weight=1 base=0 as fallback. Added deployment_circuit_breaker with rollback, stopTimeout=120 to match the SPOT 2-minute reclamation window, deregistration_delay=30 for fast draining.
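A hedged sketch of what those decisions look like in the generated Terraform (resource names like aws_ecs_service.app are illustrative, not the template's actual identifiers):

```terraform
resource "aws_ecs_service" "app" {
  # cluster, task_definition, desired_count, network config elided

  # Prefer SPOT; on-demand FARGATE is only a fallback when SPOT
  # capacity is reclaimed
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 100
    base              = 0
  }
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 0
  }

  # Roll back automatically if a deployment cannot stabilize
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

# The other two settings live elsewhere: stopTimeout = 120 goes in the
# container definition (matching the 2-minute SPOT reclamation warning),
# and deregistration_delay = 30 goes on the aws_lb_target_group.
```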

I said: “We should always ask if we want to expose it with a standard URL.” Claude decided: Created a modular alb.tf.jinja with count-based resources (ALB, target group, Route53 record, ACM certificate lookup, HTTP-to-HTTPS redirect), only included when expose_url=true. Wired CORS_ORIGIN from the Route53 URL into the Fargate task environment.

I said: “Is uvicorn reading how many CPUs are available?” Claude explained the os.cpu_count() problem on Fargate (returns host CPUs, not allocated units), then fixed it by passing FARGATE_CPU as an environment variable and calculating workers from that. I did not need to specify the fix.
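A minimal sketch of that worker calculation, assuming the FARGATE_CPU convention described above (CPU units, where 1024 = 1 vCPU) and a common workers-per-vCPU heuristic:

```python
import os

def uvicorn_workers() -> int:
    """Derive the worker count from the CPU units Fargate actually allocated.

    os.cpu_count() on Fargate reports the host's CPUs, not the task's
    allocation, so read FARGATE_CPU (set in the task definition) instead.
    """
    cpu_units = int(os.environ.get("FARGATE_CPU", "1024"))  # 1024 units = 1 vCPU
    vcpus = max(1, cpu_units // 1024)
    # Common heuristic for async workers: 2 per vCPU, plus one
    return 2 * vcpus + 1
```

The start script can then pass the result to uvicorn via --workers instead of letting it infer concurrency from the host.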

The Migration Moments

The session included two significant architectural pivots:

  1. Cookiecutter to Copier. I asked “can we avoid the generate.sh wrapper if we switch to copier?” Claude analyzed what generate.sh did, mapped each responsibility to copier’s native features (typed questions, conditional prompts, post-tasks), and concluded yes, the wrapper could be eliminated. The migration touched 59 files but took about 15 minutes.

  2. Concrete repo to template. Davis started as a hello-world Fargate app, then became a cookiecutter template in a separate repo, then I said “Davis IS the template.” Claude moved everything in one commit, cleaned up references, and the concrete app became the copier template repo.

What Claude Code Actually Did

AWS Resource Discovery

The post-generation script discovers networking resources from the AWS profile. Claude wrote the discovery logic that:

  • Calls aws sts get-caller-identity for account ID
  • Lists VPCs and presents a menu if multiple exist (no hardcoded VPC naming assumptions)
  • Finds subnets by Name tag pattern (*PRIVATE_1, *NAT_1)
  • Finds security groups by name pattern (*-default, *-private)
  • Patches the discovered values into locals.tf and .env
VPC discovery (from post-generate.py)
import json
import subprocess

# Fetch VPCs with their Name tag via the AWS CLI (query shape assumed;
# the actual script may differ)
vpcs_json = subprocess.check_output([
    "aws", "ec2", "describe-vpcs", "--output", "json",
    "--query", "Vpcs[].{VpcId: VpcId, Name: Tags[?Key=='Name']|[0].Value}",
])
vpcs = json.loads(vpcs_json)
if len(vpcs) == 1:
    vpc = vpcs[0]
else:
    print("Multiple VPCs found:")
    for i, v in enumerate(vpcs):
        print(f"  {i+1}) {v.get('Name') or 'unnamed'} ({v['VpcId']})")
    choice = input("Select VPC [1]: ").strip() or "1"
    vpc = vpcs[int(choice) - 1]

Modular Terraform

The template generates only the Terraform files you need:

Lambda + S3, no ALB, no Aurora
infrastructure/terraform/
  main.tf        # ECR, IAM, CloudWatch (always)
  lambda.tf      # Lambda function
  s3.tf          # S3 buckets
  locals.tf      # Discovered AWS resources
  secrets.tf     # Only if Aurora selected
  # fargate.tf   -- not generated
  # alb.tf       -- not generated
  # aurora.tf    -- not generated

The post-generation script removes files based on the user’s choices. Terraform never sees resources it should not manage.
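A sketch of that pruning step, assuming hypothetical answer keys like use_fargate and expose_url (the template's actual variable names may differ):

```python
from pathlib import Path

# Map copier answers to the Terraform files they control (illustrative names)
OPTIONAL_FILES = {
    "use_fargate": ["fargate.tf"],
    "expose_url": ["alb.tf"],
    "use_aurora": ["aurora.tf", "secrets.tf"],
    "use_s3": ["s3.tf"],
}

def prune_terraform(tf_dir: Path, answers: dict) -> list[str]:
    """Delete the Terraform files for features the user did not select."""
    removed = []
    for flag, files in OPTIONAL_FILES.items():
        if answers.get(flag):
            continue  # feature enabled, keep its files
        for name in files:
            path = tf_dir / name
            if path.exists():
                path.unlink()
                removed.append(name)
    return removed
```

Deleting the files outright, rather than guarding resources with count everywhere, is what keeps Terraform from ever seeing resources it should not manage.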

Security Assessment and Fixes

Late in the session, I asked Claude to perform a basic security assessment against the deployed app. It ran 20 tests:

VAPT test examples
# CORS from evil origin
curl -sI -H "Origin: https://evil.com" https://app.example.com/api/v1/hello

# JWT alg:none attack
curl -s -H "Authorization: Bearer eyJhbGciOiJub25l..." https://app.example.com/api/v1/me

# Swagger docs exposure
curl -s -o /dev/null -w "%{http_code}" https://app.example.com/api/docs

It found three issues and fixed all of them:

| Finding | Fix |
| --- | --- |
| CORS * with credentials | Restricted to same-origin, configurable via CORS_ORIGIN env var |
| Swagger/OpenAPI docs exposed | Disabled in non-local environments (DEPLOY_ENVIRONMENT != local) |
| Nginx version in headers | Added server_tokens off |

After redeployment, it re-ran the same tests to verify the fixes.
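A sketch of how the first two fixes can be derived from the environment; the function name and the idea of passing the result into the FastAPI constructor and CORS middleware are illustrative, not the template's exact code:

```python
import os

def app_security_settings() -> dict:
    """Derive docs exposure and CORS origins from deployment env vars."""
    env = os.environ.get("DEPLOY_ENVIRONMENT", "local")
    return {
        # Hide Swagger/OpenAPI outside local development
        "docs_url": "/api/docs" if env == "local" else None,
        "openapi_url": "/api/openapi.json" if env == "local" else None,
        # Restrict CORS to the configured origin instead of "*"
        "allow_origins": [os.environ["CORS_ORIGIN"]] if "CORS_ORIGIN" in os.environ else [],
    }
```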

What Worked Well

Tip

The most effective pattern was deploying to real infrastructure immediately and using error messages as prompts. Claude Code’s context window retained the full Terraform it had written, so it could diagnose errors instantly.

Parallel research with subagents
Launching two research agents to analyze reference repos simultaneously cut the investigation time from an hour to two minutes. Each agent fetched and analyzed 10+ files independently.
Error-driven iteration
Pasting AWS error messages directly into the conversation was the fastest feedback loop. Claude had written the Terraform, so it understood the context without re-reading files. Fix, commit, push, deploy, repeat.
Scope derivation from profile name
Instead of asking users for scope, environment, and AWS account separately, Claude derived everything from the AWS profile name (gapmgm-italy-qual splits into scope=gapmgm, country=italy, env=qual). One input, three values. It even handled compound environments like dev-ca.
Security testing in the same session
Having the same AI that wrote the code also attack it meant it knew exactly where to look. It tested its own JWT validation, its own CORS config, its own nginx setup. And it knew how to fix what it found.
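The scope derivation above can be sketched in a few lines, with the splitting rule assumed from the examples (scope-country-env, where everything after the second hyphen is the environment, so compound environments like dev-ca survive intact):

```python
def parse_profile(profile: str) -> dict:
    """Split an AWS profile name into scope, country, and environment."""
    # maxsplit=2 keeps compound environments like "dev-ca" in one piece
    scope, country, env = profile.split("-", 2)
    return {"scope": scope, "country": country, "env": env}
```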

What Didn’t Work / What I Had to Correct

Note

Claude Code needed correction most often on organizational decisions, not technical ones. It would implement things correctly but in the wrong place, or with assumptions that did not match the team’s workflow.

Template as separate repo (wrong)
Claude created the template as a separate repo under blueprints/. I had to tell it: “Davis IS the template.” The concrete example and the template should be the same thing. This was a product decision the AI could not have known.
Hardcoded 'qual' everywhere
The first version hardcoded qual in variable names (aws_profile_qual), file names (qual.yml), and defaults. I had to push back multiple times: “qual is extracted from the profile, no hardcoded references.” The AI kept defaulting to the reference repo’s conventions.
Vite build-time env vars for runtime config
The EntraID client ID was set via VITE_ENTRA_CLIENT_ID, which gets baked into the JavaScript bundle at build time. The Docker build had no access to these values. Claude’s fix (a /api/v1/config endpoint returning public config at runtime) was good, but it should not have used build-time env vars in the first place.
Unnecessary secret for PKCE flow
Claude included an entra_client_secret in Secrets Manager. I had to ask: “do we actually need this?” The answer was no. MSAL with PKCE is a public client flow. The backend validates ID tokens against Microsoft’s public JWKS keys. No secret needed. Removing it eliminated the secret-seeding problem entirely.
ECS not redeploying with new images
The most persistent issue. Docker images were built and pushed, but ECS kept running the old version. The root cause: task definition did not change (same :latest tag, same env vars), so terraform saw no diff. The fix was injecting BUILD_ID = null_resource.build[0].id into the container environment, forcing a new task definition revision on every apply.
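A hedged sketch of the BUILD_ID fix, as a fragment of the container definition (names illustrative):

```terraform
# Inside the jsonencode(...) container definition of the task definition:
{
  name  = var.app_name
  image = "${aws_ecr_repository.app.repository_url}:latest"
  environment = [
    # null_resource.build gets a new id whenever the image is rebuilt,
    # so this value changes, the task definition gets a new revision,
    # and ECS rolls the service even though the :latest tag is unchanged.
    { name = "BUILD_ID", value = null_resource.build[0].id },
  ]
}
```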

The Result

In one session, the template went from zero to production-tested across two AWS accounts. The final state:

  • One command generates a fully configured project: copier copy --trust git+ssh://...davis.git ./my-app
  • Modular infrastructure: Fargate or Lambda, optional ALB/Route53/SSL, optional Aurora, optional S3, optional EntraID SSO
  • AWS auto-discovery: VPC, subnets, security groups, account ID, ECR registry. All from the profile name.
  • Branch-per-account delivery: push to gapmgm-italy-qual branch deploys to that account. Push to gapmgm-italy-prod deploys to prod. Same pipeline, different targets.
  • Security-tested: CORS restricted, docs hidden in non-local, nginx version suppressed, JWT validation verified against fake tokens and alg:none attacks
  • Auto-scaling: target-tracking on CPU, min 1 task, max configurable (default 5), 15-second scale-out cooldown
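The auto-scaling bullet maps fairly directly onto the AWS provider's Application Auto Scaling resources. A hedged sketch, with cluster/service names and the CPU target value (70%) as assumptions:

```terraform
resource "aws_appautoscaling_target" "ecs" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.this.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 1
  max_capacity       = 5 # configurable default
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 70 # assumed threshold
    scale_out_cooldown = 15
    scale_in_cooldown  = 300
  }
}
```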

The session produced roughly 60 files across the template, covering Terraform, Python (FastAPI), TypeScript (React), Docker, nginx, GitLab CI, and operational scripts.

Based on past experience, building a template like this manually takes about two weeks:

  • Week 1: Analyze reference repos, design template structure, write Terraform modules, write application scaffolding, set up CI/CD
  • Week 2: Test across accounts, fix IAM permissions, fix networking, fix secret management, security review, documentation

And it would still need iteration after the first team tries to use it.

One extended session. Real deployments to real accounts. 30+ fix iterations. Security testing included. Documentation generated. Template tested with multiple configurations (Fargate full-stack, Lambda minimal, Fargate backend-only).

The key was not speed of code generation. It was speed of iteration. Deploy, break, fix, deploy. Each cycle took minutes instead of hours because the AI retained full context of everything it had built.

Key Takeaway

The most valuable thing about using Claude Code for infrastructure work is not that it writes Terraform. It is that it can hold the entire system in context, across Terraform, Docker, Python, TypeScript, nginx, and CI/CD, and diagnose cross-cutting issues instantly.

When an ECS task failed because a Secrets Manager secret had no value, Claude did not just see the error. It understood that Terraform creates the secret resource but not the secret version, that the task definition references the secret ARN, and that the fix was adding an aws_secretsmanager_secret_version with a placeholder and lifecycle { ignore_changes }. That chain of reasoning across three files and two AWS services is where the real value is.
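A sketch of that fix in Terraform (resource and secret names illustrative):

```terraform
resource "aws_secretsmanager_secret" "db" {
  name = "my-app/db-credentials" # illustrative name
}

# Terraform creates the secret *container* above, but the ECS task fails
# at startup if the secret has no value yet. Seed a placeholder version
# once, then let humans or rotation own the real value.
resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = jsonencode({ password = "CHANGE_ME" })

  lifecycle {
    ignore_changes = [secret_string] # don't clobber the real value on apply
  }
}
```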

The pattern that works: give high-level direction, deploy to real infrastructure early, use errors as prompts, and let the AI iterate. Do not try to get it perfect in planning. Get it deployed and let production tell you what is wrong.


If you want to go deeper on any of this, I offer 1:1 coaching sessions for engineers working on AI integration, cloud architecture, and platform engineering. Book a session (50 EUR / 60 min) or reach out at manuel.fedele+website@gmail.com.
