I run a platform engineering team. Every new delivery project starts the same way: someone copies an existing repo, spends a day ripping out the old application code, renaming things, updating Terraform locals, misconfiguring IAM permissions, forgetting to seed secrets, and finally getting a container that starts. Then they spend another day on CI/CD.
I wanted a template that generates all of this automatically. One command, answer a few questions, get a deployable project. The template needed to support Fargate and Lambda, optional ALB exposure, optional Aurora PostgreSQL, optional S3 buckets, EntraID SSO, and auto-discovery of AWS networking resources from the CLI profile.
I built the whole thing with Claude Code in one extended session. Here is how it actually went.
## How I Used Claude Code
The session lasted several hours and covered roughly 30 commits across two repositories. The interaction pattern was not “write me a template.” It was iterative, production-driven, and often adversarial. I would deploy, hit an error, paste it back, and expect a fix.
### The Investigation Phase
I started by pointing Claude Code at two existing delivery repositories in our GitLab instance. I asked it to analyze the patterns: CI/CD structure, Docker build approach, Terraform organization, authentication flow, secret management.
Claude Code fetched every relevant file via the GitLab API, analyzed them in parallel using subagents, and produced a structured summary of patterns across both repos. This took about two minutes. Doing it manually would have taken an hour of reading.
```text
# Paraphrasing the actual prompt:
"Analyze these two reference repos. Fetch .gitlab-ci.yml, Dockerfile,
Makefile, terraform files, start.sh, CLAUDE.md. Extract patterns for
CI/CD, Docker, auth, secrets, infrastructure."
```

The AI launched two parallel research agents, each fetching 10+ files via the API, and returned a consolidated analysis. This set the foundation for everything that followed.
### The Build-Deploy-Break-Fix Loop
The most effective pattern was not planning. It was deploying to a real AWS account and letting reality tell us what was wrong.
```mermaid
sequenceDiagram
    participant Me
    participant Claude
    participant AWS
    Me->>Claude: generate project from template
    Claude-->>Me: project created
    Me->>AWS: make tf-apply
    AWS-->>Me: error: duplicate tag keys
    Me->>Claude: paste error
    Claude-->>Me: fix: provider default_tags overlap with resource tags
    Me->>AWS: make tf-apply
    AWS-->>Me: error: IAM CreateRole AccessDenied
    Me->>Claude: paste error, check existing role names
    Claude->>AWS: aws iam list-roles --profile ...
    Claude-->>Me: fix: IAM roles need uppercase scope prefix
    Me->>AWS: make tf-apply
    AWS-->>Me: deployed, but ECS task fails
    Me->>Claude: check logs
    Claude->>AWS: aws logs tail ...
    Claude-->>Me: secret has no value, seeding placeholder
    Me->>AWS: tasks running
```
This loop happened about 15 times. Each cycle took 2-5 minutes. Claude Code would read the error, understand the context (because it had written the Terraform), propose and apply the fix, then I would redeploy.
### What I Directed vs What Claude Decided
I gave high-level directions. Claude made implementation decisions.
I said: “Fargate SPOT should be the default.”
Claude decided: set `FARGATE_SPOT` weight=100 base=0, with `FARGATE` weight=1 base=0 as fallback. Added a `deployment_circuit_breaker` with rollback, `stopTimeout=120` to match the SPOT 2-minute reclamation window, and `deregistration_delay=30` for fast draining.
I said: “We should always ask if we want to expose it with a standard URL.”
Claude decided: created a modular `alb.tf.jinja` with count-based resources (ALB, target group, Route53 record, ACM certificate lookup, HTTP-to-HTTPS redirect), only included when `expose_url=true`, and wired `CORS_ORIGIN` from the Route53 URL into the Fargate task environment.
I said: “Is uvicorn reading how many CPUs are available?”
Claude explained the `os.cpu_count()` problem on Fargate (it returns the host’s CPUs, not the task’s allocated units), then fixed it by passing `FARGATE_CPU` as an environment variable and calculating workers from that. I did not need to specify the fix.
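The fix can be sketched in a few lines of startup code. `FARGATE_CPU` comes from the session described above; the function name and the 2×vCPU+1 worker formula are illustrative assumptions, not necessarily what the template ships:

```python
import os

def uvicorn_workers() -> int:
    """Derive the worker count from the CPU the task was actually given.

    On Fargate, os.cpu_count() reports the host's CPUs, not the task's
    allocation, so we prefer FARGATE_CPU (CPU units, 1024 = 1 vCPU)
    injected via the task definition. The 2*vcpus + 1 formula is a common
    heuristic, used here for illustration.
    """
    cpu_units = os.environ.get("FARGATE_CPU")
    if cpu_units:
        vcpus = max(1, int(cpu_units) // 1024)
    else:
        vcpus = os.cpu_count() or 1  # fallback when running outside Fargate
    return 2 * vcpus + 1
```

A 2048-unit (2 vCPU) task would get 5 workers; a 512-unit task rounds up to 1 vCPU and gets 3.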
### The Migration Moments
The session included two significant architectural pivots:
**Cookiecutter to Copier.** I asked “can we avoid the generate.sh wrapper if we switch to copier?” Claude analyzed what `generate.sh` did, mapped each responsibility to copier’s native features (typed questions, conditional prompts, post-tasks), and concluded yes, the wrapper could be eliminated. The migration touched 59 files but took about 15 minutes.

**Concrete repo to template.** Davis started as a hello-world Fargate app, then became a cookiecutter template in a separate repo, then I said “Davis IS the template.” Claude moved everything in one commit, cleaned up references, and the concrete app became the copier template repo.
## What Claude Code Actually Did
### AWS Resource Discovery
The post-generation script discovers networking resources from the AWS profile. Claude wrote the discovery logic that:
- Calls `aws sts get-caller-identity` for the account ID
- Lists VPCs and presents a menu if multiple exist (no hardcoded VPC naming assumptions)
- Finds subnets by Name tag pattern (`*PRIVATE_1`, `*NAT_1`)
- Finds security groups by name pattern (`*-default`, `*-private`)
- Patches the discovered values into `locals.tf` and `.env`
```python
import json

vpcs = json.loads(vpcs_json)  # vpcs_json holds the AWS CLI's VPC listing
if len(vpcs) == 1:
    vpc = vpcs[0]
else:
    print("Multiple VPCs found:")
    for i, v in enumerate(vpcs):
        print(f"  {i+1}) {v.get('Name', 'unnamed')} ({v['VpcId']})")
    choice = input("Select VPC [1]: ").strip() or "1"
    vpc = vpcs[int(choice) - 1]
```

### Modular Terraform
The template generates only the Terraform files you need:
```text
infrastructure/terraform/
  main.tf       # ECR, IAM, CloudWatch (always)
  lambda.tf     # Lambda function
  s3.tf         # S3 buckets
  locals.tf     # Discovered AWS resources
  secrets.tf    # Only if Aurora selected
  # fargate.tf  -- not generated
  # alb.tf      -- not generated
  # aurora.tf   -- not generated
```

The post-generation script removes files based on the user’s choices. Terraform never sees resources it should not manage.
### Security Assessment and Fixes
Late in the session, I asked Claude to perform a basic security assessment against the deployed app. It ran 20 tests:
```shell
# CORS from evil origin
curl -sI -H "Origin: https://evil.com" https://app.example.com/api/v1/hello

# JWT alg:none attack
curl -s -H "Authorization: Bearer eyJhbGciOiJub25l..." https://app.example.com/api/v1/me

# Swagger docs exposure
curl -s -o /dev/null -w "%{http_code}" https://app.example.com/api/docs
```

It found three issues and fixed all of them:
| Finding | Fix |
|---|---|
| CORS `*` with credentials | Restricted to same-origin, configurable via `CORS_ORIGIN` env var |
| Swagger/OpenAPI docs exposed | Disabled in non-local environments (`DEPLOY_ENVIRONMENT != local`) |
| Nginx version in headers | Added `server_tokens off` |
After redeployment, it re-ran the same tests to verify the fixes.
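The docs fix is worth a sketch, because the mechanism is FastAPI's real one: passing `None` for `docs_url` and `openapi_url` disables those routes entirely. The helper name and paths here are illustrative:

```python
import os

def docs_settings(environment: str) -> dict:
    """FastAPI constructor kwargs that hide interactive docs outside local
    development. Helper name and URL paths are illustrative, not the
    template's exact code."""
    if environment == "local":
        return {"docs_url": "/api/docs", "openapi_url": "/api/openapi.json"}
    # None disables the docs and OpenAPI routes entirely in FastAPI.
    return {"docs_url": None, "openapi_url": None}

# Usage sketch:
# app = FastAPI(**docs_settings(os.environ.get("DEPLOY_ENVIRONMENT", "local")))
```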
## What Worked Well
The most effective pattern was deploying to real infrastructure immediately and using error messages as prompts. Claude Code’s context window retained the full Terraform it had written, so it could diagnose errors instantly.
- **Parallel research with subagents**
- **Error-driven iteration**
- **Scope derivation from profile name**: `gapmgm-italy-qual` splits into `scope=gapmgm`, `country=italy`, `env=qual`. One input, three values. It even handled compound environments like `dev-ca`.
- **Security testing in the same session**
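The scope derivation above can be sketched in a few lines; this is my reconstruction of the behavior described, not the template's actual parser:

```python
def parse_profile(profile: str) -> dict:
    """Split an AWS profile name like "gapmgm-italy-qual" into scope,
    country, and environment. Everything after the second hyphen becomes
    the environment, which handles compound values like "dev-ca"."""
    scope, country, *env_parts = profile.split("-")
    return {"scope": scope, "country": country, "env": "-".join(env_parts)}
```

One input, three values: `parse_profile("gapmgm-italy-qual")` yields `scope=gapmgm`, `country=italy`, `env=qual`.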
## What Didn’t Work / What I Had to Correct
Claude Code needed correction most often on organizational decisions, not technical ones. It would implement things correctly but in the wrong place, or with assumptions that did not match the team’s workflow.
**Template as separate repo (wrong).** Claude initially kept the template as a separate artifact under `blueprints/`. I had to tell it: “Davis IS the template.” The concrete example and the template should be the same thing. This was a product decision the AI could not have known.

**Hardcoded 'qual' everywhere.** The generated code hardcoded `qual` in variable names (`aws_profile_qual`), file names (`qual.yml`), and defaults. I had to push back multiple times: “qual is extracted from the profile, no hardcoded references.” The AI kept defaulting to the reference repo’s conventions.

**Vite build-time env vars for runtime config.** The frontend read `VITE_ENTRA_CLIENT_ID`, which gets baked into the JavaScript bundle at build time. The Docker build had no access to these values. Claude’s fix (a `/api/v1/config` endpoint returning public config at runtime) was good, but it should not have used build-time env vars in the first place.

**Unnecessary secret for PKCE flow.** The template seeded an `entra_client_secret` in Secrets Manager. I had to ask: “do we actually need this?” The answer was no. MSAL with PKCE is a public client flow. The backend validates ID tokens against Microsoft’s public JWKS keys. No secret needed. Removing it eliminated the secret-seeding problem entirely.

**ECS not redeploying with new images.**
Pushing a new image changed nothing Terraform could see in the task definition (same `:latest` tag, same env vars), so terraform saw no diff. The fix was injecting `BUILD_ID = null_resource.build[0].id` into the container environment, forcing a new task definition revision on every apply.

## The Result
In one session, the template went from zero to production-tested across two AWS accounts. The final state:
- One command generates a fully configured project: `copier copy --trust git+ssh://...davis.git ./my-app`
- Modular infrastructure: Fargate or Lambda, optional ALB/Route53/SSL, optional Aurora, optional S3, optional EntraID SSO
- AWS auto-discovery: VPC, subnets, security groups, account ID, ECR registry. All from the profile name.
- Branch-per-account delivery: push to the `gapmgm-italy-qual` branch deploys to that account; push to `gapmgm-italy-prod` deploys to prod. Same pipeline, different targets.
- Security-tested: CORS restricted, docs hidden in non-local, nginx version suppressed, JWT validation verified against fake tokens and alg:none attacks
- Auto-scaling: target-tracking on CPU, min 1 task, max configurable (default 5), 15-second scale-out cooldown
The session produced roughly 60 files across the template, covering Terraform, Python (FastAPI), TypeScript (React), Docker, nginx, GitLab CI, and operational scripts.
Based on past experience, building a template like this manually takes about two weeks:
- Week 1: Analyze reference repos, design template structure, write Terraform modules, write application scaffolding, set up CI/CD
- Week 2: Test across accounts, fix IAM permissions, fix networking, fix secret management, security review, documentation
And it would still need iteration after the first team tries to use it.
One extended session. Real deployments to real accounts. 30+ fix iterations. Security testing included. Documentation generated. Template tested with multiple configurations (Fargate full-stack, Lambda minimal, Fargate backend-only).
The key was not speed of code generation. It was speed of iteration. Deploy, break, fix, deploy. Each cycle took minutes instead of hours because the AI retained full context of everything it had built.
## Key Takeaway
The most valuable thing about using Claude Code for infrastructure work is not that it writes Terraform. It is that it can hold the entire system in context, across Terraform, Docker, Python, TypeScript, nginx, and CI/CD, and diagnose cross-cutting issues instantly.
When an ECS task failed because a Secrets Manager secret had no value, Claude did not just see the error. It understood that Terraform creates the secret resource but not the secret version, that the task definition references the secret ARN, and that the fix was adding an `aws_secretsmanager_secret_version` with a placeholder and `lifecycle { ignore_changes }`. That chain of reasoning across three files and two AWS services is where the real value is.
The pattern that works: give high-level direction, deploy to real infrastructure early, use errors as prompts, and let the AI iterate. Do not try to get it perfect in planning. Get it deployed and let production tell you what is wrong.
If you want to go deeper on any of this, I offer 1:1 coaching sessions for engineers working on AI integration, cloud architecture, and platform engineering. Book a session (50 EUR / 60 min) or reach out at manuel.fedele+website@gmail.com.