I built a DORA-compliant Third Party Management (TPM) tool: a Django + React application for managing ICT provider lifecycles, risk assessments, and regulatory registers. Then I wired up Microsoft Entra ID SSO, containerized it, and deployed it to AWS ECS Fargate using a Terraform pipeline, in a corporate environment full of proxies, permission boundaries, and shared infrastructure.
This post covers the full journey from the first commit to two healthy production tasks, including every failure and fix along the way.
## What the Application Does
The TPM tool manages the full lifecycle of third-party ICT providers under the DORA regulation. The core workflow is a five-phase state machine per provider:
```mermaid
flowchart LR
  ID[Identification] --> DD[Due Diligence]
  DD --> AG[Agreement]
  AG --> MO[Monitoring]
  MO --> EX[Exit]
```
Each case tracks risk assessments, contractual arrangements, SLA reports, and generates the regulatory RT registers (RT.01.01, RT.02.01, etc.) that banks must submit to supervisory authorities. The app has 15 Django apps, 30+ API endpoints, a dynamic questionnaire engine, contract clause templates, and a separate third-party portal.
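The five-phase lifecycle boils down to a linear transition table. A minimal sketch (illustrative only; the real Django model enforces transitions with additional guard conditions and phase names of its own):

```python
# Phases follow the diagram above: each case may only advance to
# the next phase in the sequence, never skip or move backwards.
PHASES = ["identification", "due_diligence", "agreement", "monitoring", "exit"]

# Map each phase to its single allowed successor.
ALLOWED = {a: b for a, b in zip(PHASES, PHASES[1:])}

def advance(current: str) -> str:
    """Return the next phase, or raise if the case is already in Exit."""
    if current not in ALLOWED:
        raise ValueError(f"No transition allowed from {current!r}")
    return ALLOWED[current]
```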
The stack:
- Backend: Django 5.1 / DRF / PostgreSQL 16 / SimpleJWT
- Frontend: React 18 / TypeScript / Vite / shadcn/ui
- Auth: Microsoft Entra ID SSO (OAuth2 + PKCE)
- Infra: ECS Fargate / Aurora Serverless v2 / ALB / Terraform
## Adding SSO Without MSAL
The first task was adding Microsoft Entra ID SSO. The requirement was clear: no MSAL library, no server-side sessions. Pure OAuth2 Authorization Code flow with PKCE, implemented from scratch.
The flow looks like this:
```mermaid
sequenceDiagram
  participant Browser
  participant Backend
  participant Microsoft
  participant Graph
  Browser->>Backend: GET /api/auth/sso/config/
  Backend-->>Browser: {client_id, authority, redirect_uri}
  Browser->>Browser: Generate PKCE verifier + challenge
  Browser->>Microsoft: Redirect to /authorize?code_challenge=...
  Microsoft-->>Browser: Redirect to /auth/callback?code=...
  Browser->>Backend: POST /api/auth/sso/callback/ {code, code_verifier}
  Backend->>Microsoft: Exchange code for tokens
  Microsoft-->>Backend: {id_token, access_token}
  Backend->>Backend: Validate id_token via JWKS
  Backend->>Graph: GET /me/photo/$value
  Graph-->>Backend: JPEG bytes
  Backend->>Backend: Store photo as base64 data URI in DB
  Backend-->>Browser: {access, refresh} JWT tokens
```
On the backend, I validate the ID token by fetching Microsoft’s JWKS keys (cached for one hour) and verifying the RS256 signature with PyJWT:
```python
from typing import Any

import jwt
from django.conf import settings


def _validate_id_token(id_token: str) -> dict[str, Any]:
    header = jwt.get_unverified_header(id_token)
    keys = _get_jwks_keys()  # fetches Microsoft's JWKS, cached for one hour
    matching = [k for k in keys if k.get("kid") == header.get("kid")]
    if not matching:
        raise jwt.InvalidTokenError(f"No matching key for kid={header.get('kid')}")
    public_key = jwt.algorithms.RSAAlgorithm.from_jwk(matching[0])
    return jwt.decode(
        id_token,
        key=public_key,
        algorithms=["RS256"],
        audience=settings.ENTRA_CLIENT_ID,
        issuer=settings.ENTRA_ISSUER,
    )
```

On the frontend, PKCE is implemented using the Web Crypto API with zero external dependencies:
```typescript
// Base64url encoding (RFC 4648 §5): standard base64 with '+' -> '-',
// '/' -> '_', and the padding stripped, as the PKCE spec requires.
function base64urlEncode(buffer: ArrayBuffer): string {
  const bytes = String.fromCharCode(...new Uint8Array(buffer));
  return btoa(bytes).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

export function generateCodeVerifier(): string {
  const array = new Uint8Array(64);
  crypto.getRandomValues(array);
  return base64urlEncode(array.buffer);
}

export async function generateCodeChallenge(verifier: string): Promise<string> {
  const data = new TextEncoder().encode(verifier);
  const digest = await crypto.subtle.digest('SHA-256', data);
  return base64urlEncode(digest);
}
```

One subtle bug I hit: React 18's StrictMode runs effects twice in development. The OAuth callback page cleared sessionStorage on the first run, so the second run found no PKCE verifier and flashed a "Sign-in failed" error before the navigation succeeded. The fix was a useRef guard:
```typescript
const startedRef = useRef(false);

useEffect(() => {
  if (startedRef.current) return;
  startedRef.current = true;
  // ... exchange code, navigate
}, []);
```

Profile photos from Microsoft Graph are fetched during the SSO callback using the access_token, then stored as a base64 data URI in a TextField on the User model. This avoids the need for persistent file storage, which matters on Fargate, where containers are ephemeral.
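The bytes-to-data-URI step is small. A sketch under the assumption that the photo came back from the documented `GET /me/photo/$value` Graph endpoint (the helper name is mine, not from the codebase):

```python
import base64

def photo_to_data_uri(jpeg_bytes: bytes) -> str:
    """Convert raw JPEG bytes (e.g. from Graph's /me/photo/$value)
    into a data URI storable in a TextField and usable directly
    as an <img src=...> value on the frontend."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"
```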
## Migrating from CloudFormation to Terraform
The original infrastructure used CloudFormation. After a series of painful deployment failures with CloudFormation change sets (stack stuck in ROLLBACK_FAILED, Aurora subnet groups blocking deletion, ForceNewDeployment not valid on CREATE), I migrated everything to Terraform, modeled after an internal reference project.
The Terraform layout:
```
infrastructure/terraform/
  provider.tf     # AWS provider, S3 backend
  variables.tf    # Input variables
  locals.tf       # Naming conventions, env-specific subnets
  main.tf         # ECS, ALB, IAM, Secrets Manager
  aurora.tf       # Aurora Serverless v2
  outputs.tf      # URLs, cluster endpoints
  backends/
    qual.hcl
    prod.hcl
```

A key design choice: Aurora uses manage_master_user_password = true, so RDS handles credential rotation automatically and stores the JSON secret in Secrets Manager. ECS extracts the individual fields using the JSON key syntax:
```hcl
secrets = [
  {
    name      = "DB_USERNAME"
    valueFrom = "${aws_rds_cluster.aurora[0].master_user_secret[0].secret_arn}:username::"
  },
  {
    name      = "DB_PASSWORD"
    valueFrom = "${aws_rds_cluster.aurora[0].master_user_secret[0].secret_arn}:password::"
  }
]
```

The ECR registry was initially hardcoded to the qual account ID, which caused prod containers to fail image pulls with a 403. The fix was to use data.aws_caller_identity.current.account_id dynamically:
```hcl
registry = "${data.aws_caller_identity.current.account_id}.dkr.ecr.${var.aws_default_region}.amazonaws.com"
```

## The Pipeline: 15 Fixes to Get to Green
Getting the GitLab CI/CD pipeline to work against a corporate environment took more iterations than I expected. Here is the full list of failures and fixes, in order:
1. **SonarQube jobs stuck pending: no runner with `mgm` tag.** The job template requires a runner tagged `mgm` for SonarQube, but our runners were tagged `mgm-qual` and `mgm-prod`. Fix: added the `mgm` tag to the prod runner via the GitLab API.
2. **Docker build failed: `uv sync --no-editable` error.** `uv sync --frozen --no-dev --no-editable` failed because hatchling couldn't find the project package. Fix: switched to `--no-install-project` (install only dependencies, not the project itself).
3. **CloudFormation `CS_TYPE` detection bug.** The change-set type was detected with `CS_TYPE=$(aws describe-stacks ... && echo UPDATE || echo CREATE)`. When the stack existed, this produced `DELETE_COMPLETE\nUPDATE` (two lines), causing `create-change-set --change-set-type DELETE_COMPLETE UPDATE` to fail. Fix: properly parse the status string.
4. **ECS service CREATE_FAILED: model validation.** `AWS::ECS::Service` rejected integer values for `Weight`, `DesiredCount`, etc. as strings. Also, `ForceNewDeployment` is not valid on initial CREATE. Fix: removed `ForceNewDeployment`; the migration to Terraform also eliminated these YAML-to-JSON type coercion issues.
5. **Terraform providers failing: registry.terraform.io unreachable.** The before_script from the base CI template sets `HTTP_PROXY`, but our job-level before_script overrides it. Fix: explicitly export the proxy vars before `terraform init`.
6. **`apk add` failed: Docker image uses Debian, not Alpine.** The runner image (`minion-base-aws`) is Debian-based, but our pipeline tried `apk add docker-cli-buildx`, which failed. Fix: removed the `apk add` line (the runner already has the tools it needs).
7. **Docker build: no space left on device.** The build context included `terraform-modules/`, cloned by the base template's before_script. Fix: added a `.dockerignore`, ran `docker system prune -af` before building, and expanded the runner EBS volume to 50GB via the gosp-cloud CLI.
8. **`npm install` ERESOLVE error.** The job ran `npm install` from scratch instead of using the committed `package-lock.json`, which hit peer dependency conflicts. Fix: replaced it with `npm ci` (which uses the lockfile).
9. **`apt-get` fails inside Docker: can't reach deb.debian.org.** `HTTP_PROXY` and `HTTPS_PROXY` were set, but apt doesn't always respect shell environment variables. Fix: explicitly write an apt proxy config file before `apt-get update`.
10. **staticfiles permission error on ECS.** `collectstatic` failed with `PermissionError: /app/backend/staticfiles/admin`. The Dockerfile created the directory as root before switching to `appuser`. Fix: added `RUN chown -R appuser:appuser /app` after all copies.
11. **ALB health check returns 400.** The health checker sends the target IP as the `Host` header, which Django rejects via `ALLOWED_HOSTS`. Fix: nginx forwards `Host: localhost` (not `$host`) for the `/api/health/` endpoint specifically.
12. **Mixed content: avatar served over HTTP.** Django generated `http://` URLs for media files because the ALB terminates TLS and forwards plain HTTP to nginx. Fix: added `SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')` in prod settings, and nginx now forwards the original `X-Forwarded-Proto` from the ALB.
13. **Avatar 404 after deploy.** Avatar URLs still pointed at `/media/`. The `ImageField` was changed to a `TextField` in the model, but the serializer auto-detected it differently. Fix: explicitly declare `avatar = serializers.CharField(read_only=True, default=None)` in the serializer.
14. **ECR 403 on prod ECS.** The registry URL was hardcoded to the qual account ID, as noted above. Fix: use `data.aws_caller_identity.current.account_id` dynamically in locals.
15. **seed_data AttributeError: TPMCase has no attribute `title`.** The seeder referenced `case.title`, but the field is `case.case_number`. Fix: a four-line grep-and-replace in `notification_generator.py`.

## Branch Protection and the Release Flow
Once everything worked, I set up branch protection to enforce a proper promotion flow:
```mermaid
flowchart LR
  main["main\n(Developers+)"] -->|MR| prerelease["pre-release\n(Maintainers)"]
  prerelease -->|MR| release["release\n(No direct push)"]
  main -->|CI trigger| qual[qual environment]
  release -->|CI trigger| prod[prod environment]
```
The release branch has push_access_level = 0 (no one can push directly). Every prod deploy requires a merge request through pre-release. The apply step in both pipelines is manual, giving the team a review gate before infrastructure changes land.
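Protection can be applied through GitLab's Protected Branches API. A sketch using only the standard library; the host and project ID are placeholders, and the access-level codes (0 = no access, 40 = Maintainer) come from the GitLab API docs:

```python
import json
import urllib.request

GITLAB = "https://gitlab.example.com/api/v4"  # placeholder host
PROJECT_ID = 42                               # hypothetical project ID

def protect_branch_payload(branch: str) -> dict:
    """Protected-branch settings: merges by Maintainers, no direct pushes."""
    return {
        "name": branch,
        "push_access_level": 0,    # 0 = no one may push directly
        "merge_access_level": 40,  # 40 = Maintainer role
    }

def protect_branch(branch: str, token: str) -> None:
    # POST /projects/:id/protected_branches
    req = urllib.request.Request(
        f"{GITLAB}/projects/{PROJECT_ID}/protected_branches",
        data=json.dumps(protect_branch_payload(branch)).encode(),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on HTTP errors
```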
## The Numbers
After the full session:
- 5,758 lines added, 831 lines removed across 16 commits (just the infra/SSO work)
- 526 backend tests passing
- SonarQube: 0 open bugs or vulnerabilities (7 fixed, 158 test-password hotspots marked safe)
- qual deployed: 2/2 ECS tasks running, database seeded
- prod deployed: 2/2 ECS tasks running
The wall-clock time was just under 9 hours. Two hours of that was waiting for Aurora Serverless v2 to provision (about 7 minutes per stack, across several failed and retried deployments).
## Lessons
A few things that would have saved significant time:
**DORA compliance work is infrastructure-heavy.** The regulatory register views, risk calculations, and lifecycle state machine are complex, but the deployment pipeline caused more total friction than the application logic.
**Terraform beats CloudFormation for iterative development.** The `terraform plan` output is clearer, the state management is more transparent, and the `lifecycle { ignore_changes = [...] }` escape hatch saved several hours around Aurora password management.
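The shape of that escape hatch, roughly (illustrative fragment; which attributes belong in the list depends on your setup, so check the `aws_rds_cluster` docs before copying):

```hcl
resource "aws_rds_cluster" "aurora" {
  # ...
  manage_master_user_password = true

  lifecycle {
    # Ignore attributes that RDS manages out of band, so
    # `terraform plan` stays clean after automatic rotation.
    ignore_changes = [master_user_secret_kms_key_id]
  }
}
```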
**Corporate proxy + Docker = always check the Dockerfile.** Every `RUN` that fetches anything (apt, pip, npm) needs proxy awareness. Passing `$HTTP_PROXY` as an ARG/ENV is not always sufficient; explicitly configuring apt (`/etc/apt/apt.conf.d/proxy.conf`), pip (`--proxy`), and npm (`.npmrc`) is safer.
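What that looks like in practice, as an illustrative Dockerfile excerpt (not the project's actual Dockerfile):

```dockerfile
ARG HTTP_PROXY
ARG HTTPS_PROXY

# apt reads its own config file, not the shell environment
RUN printf 'Acquire::http::Proxy "%s";\nAcquire::https::Proxy "%s";\n' \
      "$HTTP_PROXY" "$HTTPS_PROXY" > /etc/apt/apt.conf.d/proxy.conf \
 && apt-get update

# pip and npm take the proxy explicitly
RUN pip install --proxy "$HTTP_PROXY" -r requirements.txt
RUN npm config set proxy "$HTTP_PROXY" \
 && npm config set https-proxy "$HTTPS_PROXY" \
 && npm ci
```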
**Fargate containers are stateless by design.** Storing avatar photos as base64 data URIs in the database was the right call at this scale. For larger files or higher traffic, an S3-backed media bucket is the proper solution, but it adds operational complexity.
**ECS Exec is worth enabling from day one.** Running `manage.py seed_data` required launching a one-off task with a complex command override. With ECS Exec enabled, it would have been a single `aws ecs execute-command` call.
If you’re building DORA compliance tooling or setting up ECS Fargate with Terraform in a corporate environment, reach out at manuel.fedele+website@gmail.com.