
Shipping a DORA Compliance Tool from Zero to Production in One Day

I spent a single session building a DORA-compliant Third Party Management tool from scratch, adding Microsoft Entra ID SSO, migrating the entire infrastructure from CloudFormation to Terraform, deploying to both qual and prod on ECS Fargate, and fixing a cascade of real-world deployment problems along the way. Here is an honest account of what broke and how I fixed it.

I built a DORA-compliant Third Party Management (TPM) tool: a Django + React application for managing ICT provider lifecycles, risk assessments, and regulatory registers. Then I wired up Microsoft Entra ID SSO, containerized it, and deployed it to AWS ECS Fargate using a Terraform pipeline, in a corporate environment full of proxies, permission boundaries, and shared infrastructure.

This post covers the full journey from the first commit to two healthy production tasks, including every failure and fix along the way.

What the Application Does

The TPM tool manages the full lifecycle of third-party ICT providers under the DORA regulation. The core workflow is a five-phase state machine per provider:

flowchart LR
    ID[Identification] --> DD[Due Diligence]
    DD --> AG[Agreement]
    AG --> MO[Monitoring]
    MO --> EX[Exit]

Each case tracks risk assessments, contractual arrangements, and SLA reports, and the app generates the regulatory RT registers (RT.01.01, RT.02.01, etc.) that banks must submit to supervisory authorities. The codebase comprises 15 Django apps, 30+ API endpoints, a dynamic questionnaire engine, contract clause templates, and a separate third-party portal.
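The five-phase lifecycle above can be sketched as a minimal transition map. This is a simplification for illustration; the real model names and transition rules are not shown in this post:

```python
# Hypothetical sketch of the DORA provider lifecycle as a linear state machine.
# Phase names follow the diagram; the actual Django model differs.
PHASES = ["identification", "due_diligence", "agreement", "monitoring", "exit"]
ALLOWED = {current: nxt for current, nxt in zip(PHASES, PHASES[1:])}


def advance(current: str) -> str:
    """Move a provider case to the next phase; skipping phases is forbidden."""
    if current not in ALLOWED:
        raise ValueError(f"'{current}' is a terminal or unknown phase")
    return ALLOWED[current]
```

In the real application each transition additionally gates on completed questionnaires and risk assessments before the case may advance.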

The stack:

  • Backend: Django 5.1 / DRF / PostgreSQL 16 / SimpleJWT
  • Frontend: React 18 / TypeScript / Vite / shadcn/ui
  • Auth: Microsoft Entra ID SSO (OAuth2 + PKCE)
  • Infra: ECS Fargate / Aurora Serverless v2 / ALB / Terraform

Adding SSO Without MSAL

The first task was adding Microsoft Entra ID SSO. The requirement was clear: no MSAL library, no server-side sessions. Pure OAuth2 Authorization Code flow with PKCE, implemented from scratch.

The flow looks like this:

sequenceDiagram
    participant Browser
    participant Backend
    participant Microsoft
    participant Graph

    Browser->>Backend: GET /api/auth/sso/config/
    Backend-->>Browser: {client_id, authority, redirect_uri}
    Browser->>Browser: Generate PKCE verifier + challenge
    Browser->>Microsoft: Redirect to /authorize?code_challenge=...
    Microsoft-->>Browser: Redirect to /auth/callback?code=...
    Browser->>Backend: POST /api/auth/sso/callback/ {code, code_verifier}
    Backend->>Microsoft: Exchange code for tokens
    Microsoft-->>Backend: {id_token, access_token}
    Backend->>Backend: Validate id_token via JWKS
    Backend->>Graph: GET /me/photo/$value
    Graph-->>Backend: JPEG bytes
    Backend->>Backend: Store photo as base64 data URI in DB
    Backend-->>Browser: {access, refresh} JWT tokens

On the backend, I validate the ID token by fetching Microsoft’s JWKS keys (cached for one hour) and verifying the RS256 signature with PyJWT:

backend/apps/accounts/sso.py
from typing import Any

import jwt
from django.conf import settings


def _validate_id_token(id_token: str) -> dict[str, Any]:
    header = jwt.get_unverified_header(id_token)
    keys = _get_jwks_keys()
    matching = [k for k in keys if k.get("kid") == header.get("kid")]
    if not matching:
        raise jwt.InvalidTokenError(f"No matching key for kid={header.get('kid')}")
    public_key = jwt.algorithms.RSAAlgorithm.from_jwk(matching[0])
    return jwt.decode(
        id_token,
        key=public_key,
        algorithms=["RS256"],
        audience=settings.ENTRA_CLIENT_ID,
        issuer=settings.ENTRA_ISSUER,
    )
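The _get_jwks_keys helper is not shown above. A minimal sketch of its one-hour cache, with the HTTP fetch injected as a callable so the caching logic stands on its own, might look like this (names and structure are assumptions, not the actual implementation):

```python
import time
from typing import Any, Callable

# Module-level cache: JWKS keys plus the monotonic timestamp of the last fetch.
_JWKS_CACHE: dict[str, Any] = {"keys": None, "fetched_at": 0.0}
JWKS_TTL_SECONDS = 3600  # cache Microsoft's signing keys for one hour


def get_jwks_keys(
    fetch: Callable[[], list[dict[str, Any]]],
    now: Callable[[], float] = time.monotonic,
) -> list[dict[str, Any]]:
    """Return cached JWKS keys, refetching once the TTL has expired."""
    expired = now() - _JWKS_CACHE["fetched_at"] > JWKS_TTL_SECONDS
    if _JWKS_CACHE["keys"] is None or expired:
        _JWKS_CACHE["keys"] = fetch()
        _JWKS_CACHE["fetched_at"] = now()
    return _JWKS_CACHE["keys"]
```

In production the injected fetch would be a GET against the tenant's https://login.microsoftonline.com/.../discovery/v2.0/keys endpoint.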

On the frontend, PKCE is implemented using the Web Crypto API with zero external dependencies:

frontend/src/lib/pkce.ts
function base64urlEncode(buffer: ArrayBuffer): string {
  // RFC 7636 requires base64url without padding
  const binary = String.fromCharCode(...new Uint8Array(buffer));
  return btoa(binary).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

export function generateCodeVerifier(): string {
  const array = new Uint8Array(64);
  crypto.getRandomValues(array);
  return base64urlEncode(array.buffer);
}

export async function generateCodeChallenge(verifier: string): Promise<string> {
  const data = new TextEncoder().encode(verifier);
  const digest = await crypto.subtle.digest('SHA-256', data);
  return base64urlEncode(digest);
}

One subtle bug I hit: React 18’s StrictMode runs effects twice in development. The OAuth callback page cleaned sessionStorage in the first run, so the second run found no PKCE verifier and flashed a “Sign-in failed” error before the navigation succeeded. The fix was a useRef guard:

frontend/src/pages/LoginCallbackPage.tsx
const startedRef = useRef(false);

useEffect(() => {
  if (startedRef.current) return;
  startedRef.current = true;
  // ... exchange code, navigate
}, []);
Tip

Profile photos from Microsoft Graph are fetched during the SSO callback using the access_token, then stored as a base64 data URI in a TextField on the User model. This avoids the need for persistent file storage (important on Fargate where containers are ephemeral).
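Converting the raw Graph response into that data URI is a one-liner. A sketch of the idea (the helper name is hypothetical, not the actual code):

```python
import base64


def photo_to_data_uri(jpeg_bytes: bytes, content_type: str = "image/jpeg") -> str:
    """Encode raw photo bytes from Microsoft Graph as a data URI,
    suitable for storage in a TextField and direct use in an <img src>."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:{content_type};base64,{encoded}"
```

The frontend then renders the stored string directly, with no media URL resolution involved.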

Migrating from CloudFormation to Terraform

The original infrastructure used CloudFormation. After a series of painful deployment failures with CloudFormation change sets (stack stuck in ROLLBACK_FAILED, Aurora subnet groups blocking deletion, ForceNewDeployment not valid on CREATE), I migrated everything to Terraform, modeled after an internal reference project.

The Terraform layout:

infrastructure/terraform/
  provider.tf      # AWS provider, S3 backend
  variables.tf     # Input variables
  locals.tf        # Naming conventions, env-specific subnets
  main.tf          # ECS, ALB, IAM, Secrets Manager
  aurora.tf        # Aurora Serverless v2
  outputs.tf       # URLs, cluster endpoints
  backends/
    qual.hcl
    prod.hcl

A key design choice: Aurora uses manage_master_user_password = true, so RDS handles credential rotation automatically and stores the JSON secret in Secrets Manager. ECS extracts the individual fields using the JSON key syntax:

infrastructure/terraform/main.tf
secrets = [
  {
    name      = "DB_USERNAME"
    valueFrom = "${aws_rds_cluster.aurora[0].master_user_secret[0].secret_arn}:username::"
  },
  {
    name      = "DB_PASSWORD"
    valueFrom = "${aws_rds_cluster.aurora[0].master_user_secret[0].secret_arn}:password::"
  }
]

The ECR registry was initially hardcoded to the qual account ID. This caused prod containers to fail pulling images with a 403. The fix was to use data.aws_caller_identity.current.account_id dynamically:

infrastructure/terraform/locals.tf
registry = "${data.aws_caller_identity.current.account_id}.dkr.ecr.${var.aws_default_region}.amazonaws.com"
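For completeness, that locals snippet relies on the standard caller-identity data source being declared somewhere in the module:

```hcl
# Resolves to whichever AWS account the current credentials belong to,
# so qual and prod each pull images from their own ECR registry.
data "aws_caller_identity" "current" {}
```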

The Pipeline: 15 Fixes to Get to Green

Getting the GitLab CI/CD pipeline to work against a corporate environment took more iterations than I expected. Here is the full list of failures and fixes, in order:

1. SonarQube jobs stuck pending: no runner with 'mgm' tag
   The base CI template requires a runner tagged mgm for SonarQube. Our runners were tagged mgm-qual and mgm-prod. Fix: added the mgm tag to the prod runner via the GitLab API.
2. Docker build failed: uv sync --no-editable error
   uv sync --frozen --no-dev --no-editable failed because hatchling couldn't find the project package. Fix: switched to --no-install-project (install only the dependencies, not the project itself).
3. CloudFormation CS_TYPE detection bug
   The plan script used CS_TYPE=$(aws cloudformation describe-stacks ... && echo UPDATE || echo CREATE). When the stack existed, the stack status leaked into the captured output, producing DELETE_COMPLETE\nUPDATE (two lines) and causing create-change-set --change-set-type DELETE_COMPLETE UPDATE to fail. Fix: parse the status string properly.
4. ECS service CREATE_FAILED: model validation
   CloudFormation's AWS::ECS::Service rejected Weight, DesiredCount, and other integer properties that arrived as strings. Also, ForceNewDeployment is not valid on an initial CREATE. Fix: removed ForceNewDeployment; the later migration to Terraform eliminated the YAML-to-JSON type coercion issues entirely.
5. Terraform providers failing: registry.terraform.io unreachable
   The runner sits behind a corporate proxy. The base CI template's before_script sets HTTP_PROXY, but our job-level before_script overrode it. Fix: explicitly export the proxy variables before terraform init.
6. apk add failed: Docker image uses Debian, not Alpine
   The runner image (minion-base-aws) is Debian-based, and the pipeline tried apk add docker-cli-buildx. Fix: removed the apk add line (the runner already ships the tools it needs).
7. Docker build: no space left on device
   The Docker daemon on the runner filled its 8 GB EBS volume, and the build context included terraform-modules/ cloned by the base template's before_script. Fix: added a .dockerignore, ran docker system prune -af before building, and expanded the runner's EBS volume to 50 GB via the gosp-cloud CLI.
8. npm install ERESOLVE error
   The Dockerfile deleted package-lock.json and ran npm install from scratch, hitting peer dependency conflicts. Fix: replaced it with npm ci, which installs exactly what the lockfile specifies.
9. apt-get fails inside Docker: can't reach deb.debian.org
   The build args HTTP_PROXY and HTTPS_PROXY were set, but apt does not always respect shell environment variables. Fix: explicitly write an apt proxy config file before apt-get update.
10. staticfiles permission error on ECS
    collectstatic failed with PermissionError: /app/backend/staticfiles/admin because the Dockerfile created the directory as root before switching to appuser. Fix: added RUN chown -R appuser:appuser /app after all copies.
11. ALB health check returns 400
    The ALB health checker sends the container's private IP as the Host header, which Django rejects via ALLOWED_HOSTS. Fix: nginx forwards Host: localhost (not $host) for the /api/health/ endpoint specifically.
12. Mixed content: avatar served over HTTP
    Django generated http:// URLs for media files because the ALB terminates TLS and forwards plain HTTP to nginx. Fix: added SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https') to the prod settings, and nginx now forwards the original X-Forwarded-Proto from the ALB.
13. Avatar 404 after deploy
    The DRF serializer treated the base64 data URI as a file path and prepended /media/. The model field had changed from ImageField to TextField, but the serializer still auto-detected it as a file. Fix: explicitly declare avatar = serializers.CharField(read_only=True, default=None) in the serializer.
14. ECR 403 on prod ECS
    The ECR registry URL was hardcoded to the qual AWS account ID, so prod ECS tasks pulled from the wrong account. Fix: use data.aws_caller_identity.current.account_id dynamically in locals.
15. seed_data AttributeError: TPMCase has no attribute title
    The notification generator called case.title, but the field is case.case_number. Fix: a four-line grep-and-replace in notification_generator.py.
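For the health-check fix in particular, the relevant nginx location is easy to picture. A sketch, assuming gunicorn listens on 127.0.0.1:8000 inside the task (the actual config is not shown in this post):

```nginx
location /api/health/ {
    # The ALB health checker sends the task's private IP as the Host header,
    # which Django's ALLOWED_HOSTS rejects; pin it to a value Django accepts.
    proxy_set_header Host localhost;
    proxy_pass http://127.0.0.1:8000;
}
```

All other locations keep forwarding the original Host and X-Forwarded-Proto so Django builds correct absolute URLs.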

Branch Protection and the Release Flow
#

Once everything worked, I set up branch protection to enforce a proper promotion flow:

flowchart LR
    main["main\n(Developers+)"] -->|MR| prerelease["pre-release\n(Maintainers)"]
    prerelease -->|MR| release["release\n(No direct push)"]
    main -->|CI trigger| qual[qual environment]
    release -->|CI trigger| prod[prod environment]

The release branch has push_access_level = 0 (no one can push directly). Every prod deploy requires a merge request through pre-release. The apply step in both pipelines is manual, giving the team a review gate before infrastructure changes land.

The Numbers

After the full session:

  • 5,758 lines added, 831 lines removed across 16 commits (just the infra/SSO work)
  • 526 backend tests passing
  • SonarQube: 0 open bugs or vulnerabilities (7 fixed, 158 test-password hotspots marked safe)
  • qual deployed: 2/2 ECS tasks running, database seeded
  • prod deployed: 2/2 ECS tasks running

The wall-clock time was just under 9 hours. Two hours of that was waiting for Aurora Serverless v2 to provision (about 7 minutes per stack, across several failed and retried deployments).

Lessons

A few things that would have saved significant time:

DORA compliance work is infrastructure-heavy. The regulatory register views, risk calculations, and lifecycle state machine are complex, but the deployment pipeline caused more total friction than the application logic.

Terraform beats CloudFormation for iterative development. The terraform plan output is clearer, the state management is more transparent, and the lifecycle { ignore_changes = [...] } escape hatch saved several hours around Aurora password management.

Corporate proxy + Docker = always check the Dockerfile. Every RUN that fetches anything (apt, pip, npm) needs proxy awareness. Writing $HTTP_PROXY as an ARG/ENV is not always sufficient. Explicitly configuring apt (/etc/apt/apt.conf.d/proxy.conf), pip (--proxy), and npm (.npmrc) is safer.
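For apt specifically, the config-file approach can be baked into the Dockerfile. A sketch under the assumption that the proxy URL arrives as a build arg (file name illustrative):

```dockerfile
ARG HTTP_PROXY
ARG HTTPS_PROXY
# apt ignores shell proxy variables in some configurations;
# a file in /etc/apt/apt.conf.d/ is always honored.
RUN printf 'Acquire::http::Proxy "%s";\nAcquire::https::Proxy "%s";\n' \
      "$HTTP_PROXY" "$HTTPS_PROXY" > /etc/apt/apt.conf.d/99proxy.conf \
    && apt-get update
```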

Fargate containers are stateless by design. Storing avatar photos as base64 data URIs in the database was the right call for this scale. For larger files or high traffic, an S3-backed media bucket is the proper solution, but it adds operational complexity.

ECS Exec is worth enabling from day one. Running manage.py seed_data required launching a one-off task with a complex command override. With ECS Exec enabled, it would have been a single aws ecs execute-command call.
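With enable_execute_command = true on the aws_ecs_service resource (and SSM permissions on the task role), the one-off seed collapses to something like this; cluster, task, and container names here are hypothetical:

```shell
aws ecs execute-command \
  --cluster tpm-prod \
  --task <task-id> \
  --container backend \
  --interactive \
  --command "python manage.py seed_data"
```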


If you’re building DORA compliance tooling or setting up ECS Fargate with Terraform in a corporate environment, reach out at manuel.fedele+website@gmail.com.
