Building an AI-Powered Document Processing Pipeline on AWS#

Insurance companies process millions of documents every year: police reports, medical records, invoices, repair estimates. Traditionally, human operators read each document, classify it, extract the relevant fields, and enter the data into the claims system. This is slow, expensive, and error-prone.

In this post I’ll describe the architecture of a production document processing pipeline I helped build. The system ingests claim documents, extracts text using vision-based LLMs, clusters and classifies document sections, extracts structured data, and generates vector embeddings for semantic search. All of this runs on a fully serverless AWS architecture with no idle infrastructure costs.

The Problem Space#

A typical insurance claim arrives as a ZIP archive containing multiple documents: a police report, medical certificates, repair estimates, photos, correspondence. Each document needs to be:

  1. Parsed into machine-readable text (many are scanned PDFs or photos)
  2. Classified by type (police report, medical report, invoice, etc.)
  3. Structured by extracting specific fields per document type (e.g., license plate from a police report, diagnosis code from a medical report)
  4. Indexed for semantic search so case managers can query across all documents in natural language
  5. Summarized so the case manager gets an overview without reading every page

The system needs to handle multiple lines of business (motor, health, accident), support concurrent processing of hundreds of claims, and recover gracefully from LLM failures.

Architecture Overview#

The architecture follows an event-driven, fully serverless pattern:

ZIP Upload → S3 → EventBridge → Step Functions → [Lambda Pipeline] → DynamoDB + Aurora
                                                                          ↓
                                              Claims Service (ECS) ← ALB ← Users

The key components:

  • S3 Ingestion Bucket: receives ZIP uploads via presigned URLs
  • EventBridge Rule: triggers the processing pipeline when a file lands
  • Step Functions: orchestrates the multi-stage processing workflow
  • Lambda Functions: execute each processing stage (stateless, parallel)
  • DynamoDB: tracks processing state and stores extraction results
  • Aurora PostgreSQL: stores vector embeddings for RAG
  • ECS Service (Django): serves the web UI and API for case managers
  • Keycloak + Entra ID: authentication and group-based access control

The Ingestion Flow#

Documents enter the system through a secure upload flow:

  1. The external claims management system calls our API to register a claim (line of business, permissions, metadata)
  2. It requests a presigned S3 URL from a lightweight Lambda function
  3. It uploads the ZIP archive directly to S3 using the presigned URL
  4. An EventBridge rule detects the new object and triggers the Step Functions workflow

The presigned URL approach avoids routing large files through our API. The Lambda generates short-lived URLs (10-60 seconds), so the upload window is narrow enough to prevent misuse.

import os
import boto3

# Bucket name comes from the Lambda environment
INGESTION_BUCKET = os.environ["INGESTION_BUCKET"]

def handler(event, context):
    s3 = boto3.client("s3")

    claim_name = event["claim_name"]
    document_code = event["document_code"]
    key = f"landing/{claim_name}#{document_code}.zip"

    url = s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": INGESTION_BUCKET,
            "Key": key,
            "ContentType": "application/zip",
        },
        ExpiresIn=60,
    )

    return {"upload_url": url, "key": key}
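The EventBridge rule from step 4 matches S3 "Object Created" events for the landing prefix. The event pattern looks roughly like this (the bucket name is illustrative, and the bucket must have EventBridge notifications enabled):

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": { "name": ["claims-ingestion-bucket"] },
    "object": { "key": [{ "prefix": "landing/" }] }
  }
}
```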

Processing runs in periodic batches (every 10 minutes) rather than on each individual upload. This is a deliberate tradeoff: we accept a few minutes of latency in exchange for better throughput management and cost control when handling bursts of uploads.

The Step Functions Pipeline#

The core of the system is an AWS Step Functions state machine that orchestrates document processing through a linear phase followed by two parallel branches.
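In Amazon States Language terms, the shape looks roughly like this. This is a trimmed sketch: state names and ARNs are illustrative, not the production definition, and the Page Checker retry loop is omitted.

```json
{
  "StartAt": "Start",
  "States": {
    "Start": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:start",
      "Next": "ForEachDocument"
    },
    "ForEachDocument": {
      "Type": "Map",
      "ItemsPath": "$.documents",
      "ItemProcessor": {
        "StartAt": "Split",
        "States": {
          "Split": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:split",
            "Next": "ForEachPage"
          },
          "ForEachPage": {
            "Type": "Map",
            "ItemsPath": "$.pages",
            "ItemProcessor": {
              "StartAt": "Parse",
              "States": {
                "Parse": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:parse",
                  "End": true
                }
              }
            },
            "Next": "PageChecker"
          },
          "PageChecker": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:page-checker",
            "Next": "Branches"
          },
          "Branches": {
            "Type": "Parallel",
            "Branches": [
              {
                "StartAt": "Chunker",
                "States": {
                  "Chunker": { "Type": "Task", "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:chunker", "Next": "Embedder" },
                  "Embedder": { "Type": "Task", "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:embedder", "End": true }
                }
              },
              {
                "StartAt": "Clusterize",
                "States": {
                  "Clusterize": { "Type": "Task", "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:clusterize", "End": true }
                }
              }
            ],
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
```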

Linear Phase: Parse and Extract Text#

Stage 1: Start (claim-level)

The first Lambda validates the uploaded ZIP, extracts files, uploads individual documents to a support bucket, and creates tracking records in DynamoDB. It also deduplicates: if a document with the same name already exists in the claim and is being processed, it’s skipped.

Stage 2: Split (document-level)

Each document is converted to PDF (if it’s a DOCX), then split into individual pages saved as PNG images. We use an internal PDF splitting service for this, which handles edge cases like encrypted PDFs, malformed page trees, and oversized documents.

Stage 3: Parse (page-level)

This is where the AI kicks in. Each page image is sent to Claude 3.5 Sonnet (via Amazon Bedrock) for vision-based text extraction. The LLM reads the image and produces clean Markdown text, handling handwritten notes, stamps, tables, and mixed layouts that traditional OCR tools struggle with.

import json
import base64
import boto3

def parse_page(image_bytes: bytes, page_number: int) -> str:
    bedrock = boto3.client("bedrock-runtime")

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "messages": [{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(image_bytes).decode(),
                        },
                    },
                    {
                        "type": "text",
                        "text": "Extract all text from this document page. "
                                "Preserve the structure using Markdown formatting. "
                                "Include tables, headers, and any handwritten text.",
                    },
                ],
            }],
        }),
    )

    result = json.loads(response["body"].read())
    return result["content"][0]["text"]

Why Claude 3.5 Sonnet for OCR instead of Amazon Textract? Two reasons: (1) vision LLMs handle messy real-world documents (stamps, handwriting, mixed layouts) significantly better than traditional OCR, and (2) the output is already structured Markdown, which downstream stages can work with directly without an intermediate parsing step.

Stage 4: Page Checker (document-level)

LLM calls fail. Rate limits, timeouts, transient errors. The Page Checker implements retry logic: it collects results from all page parsing calls, identifies failures, and re-dispatches failed pages up to 3 times. This is essential when processing documents with dozens of pages: even a 1% per-page failure rate means a 50-page document hits at least one failed page roughly 40% of the time.

MAX_RETRIES = 3  # maximum parse attempts per page

def check_pages(document_id: str, page_results: list[dict]) -> dict:
    successful = [p for p in page_results if p["status"] == "success"]
    failed = [p for p in page_results if p["status"] == "error"]

    retryable = [
        p for p in failed
        if p.get("retry_count", 0) < MAX_RETRIES
    ]

    if retryable:
        # Re-dispatch failed pages for another attempt
        for page in retryable:
            page["retry_count"] = page.get("retry_count", 0) + 1
        return {"status": "retry", "pages": retryable}

    if not successful:
        return {"status": "error", "message": "All pages failed parsing"}

    # Proceed with whatever pages succeeded
    return {
        "status": "success",
        "parsed_pages": len(successful),
        "failed_pages": len(failed),
    }

Upper Branch: Embedding Pipeline#

After text extraction, the pipeline forks into two parallel branches. The upper branch creates vector embeddings for semantic search.

Chunker: Combines all page texts into a single document, then splits it into overlapping chunks with metadata (page numbers, positions). Each chunk is saved to S3.
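A minimal sketch of the overlapping-chunk logic (chunk sizes are illustrative, and the production chunker also records page numbers, not just character offsets):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping chunks, recording character offsets."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + chunk_size]
        if not piece:
            break
        chunks.append({"text": piece, "start": start, "end": start + len(piece)})
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the document
    return chunks
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, which noticeably improves retrieval quality.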

Embedder: Each chunk is embedded using OpenAI’s text-embedding-ada-002 model. We chose ada-002 for its good balance of quality and cost at scale. Embeddings are stored in Aurora PostgreSQL using the pgvector extension, enabling similarity search across all documents in a claim.

This powers a RAG (Retrieval-Augmented Generation) interface where case managers can ask questions like “What was the estimated repair cost?” and get answers grounded in the actual documents.
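Under the hood, retrieval is a pgvector nearest-neighbor query. A simplified version, with table and column names illustrative and parameters shown as named placeholders:

```sql
-- Top 5 chunks for a claim, ranked by cosine similarity to the query embedding
SELECT chunk_text,
       1 - (embedding <=> :query_embedding) AS similarity
FROM document_chunks
WHERE claim_name = :claim_name
ORDER BY embedding <=> :query_embedding
LIMIT 5;
```

The `<=>` operator is pgvector's cosine distance; filtering by claim first keeps the search scoped to the documents the case manager is allowed to see.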

Lower Branch: Classification and Extraction Pipeline#

The lower branch handles structured information extraction.

Clusterization: A single insurance document PDF often contains multiple logical sections (a police report followed by a medical certificate followed by repair photos). The clusterization stage identifies contiguous page ranges that belong to the same topic. We use Gemini 2.0 Flash for this because it’s fast, cheap, and performs well on the classification-style reasoning this step requires.

import json

def clusterize_document(pages: list[dict]) -> list[dict]:
    """Identify clusters of pages about the same topic."""

    # Build context from all pages
    page_summaries = "\n".join(
        f"Page {p['number']}: {p['text'][:200]}..."
        for p in pages
    )

    prompt = f"""Analyze these document pages and identify clusters
    of consecutive pages that discuss the same topic.

    Pages:
    {page_summaries}

    Return a JSON array of clusters, each with:
    - start_page: first page number
    - end_page: last page number
    - topic: brief description of what this section covers
    """

    # Call Gemini 2.0 Flash via the LLM Gateway
    response = llm_gateway.invoke(
        provider="vertex",
        model="gemini-2.0-flash",
        prompt=prompt,
    )

    return json.loads(response)

Classification: Each cluster is labeled with a document type (police report, medical certificate, invoice, repair estimate, etc.). The classification model uses the cluster’s text content and the claim’s line of business to assign the most appropriate label.

Extraction: This is where it gets domain-specific. Based on the classification label, the extraction stage applies a tailored extraction strategy. A police report gets license plate, driver name, date, and location extracted. A medical report gets diagnosis, treatment, and provider extracted. An invoice gets line items and totals extracted.

The results are stored as structured JSON in DynamoDB, making them queryable and displayable in the web UI.
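The dispatch from classification label to extraction strategy can be sketched as a simple registry (the field lists here are illustrative, not the production schemas):

```python
# Map each classification label to the fields its extraction prompt asks for.
EXTRACTION_SCHEMAS: dict[str, list[str]] = {
    "police_report": ["license_plate", "driver_name", "incident_date", "location"],
    "medical_report": ["diagnosis", "treatment", "provider"],
    "invoice": ["line_items", "total_amount", "currency"],
}

def build_extraction_prompt(label: str, cluster_text: str) -> str:
    """Build a per-document-type extraction prompt for the LLM."""
    fields = EXTRACTION_SCHEMAS.get(label)
    if fields is None:
        raise ValueError(f"No extraction schema for label: {label}")
    field_list = "\n".join(f"- {f}" for f in fields)
    return (
        f"Extract the following fields from this {label.replace('_', ' ')} "
        f"and return them as JSON (use null for missing values):\n"
        f"{field_list}\n\nDocument text:\n{cluster_text}"
    )
```

Adding a new document type then means adding a schema entry and a label to the classifier, without touching the pipeline itself.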

Cluster Aggregator: Collects results from all clusters, validates completeness, and updates the document’s processing status.

The Multi-Model LLM Strategy#

One of the most interesting architectural decisions was using three different LLM providers, each selected for a specific task:

| Task | Model | Why |
|---|---|---|
| Text extraction (OCR) | Claude 3.5 Sonnet (Bedrock) | Best vision capabilities for messy documents |
| Embeddings | text-embedding-ada-002 (OpenAI) | Cost-effective, high-quality embeddings at scale |
| Clusterization / Classification | Gemini 2.0 Flash (Vertex AI) | Fast and cheap for reasoning tasks |

All LLM calls go through an internal gateway service that abstracts the provider differences. The gateway handles authentication, rate limiting, usage tracking, and fallback logic. From the pipeline’s perspective, it’s just calling an API with a provider and model name.

This multi-model approach lets us optimize for cost and quality per task rather than being locked into a single provider. The text extraction stage is the most expensive (vision + large context), so we use the best model available. The clusterization stage processes much less data and needs speed more than depth, so we use a fast, cheap model.
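The gateway's fallback behavior can be sketched like this. The real gateway is an internal service; the interface below is hypothetical, with each provider reduced to a callable:

```python
from typing import Callable, Dict, Sequence

class LLMGatewayError(Exception):
    """Raised when the requested provider and all fallbacks fail."""

class LLMGateway:
    """Sketch of a provider-agnostic gateway with fallback (interface hypothetical)."""

    def __init__(self, providers: Dict[str, Callable[[str, str], str]]):
        # Each provider is a callable taking (model, prompt) and returning text.
        self.providers = providers

    def invoke(self, provider: str, model: str, prompt: str,
               fallbacks: Sequence[str] = ()) -> str:
        errors = []
        for name in (provider, *fallbacks):
            try:
                return self.providers[name](model, prompt)
            except Exception as exc:  # rate limits, timeouts, unknown provider
                errors.append(f"{name}: {exc}")
        raise LLMGatewayError("; ".join(errors))
```

From the pipeline's side this is what "just calling an API with a provider and model name" looks like: the stage never knows whether its call was served by the primary provider or a fallback.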

Concurrency and Error Handling#

The Step Functions state machine uses Map states to process documents and pages in parallel. A single claim might contain 20 documents, each with 50 pages, resulting in 1,000 parallel page-processing Lambda invocations.

Key patterns:

Idempotency: Every Lambda function is idempotent. If a Step Functions execution is retried (due to a transient error), re-processing the same input produces the same result without side effects. We use DynamoDB conditional writes to prevent duplicate processing.
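The conditional-write pattern can be sketched with an in-memory stand-in for DynamoDB (the real implementation uses `put_item` with `ConditionExpression='attribute_not_exists(pk)'`; the class names here are illustrative):

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""

class InMemoryTable:
    """In-memory stand-in for a DynamoDB table with a conditional put."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, key: str, item: dict):
        # Mirrors put_item(..., ConditionExpression="attribute_not_exists(pk)")
        if key in self.items:
            raise ConditionalCheckFailed(key)
        self.items[key] = item

def claim_for_processing(table, document_id: str) -> bool:
    """Atomically claim a document; False means another execution got there first."""
    try:
        table.put_if_absent(f"DOC#{document_id}", {"status": "PROCESSING"})
        return True
    except ConditionalCheckFailed:
        return False
```

Because the write and the existence check are a single atomic operation, two concurrent Step Functions retries can never both claim the same document.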

Graceful degradation: If the embedding branch fails for a document, the classification branch still completes (and vice versa). A document with failed embeddings can still have its structured data extracted. The system tracks partial success at every level.

Correlation tracking: Every request gets a correlation ID that flows through all Lambda invocations, S3 objects, and DynamoDB records. When something fails, you can trace the entire processing chain from upload to the specific failed step.
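A minimal sketch of threading the correlation ID into logging (the adapter class is illustrative; in our setup the ID also travels in event payloads, S3 object metadata, and DynamoDB records):

```python
import logging

class CorrelationAdapter(logging.LoggerAdapter):
    """Prefix every log message with the request's correlation ID."""

    def process(self, msg, kwargs):
        return f"[{self.extra['correlation_id']}] {msg}", kwargs

def get_logger(correlation_id: str) -> logging.LoggerAdapter:
    """Return a logger bound to one processing run's correlation ID."""
    return CorrelationAdapter(logging.getLogger("pipeline"),
                              {"correlation_id": correlation_id})
```

Each Lambda handler binds the ID once at the top, so every log line it emits can be joined back to the originating upload in CloudWatch.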

The Claims Service#

The web UI is a Django application running on ECS Fargate, integrated as a microfrontend into the larger analytics platform. Case managers can:

  • Browse claims and their documents
  • View extracted text alongside the original document images
  • See structured extraction results (fields, values, confidence)
  • Search across all documents using natural language (RAG)
  • Generate claim summaries on demand
  • Track document processing status in real time

Authentication uses Keycloak with JWT tokens. Authorization uses Entra ID groups: each claim is associated with a visibility group, and only members of that group can access the claim’s documents. A daily sync job keeps group membership current.

Performance and Cost#

Some numbers from production:

  • Processing time: a 30-page document takes approximately 3-5 minutes end-to-end (dominated by LLM call latency)
  • Throughput: the system handles 500+ documents per hour during peak periods
  • Cost per document: roughly EUR 0.15-0.30, depending on page count and complexity (the bulk of the cost is LLM inference)
  • Infrastructure cost when idle: near zero (serverless)

The batch processing approach (every 10 minutes) means we can process multiple documents from the same claim together, which is more efficient than processing each upload individually.

Lessons Learned#

Vision LLMs are production-ready for OCR. Claude 3.5 Sonnet handles real-world insurance documents (stamps, handwriting, poor scans, mixed languages) far better than traditional OCR. The quality improvement justified the higher per-page cost.

Step Functions is the right tool for document pipelines. The built-in retry logic, parallel Map states, error handling, and visual debugging make Step Functions ideal for multi-stage document processing. We tried orchestrating with SQS queues initially, but the complexity of tracking state across stages was not worth it.

Multi-model is the way forward. No single LLM is best at everything. Using Claude for vision, OpenAI for embeddings, and Gemini for fast classification gave us the best cost-quality tradeoff at each stage.

Retry logic is not optional. LLM APIs fail more often than traditional APIs. Rate limits, timeouts, model overload. The Page Checker retry pattern (up to 3 attempts per page) is what makes the pipeline reliable enough for production.

Batch over real-time when you can. Processing uploads every 10 minutes instead of immediately simplified the architecture significantly and reduced costs. For insurance claims processing, a few minutes of latency is perfectly acceptable.

What’s Next#

The natural evolution is closing the loop: having the AI agent not just extract and classify documents, but also suggest claim decisions based on the extracted data and historical patterns. This moves from “AI assists the human” to “AI proposes, human approves,” which is the next frontier for insurance automation.
