Organizations process millions of documents every year: reports, contracts, invoices, correspondence. Traditionally, human operators read each document, classify it, extract the relevant fields, and enter the data into a management system. This is slow, expensive, and error-prone.
In this post I’ll describe the architecture of a production document processing pipeline I helped build. The system ingests documents, extracts text using vision-based LLMs, clusters and classifies document sections, extracts structured data, and generates vector embeddings for semantic search. All of this runs on a fully serverless AWS architecture with no idle infrastructure costs.
The Problem Space#
A typical submission arrives as a ZIP archive containing multiple documents: reports, certificates, invoices, photos, correspondence. Each document needs to be:
- Parsed into machine-readable text (many are scanned PDFs or photos)
- Classified by type (report, certificate, invoice, etc.)
- Structured by extracting specific fields per document type (e.g., reference number and date from a report, line items and totals from an invoice)
- Indexed for semantic search so operators can query across all documents in natural language
- Summarized so the operator gets an overview without reading every page
The system needs to handle multiple document categories, support concurrent processing of hundreds of batches, and recover gracefully from LLM failures.
Architecture Overview#
The architecture follows an event-driven, fully serverless pattern:
```
ZIP Upload → S3 → EventBridge → Step Functions → [Lambda Pipeline] → DynamoDB + Aurora
                                                                           ↓
                                       Processing Service (ECS) ← ALB ← Users
```

The key components:
- S3 Ingestion Bucket: receives ZIP uploads via presigned URLs
- EventBridge Rule: triggers the processing pipeline when a file lands
- Step Functions: orchestrates the multi-stage processing workflow
- Lambda Functions: execute each processing stage (stateless, parallel)
- DynamoDB: tracks processing state and stores extraction results
- Aurora PostgreSQL: stores vector embeddings for RAG
- ECS Service (Django): serves the web UI and API for operators
- Keycloak + Entra ID: authentication and group-based access control
The Ingestion Flow#
Documents enter the system through a secure upload flow:
- The external system calls our API to register a batch (category, permissions, metadata)
- It requests a presigned S3 URL from a lightweight Lambda function
- It uploads the ZIP archive directly to S3 using the presigned URL
- An EventBridge rule detects the new object and triggers the Step Functions workflow
The presigned URL approach avoids routing large files through our API. The Lambda generates short-lived URLs (10-60 seconds), so the upload window is narrow enough to prevent misuse.
```python
# processor.py
import os

import boto3

INGESTION_BUCKET = os.environ["INGESTION_BUCKET"]


def handler(event, context):
    """Generate a short-lived presigned URL for a direct-to-S3 ZIP upload."""
    s3 = boto3.client("s3")
    batch_name = event["batch_name"]
    document_code = event["document_code"]
    key = f"landing/{batch_name}#{document_code}.zip"
    url = s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": INGESTION_BUCKET,
            "Key": key,
            "ContentType": "application/zip",
        },
        ExpiresIn=60,
    )
    return {"upload_url": url, "key": key}
```

Processing runs in periodic batches (every 10 minutes) rather than on each individual upload. This is a deliberate tradeoff: we accept a few minutes of latency in exchange for better throughput management and cost control when handling bursts of uploads.
The Step Functions Pipeline#
The core of the system is an AWS Step Functions state machine that orchestrates document processing through a linear phase followed by two parallel branches.
Linear Phase: Parse and Extract Text#
Stage 1: Start (batch-level)
The first Lambda validates the uploaded ZIP, extracts files, uploads individual documents to a support bucket, and creates tracking records in DynamoDB. It also deduplicates: if a document with the same name already exists in the batch and is being processed, it’s skipped.
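The ZIP handling in this stage can be sketched as follows. This is a minimal illustration, not the production code: the `ALLOWED_EXTENSIONS` allow-list and the `extract_documents` helper are assumptions about how such a validator might look.

```python
import io
import zipfile

# Illustrative allow-list; the real pipeline accepts whatever the splitter supports
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".png", ".jpg"}


def extract_documents(zip_bytes: bytes) -> list[tuple[str, bytes]]:
    """Validate a batch ZIP and return (filename, content) pairs for upload."""
    documents = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # A corrupt member is a hard failure for the whole batch
        if archive.testzip() is not None:
            raise ValueError("Corrupt ZIP archive")
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = info.filename
            suffix = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
            if suffix not in ALLOWED_EXTENSIONS:
                continue  # skip unsupported file types
            documents.append((name, archive.read(info)))
    return documents
```

Each returned document is then uploaded to the support bucket and gets its own tracking record in DynamoDB.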
Stage 2: Split (document-level)
Each document is converted to PDF (if it’s a DOCX), then split into individual pages saved as PNG images. We use an internal PDF splitting service for this, which handles edge cases like encrypted PDFs, malformed page trees, and oversized documents.
Stage 3: Parse (page-level)
This is where the AI kicks in. Each page image is sent to Claude 3.5 Sonnet (via Amazon Bedrock) for vision-based text extraction. The LLM reads the image and produces clean Markdown text, handling handwritten notes, stamps, tables, and mixed layouts that traditional OCR tools struggle with.
```python
import base64
import json

import boto3


def parse_page(image_bytes: bytes, page_number: int) -> str:
    """Extract the text of one page image as Markdown via Claude on Bedrock."""
    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "messages": [{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(image_bytes).decode(),
                        },
                    },
                    {
                        "type": "text",
                        "text": "Extract all text from this document page. "
                                "Preserve the structure using Markdown formatting. "
                                "Include tables, headers, and any handwritten text.",
                    },
                ],
            }],
        }),
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
```

Why Claude 3.5 Sonnet for OCR instead of Amazon Textract? Two reasons: (1) vision LLMs handle messy real-world documents (stamps, handwriting, mixed layouts) significantly better than traditional OCR, and (2) the output is already structured Markdown, which downstream stages can work with directly without an intermediate parsing step.
Stage 4: Page Checker (document-level)
LLM calls fail. Rate limits, timeouts, transient errors. The Page Checker implements retry logic: it collects results from all page parsing calls, identifies failures, and re-dispatches failed pages up to 3 times. This is essential when processing documents with dozens of pages, where even a 1% failure rate means most documents would have at least one failed page.
```python
MAX_RETRIES = 3


def check_pages(document_id: str, page_results: list[dict]) -> dict:
    """Collect page-level parse results and re-dispatch retryable failures."""
    successful = [p for p in page_results if p["status"] == "success"]
    failed = [p for p in page_results if p["status"] == "error"]
    retryable = [
        p for p in failed
        if p.get("retry_count", 0) < MAX_RETRIES
    ]
    if retryable:
        # Re-dispatch failed pages for another attempt
        for page in retryable:
            page["retry_count"] = page.get("retry_count", 0) + 1
        return {"status": "retry", "pages": retryable}
    if not successful:
        return {"status": "error", "message": "All pages failed parsing"}
    # Proceed with whatever pages succeeded
    return {
        "status": "success",
        "parsed_pages": len(successful),
        "failed_pages": len(failed),
    }
```

Upper Branch: Embedding Pipeline#
After text extraction, the pipeline forks into two parallel branches. The upper branch creates vector embeddings for semantic search.
Chunker: Combines all page texts into a single document, then splits it into overlapping chunks with metadata (page numbers, positions). Each chunk is saved to S3.
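A minimal version of that chunking step might look like this. The 100-character sizes in the test and the 1,000/200 defaults are illustrative parameters, not the production values:

```python
def chunk_pages(pages: list[dict], size: int = 1000, overlap: int = 200) -> list[dict]:
    """Concatenate page texts and split into overlapping chunks with page metadata."""
    combined = ""
    page_offsets = []  # (start offset in combined text, page number)
    for page in pages:
        page_offsets.append((len(combined), page["number"]))
        combined += page["text"] + "\n"

    def page_at(pos: int) -> int:
        # Page whose start offset is the last one at or before `pos`
        current = page_offsets[0][1]
        for offset, number in page_offsets:
            if offset <= pos:
                current = number
        return current

    chunks = []
    start = 0
    while start < len(combined):
        end = min(start + size, len(combined))
        chunks.append({
            "text": combined[start:end],
            "start_page": page_at(start),
            "end_page": page_at(end - 1),
        })
        if end == len(combined):
            break
        # Step back by `overlap` so context isn't cut at chunk borders
        start = end - overlap
    return chunks
```

The page metadata lets the UI link every retrieved chunk back to the exact pages it came from.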
Embedder: Each chunk is embedded using OpenAI’s text-embedding-ada-002 model. We chose ada-002 for its good balance of quality and cost at scale. Embeddings are stored in Aurora PostgreSQL using the pgvector extension, enabling similarity search across all documents in a batch.
This powers a RAG (Retrieval-Augmented Generation) interface where operators can ask questions like “What does section 3 say about the delivery terms?” and get answers grounded in the actual documents.
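The retrieval half of that loop is a pgvector similarity query. A sketch of the SQL it runs, using pgvector's cosine-distance operator `<=>`; the `chunks` table name and its columns are assumptions about the schema, not the production definitions:

```python
def build_similarity_query(top_k: int = 5) -> str:
    """SQL for cosine-distance retrieval with pgvector's `<=>` operator.

    The query embedding and batch id are bound as parameters so the
    driver (e.g. psycopg) handles escaping.
    """
    return (
        "SELECT chunk_text, document_id, start_page, end_page, "
        "embedding <=> %(query_embedding)s AS distance "
        "FROM chunks "
        "WHERE batch_id = %(batch_id)s "
        "ORDER BY distance "
        f"LIMIT {int(top_k)}"
    )
```

At query time the operator's question is embedded with the same ada-002 model, the top chunks come back ordered by distance, and they are handed to an LLM as grounding context for the answer.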
Lower Branch: Classification and Extraction Pipeline#
The lower branch handles structured information extraction.
Clusterization: A single document PDF often contains multiple logical sections (an executive summary followed by technical specifications followed by appendices). The clusterization stage identifies contiguous page ranges that belong to the same topic. We use Gemini 2.0 Flash for this because it’s fast, cheap, and performs well on the classification-style reasoning this step requires.
```python
import json

# `llm_gateway` is a client for the internal LLM gateway service,
# initialized elsewhere in the Lambda


def clusterize_document(pages: list[dict]) -> list[dict]:
    """Identify clusters of consecutive pages about the same topic."""
    # Build context from all pages
    page_summaries = "\n".join(
        f"Page {p['number']}: {p['text'][:200]}..."
        for p in pages
    )
    prompt = f"""Analyze these document pages and identify clusters
of consecutive pages that discuss the same topic.

Pages:
{page_summaries}

Return a JSON array of clusters, each with:
- start_page: first page number
- end_page: last page number
- topic: brief description of what this section covers
"""
    # Call Gemini 2.0 Flash via the LLM Gateway
    response = llm_gateway.invoke(
        provider="vertex",
        model="gemini-2.0-flash",
        prompt=prompt,
    )
    return json.loads(response)
```

Classification: Each cluster is labeled with a document type (report, specification, invoice, appendix, etc.). The classification model uses the cluster’s text content and the batch category to assign the most appropriate label.
Extraction: This is where it gets domain-specific. Based on the classification label, the extraction stage applies a tailored extraction strategy. A report gets title, date, author, and key findings extracted. A specification gets version, scope, and requirements extracted. An invoice gets line items and totals extracted.
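Conceptually, the stage is a dispatch table from classification label to field schema. The labels and field lists below are illustrative; the production schemas are richer:

```python
# Illustrative label → fields mapping (not the production schemas)
EXTRACTION_SCHEMAS = {
    "report": ["title", "date", "author", "key_findings"],
    "specification": ["version", "scope", "requirements"],
    "invoice": ["line_items", "totals"],
}


def build_extraction_prompt(label: str, cluster_text: str) -> str:
    """Build a type-specific extraction prompt for a classified cluster."""
    fields = EXTRACTION_SCHEMAS.get(label)
    if fields is None:
        # Unknown types fall back to a generic extraction
        fields = ["title", "summary"]
    field_list = "\n".join(f"- {f}" for f in fields)
    return (
        f"Extract the following fields from this {label} as JSON:\n"
        f"{field_list}\n\n"
        f"Document text:\n{cluster_text}"
    )
```

Keeping the schemas in data rather than in per-type code makes adding a new document type a configuration change instead of a pipeline change.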
The results are stored as structured JSON in DynamoDB, making them queryable and displayable in the web UI.
Cluster Aggregator: Collects results from all clusters, validates completeness, and updates the document’s processing status.
The Multi-Model LLM Strategy#
One of the most interesting architectural decisions was using three different LLM providers, each selected for a specific task:
| Task | Model | Why |
|---|---|---|
| Text extraction (OCR) | Claude 3.5 Sonnet (Bedrock) | Best vision capabilities for messy documents |
| Embeddings | text-embedding-ada-002 (OpenAI) | Cost-effective, high-quality embeddings at scale |
| Clusterization/Classification | Gemini 2.0 Flash (Vertex AI) | Fast and cheap for reasoning tasks |
All LLM calls go through an internal gateway service that abstracts the provider differences. The gateway handles authentication, rate limiting, usage tracking, and fallback logic. From the pipeline’s perspective, it’s just calling an API with a provider and model name.
This multi-model approach lets us optimize for cost and quality per task rather than being locked into a single provider. The text extraction stage is the most expensive (vision + large context), so we use the best model available. The clusterization stage processes much less data and needs speed more than depth, so we use a fast, cheap model.
Concurrency and Error Handling#
The Step Functions state machine uses Map states to process documents and pages in parallel. A single batch might contain 20 documents, each with 50 pages, resulting in 1,000 parallel page-processing Lambda invocations.
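The page-level fan-out is a Map state. A sketch of the Amazon States Language it compiles to, written here as a Python dict; the state names and the `MaxConcurrency` value are illustrative, not our production definition:

```python
# Illustrative ASL fragment for the page-parsing fan-out
PARSE_PAGES_MAP_STATE = {
    "ParsePages": {
        "Type": "Map",
        "ItemsPath": "$.pages",
        "MaxConcurrency": 50,  # cap concurrent Lambda invocations per document
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "ParsePage",
            "States": {
                "ParsePage": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Retry": [{
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 5,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }],
                    "End": True,
                }
            },
        },
        "Next": "PageChecker",
    }
}
```

The `Retry` block gives each page a couple of automatic attempts at the Step Functions level before the Page Checker's application-level retry logic even sees a failure.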
Key patterns:
Idempotency: Every Lambda function is idempotent. If a Step Functions execution is retried (due to a transient error), re-processing the same input produces the same result without side effects. We use DynamoDB conditional writes to prevent duplicate processing.
Graceful degradation: If the embedding branch fails for a document, the classification branch still completes (and vice versa). A document with failed embeddings can still have its structured data extracted. The system tracks partial success at every level.
Correlation tracking: Every request gets a correlation ID that flows through all Lambda invocations, S3 objects, and DynamoDB records. When something fails, you can trace the entire processing chain from upload to the specific failed step.
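Within a single Lambda, one way to make the correlation ID show up everywhere is a logging filter over a context variable. A sketch under that assumption, not our exact implementation:

```python
import logging
import sys
from contextvars import ContextVar

# Correlation id for the current processing chain; set once per Lambda event
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation id."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def configure_logging() -> None:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(name)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])
```

The handler sets `correlation_id.set(event["correlation_id"])` at the top of each invocation, and every subsequent log line carries it automatically.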
The Processing Service#
The web UI is a Django application running on ECS Fargate, integrated as a microfrontend into the larger analytics platform. Operators can:
- Browse batches and their documents
- View extracted text alongside the original document images
- See structured extraction results (fields, values, confidence)
- Search across all documents using natural language (RAG)
- Generate document summaries on demand
- Track document processing status in real time
Authentication uses Keycloak with JWT tokens. Authorization uses Entra ID groups: each batch is associated with a visibility group, and only members of that group can access its documents. A daily sync job keeps group membership current.
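The per-batch check itself is simple. A sketch, assuming a `visibility_group` field on each batch record (the field name is illustrative):

```python
def visible_batches(batches: list[dict], user_groups: set[str]) -> list[dict]:
    """Return only the batches whose visibility group the user belongs to.

    `user_groups` comes from the daily Entra ID sync; each batch stores
    the single group that may see it.
    """
    return [b for b in batches if b["visibility_group"] in user_groups]
```

Filtering at query time against the synced group list means a revoked membership takes effect at the next sync without touching any batch records.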
Performance and Cost#
Some numbers from production:
- Processing time: a 30-page document takes approximately 3-5 minutes end-to-end (dominated by LLM call latency)
- Throughput: the system handles 500+ documents per hour during peak periods
- Cost per document: roughly EUR 0.15-0.30, depending on page count and complexity (the bulk of the cost is LLM inference)
- Infrastructure cost when idle: near zero (serverless)
The batch processing approach (every 10 minutes) means we can process multiple documents from the same batch together, which is more efficient than processing each upload individually.
Lessons Learned#
Vision LLMs are production-ready for OCR. Claude 3.5 Sonnet handles real-world documents (stamps, handwriting, poor scans, mixed languages) far better than traditional OCR. The quality improvement justified the higher per-page cost.
Step Functions are the right tool for document pipelines. The built-in retry logic, parallel Map states, error handling, and visual debugging make Step Functions ideal for multi-stage document processing. We tried orchestrating with SQS queues initially, but the complexity of tracking state across stages was not worth it.
Multi-model is the way forward. No single LLM is best at everything. Using Claude for vision, OpenAI for embeddings, and Gemini for fast classification gave us the best cost-quality tradeoff at each stage.
Retry logic is not optional. LLM APIs fail more often than traditional APIs. Rate limits, timeouts, model overload. The Page Checker retry pattern (up to 3 attempts per page) is what makes the pipeline reliable enough for production.
Batch over real-time when you can. Processing uploads every 10 minutes instead of immediately simplified the architecture significantly and reduced costs. For document processing workflows, a few minutes of latency is perfectly acceptable.
What’s Next#
The natural evolution is closing the loop: having the AI agent not just extract and classify documents, but also suggest actions or decisions based on the extracted data and historical patterns. This moves from “AI assists the human” to “AI proposes, human approves,” which is the logical next step for any document-heavy workflow.
Want to go deeper on AI integration, platform engineering, or backend systems? I offer 1:1 coaching sessions tailored to your background and goals. Check out the coaching page.