A fintech startup I consulted for last year had a compliance requirement: redact customer PII from documents before sending them to their AI analytics service. Simple enough in principle.
Their first implementation: a Python script that called Adobe Acrobat on each PDF, then waited for a human to manually mark the redactions, then saved the result. The "automation" was a folder watcher that opened documents one at a time.
Processing time: 15 minutes per document, assuming someone was at the keyboard. At 200 documents per day, they needed two full-time employees just doing redaction. The AI service that was supposed to save them time? It sat idle while documents queued.
Three months later, they replaced the manual step with an API call. Same folder watcher, but instead of opening Acrobat, it sent the document to a redaction API. Processing time dropped to 8 seconds per document. Those two employees moved to actual compliance work instead of drawing black boxes.
This is the difference between "we have redaction" and "we have automated redaction." One scales. The other becomes a bottleneck the moment volume increases.
Let me show you how to build the API version.
The short version: If you need to redact sensitive documents before they reach AI systems, PaperVeil handles that layer. The rest of this article explains where it fits in the broader governance architecture.
Why Manual Redaction Falls Apart at Scale
Manual redaction works fine when you have five documents. It doesn't work when you have five hundred.
The time math doesn't add up. A trained person can thoroughly redact maybe 20 to 30 pages per hour. A 50-page contract takes about two hours. Ten contracts per day? That's roughly 20 hours of redaction, more than two full-time people for one type of document.
Consistency degrades. The person redacting document 47 catches less than they did on document 1. Different people redact differently. Tuesday's redaction differs from Friday's. The tired 4pm version of your compliance officer isn't as thorough as the caffeinated 9am version.
Humans miss things. The SSN on page 43. The email address in the footer. The phone number in the scanned image attachment. Manual review catches most sensitive data, but "most" isn't good enough when compliance is at stake. "Most" doesn't hold up in a regulatory inquiry.
Everything waits in a queue. Every document sits there until a human processes it. Workflows stall. AI tools sit idle. The business value you were trying to capture is delayed while someone draws rectangles.
What You Get When You Automate
Speed that actually scales. Process documents in seconds, not hours. Hundreds of pages per minute rather than per day. That fintech startup went from 15 minutes to 8 seconds. Same input, same output, 99% less time.
Consistency you can prove. Same detection rules, every document, every time. No variation based on fatigue or attention. Document 500 gets the exact same treatment as document 1.
Coverage that doesn't tire. Automated detection checks every page, every line, every pattern. It runs OCR on images to find text humans would miss entirely. It doesn't get bored at page 40 and start skimming.
Integration with everything else. Redaction becomes a step in larger workflows. Documents flow through automatically, emerging sanitized and ready for downstream use. No human in the loop unless you actually need one.
Audit trails for free. Programmatic processing generates logs automatically. What was detected, what was redacted, when it happened, what rules applied. Try generating that documentation from "a person looked at it."
Architecture Patterns (Pick the One That Fits)
The Simple One: Synchronous API Calls
Send document, wait for response:
Client → API Request (document + config) → Wait → Response (redacted document)
How it works:
- Client sends document and redaction configuration
- API processes document (OCR if needed, detection, redaction)
- API returns redacted document in response
- Client receives and continues workflow
This is the one you should start with. It's the simplest to implement, debug, and understand.
Works well for: Interactive applications where a user uploads and waits. Small to medium documents under 50 pages. Situations where you need the result before continuing.
Breaks down when: Documents are huge (HTTP timeouts at 30 to 60 seconds), volume gets high enough that waiting becomes a problem, or you need to process documents without blocking other work.
The Scalable One: Async Processing with Polling
For larger documents or high-volume processing:
Client → Submit Job → Receive Job ID
Client → Poll Status (with Job ID) → Processing...
Client → Poll Status → Complete
Client → Download Result
You submit the document, get back a job ID, and check periodically until it's done. The API does its work in the background.
Works well for: Large documents (100+ pages), batch processing, workflows where you don't need immediate results, systems where you want to decouple submission from retrieval.
Implementation notes: Use exponential backoff for polling (don't hammer the status endpoint every 100ms). Handle job expiration (results aren't available forever). Consider webhooks instead if you're polling a lot.
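Here is a minimal sketch of the polling loop with exponential backoff. It assumes you supply a `get_status` function (hypothetical here) that calls the status endpoint and returns the job state; the `"failed"` state, the delay values, and the timeout are illustrative assumptions, not documented API behavior.

```python
import time
from typing import Callable


def poll_until_complete(
    get_status: Callable[[], str],
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() with exponential backoff until the job finishes.

    get_status is a caller-supplied function (an assumption of this sketch)
    that returns the current job state as a string.
    """
    waited = 0.0
    delay = base_delay
    while True:
        status = get_status()
        if status in ("complete", "failed"):
            return status
        if waited + delay > timeout:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # 1s, 2s, 4s, ... capped at max_delay
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting. In production you would leave it at the default.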
The Event-Driven One: Webhooks
Let the API tell you when it's done instead of asking repeatedly:
Client → Submit Job → Receive Job ID
(processing happens)
API → Webhook POST to Client → Job complete notification
Client → Download Result
You submit the document with a callback URL. When processing finishes, the API POSTs to your URL.
Works well for: Event-driven architectures, serverless implementations, systems where polling feels wasteful, complex multi-step workflows where you want notifications to trigger next steps.
Implementation notes: Your webhook endpoint needs to be publicly accessible. Implement signature verification so random internet traffic can't fake completion notifications. Handle retries (webhooks fail sometimes). Design for idempotency (you might get the same webhook twice).
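Signature verification and idempotency can be sketched in a few lines. This assumes the sender signs the raw request body with HMAC-SHA256 and a shared secret, and sends the hex digest along with the request; the actual signing scheme and header name depend on the provider, so treat those details as assumptions.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


# In-memory for illustration only; a real service needs a durable store.
_seen_job_ids = set()


def handle_webhook(secret: bytes, body: bytes, signature_hex: str, job_id: str) -> str:
    """Reject forged requests and ignore duplicate deliveries."""
    if not verify_signature(secret, body, signature_hex):
        return "rejected"
    if job_id in _seen_job_ids:
        return "duplicate"  # already handled, safe to acknowledge again
    _seen_job_ids.add(job_id)
    return "accepted"
```

Verify against the raw bytes you received, not a re-serialized version of the parsed JSON, since re-serialization can change whitespace and key order.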
The Production One: Streaming Pipeline
Continuous document processing at scale:
Document Source → Ingestion Queue → Redaction Worker → Output Queue → Destination
Documents arrive in an S3 bucket or message queue. Workers pull documents, call the redaction API, place results in an output location. Downstream systems consume from there.
Works well for: Continuous document flows, high-volume production systems, enterprise integration where documents arrive constantly.
Implementation notes: You'll need proper error handling and dead letter queues. Retry policies for transient failures. Monitoring and alerting. Scalability through worker parallelization. This is the most complex pattern but handles the most volume.
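To make the worker and dead letter queue idea concrete, here is a rough sketch using Python's in-memory `queue.Queue` as a stand-in for a real message queue. The `redact` callable is a placeholder for whatever function calls the redaction API, and the attempt limit is an arbitrary choice.

```python
import queue
from typing import Callable


def run_worker(
    inbox: "queue.Queue[bytes]",
    outbox: "queue.Queue[bytes]",
    dead_letters: "queue.Queue[bytes]",
    redact: Callable[[bytes], bytes],
    max_attempts: int = 3,
) -> None:
    """Drain inbox: results go to outbox, repeated failures to dead_letters."""
    while True:
        try:
            document = inbox.get_nowait()
        except queue.Empty:
            return  # a real worker would block and wait for more work
        for attempt in range(1, max_attempts + 1):
            try:
                outbox.put(redact(document))
                break
            except Exception:
                if attempt == max_attempts:
                    # Park the document for inspection instead of losing it
                    dead_letters.put(document)
```

Scaling out means running more copies of this loop against the same input queue. A managed queue service would also handle visibility timeouts and redelivery for you, which this sketch does not.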
Building with PaperVeil API
PaperVeil provides a document redaction API designed for integration into automated workflows.
What the API Looks Like
Endpoint: POST /api/redact
Authentication: API key in header
Request structure:
{
  "document": "<base64 encoded PDF>",
  "config": {
    "detect": {
      "personName": true,
      "email": true,
      "phone": true,
      "ssn": true,
      "creditCard": true,
      "address": true,
      "dateOfBirth": true
    },
    "customPatterns": [
      {
        "name": "AccountNumber",
        "pattern": "ACC-\\d{8}"
      }
    ],
    "customTerms": ["Acme Corporation", "Project Phoenix"],
    "removeLogos": true
  }
}
You tell it what PII types to detect, add any custom patterns specific to your organization, and optionally specify terms or logos to remove.
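Before you ship a custom pattern, it's worth checking it locally against sample text. The sketch below uses Python's `re` module with the `AccountNumber` pattern from the request example. It assumes the API's regex dialect behaves like Python's for this simple pattern, which is an assumption rather than something the request format guarantees.

```python
import re
from typing import List

# Same pattern as the request example. JSON needs "\\d"; a Python raw string does not.
ACCOUNT_NUMBER = re.compile(r"ACC-\d{8}")


def find_account_numbers(text: str) -> List[str]:
    """Return every substring that matches the account-number pattern."""
    return ACCOUNT_NUMBER.findall(text)


sample = "Refund ACC-12345678 to ACC-1234 and close ACC-87654321."
print(find_account_numbers(sample))  # ['ACC-12345678', 'ACC-87654321']
```

Note that the pattern as written will also match the first eight digits of a longer number. If that matters for your data, consider anchoring it with word boundaries.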
Response structure:
{
  "status": "complete",
  "document": "<base64 encoded redacted PDF>",
  "manifest": {
    "totalDetections": 47,
    "byType": {
      "personName": 12,
      "email": 8,
      "phone": 5,
      "ssn": 2,
      "address": 15,
      "customPattern:AccountNumber": 3,
      "customTerm:AcmeCorporation": 2
    },
    "pages": [
      {
        "page": 1,
        "detections": [
          {
            "type": "personName",
            "value": "John Smith",
            "coordinates": {"x": 120, "y": 340, "w": 130, "h": 25}
          }
        ]
      }
    ]
  },
  "metadata": {
    "processingTime": 2340,
    "pagesProcessed": 12,
    "ocrApplied": true
  }
}
You get back the redacted document, a manifest of everything found and removed, and metadata about processing. The manifest is your audit trail.
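One thing to watch: the manifest shown above includes the detected values themselves ("John Smith"), so you probably don't want to copy it verbatim into general-purpose logs. Here is a sketch of turning a manifest into an audit record that keeps counts and page numbers but drops the values. The record field names (`documentId`, `recordedAt`, `pagesWithDetections`) are my own and not part of the API.

```python
import json
from datetime import datetime, timezone


def build_audit_record(document_id: str, manifest: dict) -> dict:
    """Summarise a manifest for audit storage, leaving out detected values."""
    return {
        "documentId": document_id,
        "recordedAt": datetime.now(timezone.utc).isoformat(),
        "totalDetections": manifest["totalDetections"],
        "byType": dict(manifest["byType"]),
        # Keep only the page numbers, not the detections with their values
        "pagesWithDetections": [
            page["page"]
            for page in manifest.get("pages", [])
            if page["detections"]
        ],
    }
```

If your compliance process needs the full manifest, store it separately with tighter access controls and keep this lighter record for day-to-day querying.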
The Python Version
import base64

import requests

def redact_document(file_path, api_key):
    # Read and encode document
    with open(file_path, 'rb') as f:
        document_b64 = base64.b64encode(f.read()).decode()
    # Configure redaction
    payload = {
        "document": document_b64,
        "config": {
            "detect": {
                "personName": True,
                "email": True,
                "phone": True,
                "ssn": True,
                "creditCard": True,
                "address": True,
                "dateOfBirth": True
            }
        }
    }
    # Call API (requests has no default timeout, so set one explicitly)
    response = requests.post(
        "https://api.paperveil.com/redact",
        json=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        timeout=60
    )
    # Fail loudly on 4xx/5xx instead of trying to decode an error body
    response.raise_for_status()
    result = response.json()
    # Save redacted document
    redacted_bytes = base64.b64decode(result["document"])
    output_path = file_path.replace(".pdf", "_redacted.pdf")
    with open(output_path, 'wb') as f:
        f.write(redacted_bytes)
    return result["manifest"]
That's the whole thing. Read file, encode it, call API, decode result, save. Everything complex (OCR, entity detection, pattern matching, coordinate mapping) happens on the other end.
The Node.js Version
const axios = require('axios');
const fs = require('fs');

async function redactDocument(filePath, apiKey) {
  // Read and encode document
  const documentBuffer = fs.readFileSync(filePath);
  const documentB64 = documentBuffer.toString('base64');
  // Configure redaction
  const payload = {
    document: documentB64,
    config: {
      detect: {
        personName: true,
        email: true,
        phone: true,
        ssn: true,
        creditCard: true,
        address: true,
        dateOfBirth: true
      }
    }
  };
  // Call API
  const response = await axios.post(
    'https://api.paperveil.com/redact',
    payload,
    {
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      }
    }
  );
  // Save redacted document
  const redactedBuffer = Buffer.from(response.data.document, 'base64');
  const outputPath = filePath.replace('.pdf', '_redacted.pdf');
  fs.writeFileSync(outputPath, redactedBuffer);
  return response.data.manifest;
}
Same logic, different language. The API doesn't care what you're calling it from.
Integration Patterns for Real Workflows
Email Attachment Redaction with n8n
n8n lets you build visual workflows that integrate with document APIs:
[Email Trigger] → [Extract Attachments] → [Filter PDFs] → [HTTP Request: PaperVeil API] → [Save to Drive] → [Reply with Link]
When an email arrives with a PDF attachment, the workflow extracts it, sends it to the redaction API, saves the sanitized version to Google Drive, and replies with a link. No code required beyond configuring the HTTP Request node.
This is useful for teams that receive sensitive documents by email and need to process them before sharing or analyzing.
Watch Folder with AWS Lambda
Serverless document processing triggered by S3 uploads:
S3 (input bucket) → Lambda (trigger) → PaperVeil API → Lambda → S3 (output bucket)
                                                         ↓
                                             DynamoDB (manifest storage)
import base64
import json
import os
import time
import urllib.parse

import boto3
import requests

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

def handler(event, context):
    # Get document from S3 (object keys in S3 events are URL-encoded)
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = urllib.parse.unquote_plus(record['object']['key'])
    s3_object = s3.get_object(Bucket=bucket, Key=key)
    document_bytes = s3_object['Body'].read()
    document_b64 = base64.b64encode(document_bytes).decode()
    # Call PaperVeil API
    response = requests.post(
        "https://api.paperveil.com/redact",
        json={
            "document": document_b64,
            "config": {
                "detect": {
                    "personName": True,
                    "email": True,
                    "ssn": True
                }
            }
        },
        headers={"Authorization": f"Bearer {os.environ['PAPERVEIL_API_KEY']}"},
        timeout=120
    )
    response.raise_for_status()
    result = response.json()
    # Save redacted document. This writes to the same bucket under output/,
    # so the S3 trigger must be filtered to the input/ prefix to avoid a loop.
    redacted_bytes = base64.b64decode(result["document"])
    output_key = key.replace("input/", "output/").replace(".pdf", "_redacted.pdf")
    s3.put_object(Bucket=bucket, Key=output_key, Body=redacted_bytes)
    # Store manifest for audit trail
    table = dynamodb.Table('redaction-manifests')
    table.put_item(Item={
        'documentKey': key,
        'manifest': json.dumps(result["manifest"]),
        'timestamp': int(time.time())  # Unix time when the document was processed
    })
    return {"statusCode": 200, "body": json.dumps({"output": output_key})}
Drop a PDF in the input folder, get a redacted version in the output folder. The manifest goes to DynamoDB for compliance records.
Redact Before Sending to ChatGPT
The most common use case: sanitize documents before AI processing.
from openai import OpenAI

# Placeholder local helpers: `paperveil.redact_document` is assumed to return
# the redacted PDF, and `pdf_utils.extract_text` to return its text.
from paperveil import redact_document
from pdf_utils import extract_text

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_document_safely(pdf_path, prompt):
    # Step 1: Redact sensitive content
    redacted_pdf = redact_document(pdf_path, {
        "detect": {
            "personName": True,
            "email": True,
            "phone": True,
            "ssn": True,
            "address": True
        }
    })
    # Step 2: Extract text from redacted PDF
    sanitized_text = extract_text(redacted_pdf)
    # Step 3: Send to LLM (openai>=1.0 client interface)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Analyze the following document."},
            {"role": "user", "content": f"{prompt}\n\nDocument:\n{sanitized_text}"}
        ]
    )
    return response.choices[0].message.content
Document goes to the redaction API first. Only the redacted text goes to the LLM, so the detected PII never reaches the model provider. The AI gets content it can analyze. You get insights with far less compliance risk.
When Things Go Wrong (And How to Handle It)
Common Error Scenarios
Document problems: Corrupted PDFs, password-protected files, unsupported formats. The API will tell you what went wrong. Your code needs to handle the error gracefully rather than crashing.
API problems: Rate limiting (429), authentication failures (401), service unavailable (503). Transient failures happen. Don't treat them as permanent.
Detection problems: OCR failures on poor quality scans, pattern mismatches on unusual formats. Sometimes the input is too degraded to process reliably.
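One way to keep transient and permanent failures apart is to map status codes to two exception types, so retry logic only fires when a retry can help. This is a sketch under my own assumptions: the exception names are made up for illustration, and I've added 502 and 504 to the transient set alongside the 429 and 503 mentioned above.

```python
class TransientAPIError(Exception):
    """Worth retrying: rate limits and temporary outages."""


class PermanentAPIError(Exception):
    """Not worth retrying: bad credentials or an unprocessable document."""


TRANSIENT_STATUS_CODES = {429, 502, 503, 504}


def raise_for_api_status(status_code: int) -> None:
    """Raise the matching exception for a non-2xx status code."""
    if 200 <= status_code < 300:
        return
    if status_code in TRANSIENT_STATUS_CODES:
        raise TransientAPIError(f"HTTP {status_code}")
    raise PermanentAPIError(f"HTTP {status_code}")
```

A retry wrapper can then catch `TransientAPIError` only and let `PermanentAPIError` surface immediately. Retrying a 401 three times just delays the moment you find out the key is wrong.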
Retry Logic That Actually Works
import time
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while retries < max_retries:
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Out of attempts: let the last exception propagate
                    if retries == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** retries)
                    time.sleep(delay)
                    retries += 1
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def redact_with_retry(document_path, api_key):
    return redact_document(document_path, api_key)
With max_retries=3, the call is attempted up to three times. First retry after 1 second. Second retry after 2 seconds. If that one fails too, the exception propagates and you give up. Set max_retries=4 if you want a third retry after 4 seconds. This handles most transient failures without hammering the API.
Circuit Breaker for High Volume
If you're processing thousands of documents and the API starts failing, you don't want thousands of retry loops piling up:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure_time = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"  # let one trial request through
            else:
                raise Exception("Circuit breaker is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
        # Any success resets the count, so only consecutive failures trip the breaker
        self.state = "closed"
        self.failure_count = 0
        return result
After 5 consecutive failures, the circuit "opens" and all requests fail fast for 60 seconds. This gives the API time to recover without your system making things worse.
What to Monitor
Throughput: Documents processed per hour, pages per hour, API calls per minute. These tell you if you're keeping up with volume.
Latency: Average processing time, P95 processing time, time by document size. These tell you if something's slowing down.
Quality: Detection counts by type, false positive reports, failed processing rate. These tell you if detection is working correctly.
Cost: API credits consumed, cost per document, cost per detection type. These tell you if you're staying within budget.
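For the latency numbers, here is a small sketch that computes the mean and P95 from a list of processing times in milliseconds, such as the `processingTime` values returned in the response metadata. It uses the nearest-rank definition of a percentile; monitoring tools may interpolate differently, so expect small differences from what your dashboard shows.

```python
import math
import statistics
from typing import Dict, List


def latency_summary(times_ms: List[float]) -> Dict[str, float]:
    """Return count, mean, and nearest-rank P95 for processing times."""
    if not times_ms:
        raise ValueError("need at least one measurement")
    ordered = sorted(times_ms)
    # Nearest rank: the smallest value with at least 95% of samples at or below it
    rank = math.ceil(95 * len(ordered) / 100)
    return {
        "count": len(ordered),
        "mean_ms": statistics.mean(ordered),
        "p95_ms": ordered[rank - 1],
    }
```

P95 is usually more useful than the average here, because a handful of very large scanned documents can hide behind a healthy-looking mean.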
Log everything in structured JSON so you can query it later:
logger.info(json.dumps({
    "event": "redaction_completed",
    "request_id": request_id,
    "detections": result["manifest"]["totalDetections"],
    "processing_time_ms": result["metadata"]["processingTime"]
}))
Security (Don't Skip This)
API keys go in environment variables or secrets managers. Never commit them to version control. Rotate them periodically. Use separate keys for dev, staging, and production.
Documents in transit use HTTPS. TLS 1.2 at minimum. This isn't optional.
Delete temporary files after processing. Don't leave unredacted documents sitting in /tmp.
Never log document content. Log metadata, processing times, detection counts. Not the actual text or the actual PII values.
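Two of these habits are easy to bake into code. The sketch below reads the key from an environment variable (the `PAPERVEIL_API_KEY` name matches the Lambda example above) and keeps unredacted input in a temporary directory that is removed when processing ends, even if it fails. The function names and the `process` callable are illustrative.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def load_api_key(env_var: str = "PAPERVEIL_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


def process_in_scratch_dir(
    document_bytes: bytes, process: Callable[[Path], bytes]
) -> bytes:
    """Write unredacted input to a private temp dir that is always cleaned up."""
    # The directory and its contents are deleted on exit, even on error
    with tempfile.TemporaryDirectory() as scratch:
        input_path = Path(scratch) / "input.pdf"
        input_path.write_bytes(document_bytes)
        return process(input_path)
```

This removes the files but does not securely wipe the underlying disk blocks. If your threat model requires that, use an encrypted or in-memory filesystem for scratch space.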
The Bottom Line
Document redaction at scale requires automation. Manual processing doesn't keep up with modern document volumes, and consistency degrades as humans tire. Two employees drawing black boxes is not a scalable solution.
API-based redaction provides speed (seconds instead of hours), consistency (same rules, every document), integration (documents flow through workflows automatically), and auditability (processing generates logs automatically).
The architecture pattern you choose depends on your scale. Start with synchronous calls for prototyping. Move to async patterns as volume grows. The API stays the same; only the orchestration changes.
PaperVeil handles the complexity: OCR for scanned content, NER for entity detection, pattern matching for structured data. Your integration code stays simple. A few dozen lines to go from manual redaction to automated pipeline.
Document redaction isn't a manual task anymore. It's an API call in your pipeline, running automatically, consistently, and at whatever scale you need.
PaperVeil lets you redact all your sensitive information from PDFs in a simple drag-and-drop flow. Detect and remove PII, match custom patterns, strip metadata, and generate audit trails. It's the redaction layer that makes AI document processing actually safe.