How to Redact a PDF Before Uploading to ChatGPT or Claude

In 2009, the Transportation Security Administration published a "redacted" airport security manual. They'd drawn black boxes over the sensitive parts using what looked like the highlighting tool in their PDF editor.

Someone copied and pasted the text. All of it. The full content was sitting right there under the black boxes. Sensitive security procedures for every US airport became public because someone thought "black box over text" equals "redacted."

In 2014, a law firm filed a court document with "redacted" financial figures. Same technique: black highlighting. Opposing counsel extracted the hidden numbers in about thirty seconds.

This happens constantly because most people don't understand what redaction actually means. They think they've removed sensitive information. They've actually just covered it with a digital sticker that anyone can peel off.

If you're about to upload a document to ChatGPT or Claude, and you think you've "redacted" it by drawing black boxes, you might want to keep reading.

The short version: If you need to redact sensitive documents before they reach AI systems, PaperVeil handles that layer. The rest of this article explains what true redaction is, how the manual options compare, and where automated redaction fits in an AI workflow.

What Redaction Actually Means

Let me be very clear about something: there's a difference between hiding content and removing it.

Redaction is NOT:

  • Drawing black boxes over text with annotation tools
  • Changing text color to match the background
  • Covering text with images or shapes
  • Using the highlighter tool in black

All of these methods leave the original text sitting in the PDF file. The visual layer hides it from your eyes. The data layer keeps it perfectly intact. Anyone with basic PDF knowledge can recover it.
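To make that concrete, here's a minimal sketch of why a black box doesn't help. The content stream below is hand-written and uncompressed for illustration (real PDFs usually compress streams and may encode text differently), and `visible_text_operands` is a hypothetical helper name. It draws text with the `Tj` operator, then paints a black rectangle on top, and the text is still trivially readable from the data.

```python
import re

# A hand-written, uncompressed page content stream (illustrative only).
# Line 1 draws the text. Line 2 paints a filled black rectangle over it.
content_stream = (
    b"BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET\n"
    b"0 0 0 rg 70 695 160 16 re f\n"
)

def visible_text_operands(stream: bytes) -> list:
    """Return the literal strings passed to the Tj (show text) operator."""
    matches = re.findall(rb"\(([^()\\]*)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]

# The rectangle changes what you see, not what the file contains.
print(visible_text_operands(content_stream))  # prints ['SSN: 123-45-6789']
```

Copy and paste in a PDF viewer does roughly the same thing: it reads the text operators and ignores whatever was drawn over them.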

Redaction IS:

  • Permanently removing text and image content from the PDF
  • Replacing removed content with solid fills (the black boxes you see)
  • Eliminating the underlying data from the file structure
  • Irreversible: the original content cannot be recovered

True redaction modifies the PDF itself, not just how it looks. When done correctly, the sensitive data is gone. Not hidden. Gone.
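Here's a toy sketch of the difference, under the same simplifying assumption of an uncompressed content stream with literal strings. The function name `redact_stream` is made up for this example, and real redaction tools also have to handle compressed streams, hex strings, split text runs, and images. The point is only that the text operation itself gets deleted while the black fill stays.

```python
import re

TEXT_SHOW = re.compile(rb"\(([^()\\]*)\)\s*Tj")

def redact_stream(stream: bytes, secret: bytes) -> bytes:
    """Delete every Tj operation whose literal string contains `secret`."""
    def drop_if_secret(match):
        # Returning b"" removes the text operation from the data itself.
        return b"" if secret in match.group(1) else match.group(0)
    return TEXT_SHOW.sub(drop_if_secret, stream)

before = (
    b"BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET\n"
    b"0 0 0 rg 70 695 160 16 re f\n"
)
after = redact_stream(before, b"123-45-6789")

# The black rectangle is still drawn, but there is nothing underneath it.
print(b"123-45-6789" in after)  # prints False
```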

Why This Matters for AI

When you upload a document to ChatGPT, Claude, or any LLM:

  1. The document leaves your network
  2. It's transmitted to the AI provider's servers
  3. It may be stored (at least temporarily)
  4. It might be used for training (depending on settings)
  5. You can't take it back

For documents containing Social Security numbers, client names, medical information, or confidential business data, this creates real problems:

  • HIPAA violations for health data
  • GDPR violations for EU personal data
  • Potential privilege waiver for legal content
  • Breach of confidentiality for client information
  • Competitive exposure for proprietary content

If you think you've redacted a document but you actually just drew boxes on it, you've sent all that sensitive data to an external server with a false sense of security.

Method 1: Adobe Acrobat Pro (The Way Most People Do It)

Adobe Acrobat Pro DC includes dedicated redaction tools. Here's how to use them correctly.

Step 1: Open the Redaction Tool

  1. Open your PDF in Acrobat Pro DC
  2. Go to Tools → Redact
  3. The redaction toolbar appears at the top

Step 2: Mark Content for Redaction

For text:

  • Click "Mark for Redaction"
  • Click and drag to select text
  • Selected content gets a red overlay (not yet redacted)

For images:

  • Click "Mark for Redaction"
  • Draw a rectangle around image areas
  • Entire regions within the rectangle will be removed

For patterns like SSN or phone numbers:

  • Click "Mark for Redaction" → "Find Text"
  • Use "Patterns" to find SSNs, phone numbers, email addresses
  • Select all matches and mark for redaction
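If you want a feel for what pattern search is doing, here's a rough sketch using Python's `re` module. These expressions are my own approximations of US-style SSNs, US-style phone numbers, and email addresses, not Acrobat's actual patterns, and they will miss unusual formats.

```python
import re

# Approximate patterns. They are illustrative, not exhaustive.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def find_matches(text: str) -> dict:
    """Return every match for each pattern, keyed by pattern name."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

sample = "Reach Jo at jo.smith@example.com or (555) 010-2345. SSN 123-45-6789."
for label, hits in find_matches(sample).items():
    print(label, hits)
```

Pattern search only finds what the pattern describes. An SSN typed without dashes, for example, slips straight past the first expression, which is why a manual pass still matters.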

Step 3: Apply Redaction

  1. Click "Apply" in the toolbar
  2. Acrobat warns that this action is permanent
  3. Confirm to permanently remove marked content
  4. Save the file with a new name (keep your original)

Step 4: Don't Skip This Part

Remove Hidden Information. This is where most people fail.

  1. Go to Tools → Redact → "Remove Hidden Information"
  2. This strips metadata, hidden layers, comments, and embedded data
  3. Without this step, sensitive info might still be lurking in the file structure
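As a quick sanity check after this step, you can scan the saved file for leftover document information entries. This is a rough sketch with a big caveat: it only sees plain, uncompressed literal strings, so a clean result does not prove the metadata is gone (entries can sit in compressed object streams, hex strings, or XMP packets). The key names are the standard document information keys, and `leftover_metadata` is a name I made up.

```python
import re

# Standard keys of a PDF document information dictionary.
INFO_KEYS = (b"/Author", b"/Title", b"/Subject", b"/Keywords", b"/Creator", b"/Producer")

def leftover_metadata(pdf_bytes: bytes) -> dict:
    """Return any Info-style entries stored as plain literal strings."""
    found = {}
    for key in INFO_KEYS:
        match = re.search(re.escape(key) + rb"\s*\(([^)]*)\)", pdf_bytes)
        if match:
            found[key.decode()[1:]] = match.group(1).decode("latin-1")
    return found

# Stand-in for the bytes of a real file, e.g. Path("redacted.pdf").read_bytes()
sample = b"1 0 obj\n<< /Author (Jane Doe) /Producer (Example Writer 1.0) >>\nendobj\n"
print(leftover_metadata(sample))
```

Anything this prints is a reason to rerun "Remove Hidden Information". An empty result just means nothing obvious was found.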

The Problems with Acrobat

While Acrobat Pro is the industry standard, it has real usability issues:

It's painfully slow. Each file must be opened, marked, applied, cleaned, and saved individually. Processing 20 documents takes an hour of clicking.

Pattern detection is limited. Built-in patterns cover SSN, phone, and email. Custom patterns require regex knowledge that most people don't have.

No batch processing. You can't select a folder of PDFs and redact them automatically. Every document needs manual attention.

Scanned PDFs need OCR first. For image-based PDFs, you must run text recognition, then redact, hoping the OCR caught everything.

Cost. Acrobat Pro DC runs $22.99/month. For occasional use, that's expensive.

The interface is confusing. First-time users regularly use annotation tools instead of redaction tools and end up with fake redaction.

Method 2: Free Online Tools (Please Don't)

Several online services offer free PDF redaction: Smallpdf, PDF24, Sejda.

Here's the problem: you're uploading sensitive documents to a third party in order to protect them from third parties.

When you upload a PDF to Smallpdf for redaction, that document travels to their servers. You're trusting their data handling before you've even removed the sensitive information.

For documents sensitive enough to need redaction before AI processing, uploading the un-redacted version to a different cloud service defeats the entire purpose.

Also:

  • Quality varies wildly. Some don't do true redaction, just visual overlay.
  • File size limits on free tiers
  • No pattern detection. Manual selection only.
  • No way to verify the redaction actually worked

If your document truly contains sensitive information, free online tools aren't the answer.

Method 3: Desktop Applications

Several desktop apps handle redaction without cloud uploads:

PDF-XChange Editor: Windows, one-time purchase ($56), includes redaction tools.

Foxit PDF Editor: Cross-platform, $149/year subscription, enterprise features.

LibreOffice Draw: Free and open source. Warning: NOT true redaction. The underlying text remains.

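Preview (Mac): Built into macOS. Warning: Drawing shapes or black rectangles over text is NOT true redaction. Recent macOS versions add a separate Redact tool, but verify its output before you trust it.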

The desktop approach keeps files local but typically has the same usability problems as Acrobat: manual selection, no batch processing, limited pattern detection.

Method 4: Automated Redaction (The Modern Approach)

The limitations of traditional redaction become unworkable when you're processing documents regularly for AI workflows.

Automated redaction tools take a different approach:

  1. Upload the document (single file or batch)
  2. Select what to detect (PII types, patterns, custom terms)
  3. Execute redaction (automatic detection and removal)
  4. Download clean file (ready for AI processing)
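The batch part of that flow is the easy bit to picture. Below is a minimal sketch, assuming you already have some `redact` callable that takes PDF bytes and returns redacted PDF bytes. That callable is a placeholder for whatever tool you use; it is not a real library function, and `redact_folder` is a hypothetical name.

```python
from pathlib import Path
from typing import Callable

def redact_folder(
    source: Path,
    destination: Path,
    redact: Callable[[bytes], bytes],
) -> list:
    """Run `redact` over every PDF in `source`, writing copies to `destination`."""
    destination.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(source.glob("*.pdf")):
        out_path = destination / f"{pdf_path.stem}_redacted.pdf"
        # Originals are only read, never overwritten.
        out_path.write_bytes(redact(pdf_path.read_bytes()))
        written.append(out_path)
    return written
```

Writing to a separate folder with a `_redacted` suffix keeps the originals untouched, which matters because redaction can't be undone.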

What Automated Detection Covers

PII Categories:

  • Person names (detected through machine learning models)
  • Email addresses
  • Phone numbers (multiple formats)
  • Social Security Numbers
  • Credit card numbers
  • Street addresses
  • Dates of birth
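Some of these categories can be checked with more than a pattern. Credit card numbers, for instance, carry a Luhn check digit, so a detector can discard digit runs that merely look like card numbers. Here's a sketch of that idea. The candidate regex and function names are my own, and this isn't a claim about how any particular product does it.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum, ignoring spaces and hyphens."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0

def find_card_numbers(text: str) -> list:
    """Return digit runs that look like card numbers and pass the Luhn check."""
    return [m.group(0) for m in CARD_CANDIDATE.finditer(text) if luhn_valid(m.group(0))]

print(find_card_numbers("Card 4111 1111 1111 1111 on file, ref 4111 1111 1111 1112"))
# prints ['4111 1111 1111 1111']
```

The checksum cuts false positives such as invoice or tracking numbers, but for redaction you may prefer to err toward removing anything that matches the candidate pattern.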

Pattern Matching:

  • Custom regex patterns (your account number formats, internal IDs)
  • Company names and logos
  • Specific terms or phrases you define
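Custom terms and custom regex can be folded into one detector. A minimal sketch, where the `ACC-` account format and the company name are invented examples and `build_custom_detector` is a hypothetical helper:

```python
import re

def build_custom_detector(terms: list, regexes: list) -> "re.Pattern":
    """Combine literal terms and raw regex fragments into one case-insensitive pattern."""
    # Longer terms first, so "Acme Corporation" wins over a shorter "Acme".
    literal_parts = [re.escape(t) for t in sorted(terms, key=len, reverse=True)]
    regex_parts = [f"(?:{r})" for r in regexes]
    return re.compile("|".join(literal_parts + regex_parts), re.IGNORECASE)

detector = build_custom_detector(
    terms=["Acme Corporation", "Acme"],
    regexes=[r"\bACC-\d{6}\b"],  # hypothetical internal account number format
)
print(detector.findall("acme corporation holds account ACC-004211."))
# prints ['acme corporation', 'ACC-004211']
```

Escaping the literal terms matters: a company name containing a dot or parenthesis would otherwise be read as regex syntax.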

Mixed Content:

  • Text layers in native PDFs
  • OCR for scanned documents
  • Text embedded in images
  • Handwritten content (with limitations)

PaperVeil: Built for This

PaperVeil is a redaction tool designed specifically for preparing documents for LLM processing.

The interface is straightforward:

  1. Upload your PDF. Drag and drop or click to select.

  2. Choose what to redact:

    • Toggle PII types: Person Name, Email Address, Phone Number, Social Security Number, Credit Card Number, Street Address, Date of Birth
    • Add custom text or regex patterns
    • Specify logo text to remove (company names in headers/footers)

  3. Execute Redaction. Click the button.

  4. Review the Output Manifest. See exactly what was detected and removed.

  5. Download the clean PDF. Ready for ChatGPT, Claude, or any LLM.

Why This Works Better

Speed: What takes 30 minutes manually takes 30 seconds automatically. Upload, click, done.

Consistency: Every document processed the same way. No "I forgot to check that section" errors.

Coverage: Automated detection catches things humans miss. The SSN in the footer. The email in the signature block. The phone number in the scanned letterhead.

Auditability: The output manifest shows exactly what was found and removed. You have a record for compliance.

Mixed media: Scanned documents, image-based PDFs, and documents with embedded graphics all get processed correctly.

Choosing the Right Method

Situation                                  | Recommended Approach
-------------------------------------------|------------------------------------
One-off document, minimal sensitive data   | Adobe Acrobat Pro (if you have it)
Occasional use, budget-conscious           | Desktop app like PDF-XChange
Regular AI workflow, multiple documents    | Automated tool (PaperVeil)
Highly sensitive documents                 | Automated + manual review
Documents with scanned content             | Automated with OCR capability

For most people preparing documents for AI analysis, automated redaction provides the best balance of security, speed, and reliability.

The Complete Workflow: Redact, Then AI

Let me walk through the complete process:

Step 1: Assess Your Document

Before redacting, identify what needs to go:

  • Personal information: Names, contact details, IDs
  • Financial data: Account numbers, amounts (if sensitive)
  • Company identifiers: Names of parties in contracts
  • Dates: If they identify specific individuals
  • Custom data: Industry-specific identifiers

Step 2: Choose Detection Settings

For a typical contract going to AI for summarization:

Enable:

  • Person Name
  • Email Address
  • Phone Number
  • Street Address

Custom patterns:

  • Company names in the agreement
  • Case numbers or reference IDs

Step 3: Process the Document

Upload to your redaction tool and run detection. Review what was found:

Detection Results:
- 12 person names found
- 3 email addresses found
- 2 phone numbers found
- 4 street addresses found
- "Acme Corporation" found 8 times
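A results summary like this can double as an audit record. Here's one possible shape for it, sketched as plain Python. The field names are assumptions for illustration, not PaperVeil's actual manifest format. Note that it records labels and page numbers only, because a manifest that stored the matched text would itself be a leak.

```python
import json
from collections import Counter

def build_manifest(filename: str, detections: list) -> dict:
    """Summarize detections given as (label, page) pairs, without storing matched text."""
    by_label = Counter(label for label, _page in detections)
    return {
        "file": filename,
        "counts": dict(by_label),
        "pages_touched": sorted({page for _label, page in detections}),
        "total": sum(by_label.values()),
    }

manifest = build_manifest(
    "contract.pdf",
    [("person_name", 1), ("person_name", 3), ("email", 3)],
)
print(json.dumps(manifest, indent=2))
```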

Step 4: Verify the Output

Open the redacted PDF and spot-check:

  • Are black boxes where expected?
  • Does the document still make sense?
  • Is context preserved for AI analysis?
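A visual spot-check can be backed up with a text check. The sketch below assumes you've already extracted the text of the redacted PDF with an extractor of your choice (that step is not shown) and that you have a list of terms that must not appear. `find_leaks` is a hypothetical helper name.

```python
def find_leaks(extracted_text: str, sensitive_terms: list) -> list:
    """Return the sensitive terms that still appear in the extracted text."""
    # Collapse whitespace and ignore case, since PDF text extraction
    # often breaks lines and spacing in unexpected places.
    haystack = " ".join(extracted_text.split()).casefold()
    return [
        term for term in sensitive_terms
        if " ".join(term.split()).casefold() in haystack
    ]

text_from_redacted_pdf = "Agreement between\nAcme  Corporation and [PERSON]"
print(find_leaks(text_from_redacted_pdf, ["Acme Corporation", "Jane Doe"]))
# prints ['Acme Corporation']
```

An empty list is what you want. Anything else means the tool only covered that text, or missed it entirely, and the file shouldn't go anywhere yet.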

Step 5: Upload to AI

Once the output passes your spot-check, the document is ready for LLM processing. With identifying details replaced by placeholders, the AI receives something like:

"Agreement between [COMPANY] and [PERSON], dated [DATE], for provision of consulting services at [ADDRESS]. Payment terms: Net 30. Total contract value: $50,000..."

The LLM can summarize the agreement, extract key terms, answer questions about content, and compare against other documents.

But it can't identify the parties, link to real individuals, or expose confidential relationships.
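If you're sending extracted text rather than the PDF itself, the same placeholder idea can be sketched in a few lines. This works on plain text only and does nothing to the PDF file. The patterns, labels, and the `mask` function are illustrative assumptions.

```python
import re

PLACEHOLDER_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
]

def mask(text: str, custom_terms: dict = None) -> str:
    """Replace pattern matches and known terms with typed placeholders."""
    for label, pattern in PLACEHOLDER_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    # custom_terms maps a literal term to the placeholder label to use for it.
    for term, label in (custom_terms or {}).items():
        text = re.sub(re.escape(term), f"[{label}]", text, flags=re.IGNORECASE)
    return text

print(mask(
    "Agreement between Acme Corporation and Jane Doe (jane@example.com).",
    {"Acme Corporation": "COMPANY", "Jane Doe": "PERSON"},
))
# prints Agreement between [COMPANY] and [PERSON] ([EMAIL]).
```

Typed placeholders keep the sentence structure intact, so the model can still tell that a company and a person are parties to the agreement.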

Mistakes to Avoid

Using Annotation Instead of Redaction

Drawing black boxes with comment tools doesn't remove underlying text. Always use dedicated redaction features and verify by trying to select text under the boxes.

Forgetting Metadata

PDFs contain author names, organization info, creation dates, and revision history. Run "Remove Hidden Information" after redacting visible content.

Missing Headers and Footers

Letterhead, page footers, and running headers often contain company names and contact info. Easy to overlook when focusing on body text.

Ignoring Image-Based Content

Scanned documents and embedded images require OCR-based redaction. Standard text redaction won't touch them.

Over-Redacting

Removing too much makes documents useless for analysis. Redact identifying information, but keep context the AI needs.

Not Keeping Originals

Always save the original unredacted document securely. Redaction is irreversible. You may need originals for legal or business purposes.

Integration with AI Workflows

For organizations processing documents regularly, redaction should be part of the pipeline:

Document Received (email, upload, API)
       ↓
[Automated Redaction]
       ↓
Sanitized Document
       ↓
[LLM Processing]
- Summarization
- Extraction
- Classification
       ↓
Results Delivered

This can be automated with tools like n8n or Zapier:

  1. Trigger: New email with PDF attachment
  2. Action 1: Send PDF to PaperVeil API for redaction
  3. Action 2: Send redacted PDF to OpenAI/Claude for analysis
  4. Action 3: Deliver results to user or downstream system

The manual step of opening each PDF and redacting disappears. Documents flow through the pipeline automatically.
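In code, the important property of that pipeline is ordering: the LLM step only ever sees redacted bytes, and nothing is sent if a check finds leftovers. Here's a minimal sketch with the three stages passed in as plain callables. They are stand-ins for whatever redaction service, verification step, and LLM client you use; no real vendor API is shown or implied.

```python
from typing import Callable

def process_document(
    pdf_bytes: bytes,
    redact: Callable[[bytes], bytes],
    leak_check: Callable[[bytes], list],
    analyze: Callable[[bytes], str],
) -> str:
    """Redact first, verify, and only then hand the document to the LLM step."""
    redacted = redact(pdf_bytes)
    leaks = leak_check(redacted)
    if leaks:
        # Fail closed: an incompletely redacted file never leaves the pipeline.
        raise ValueError(f"redaction incomplete, {len(leaks)} sensitive item(s) remain")
    return analyze(redacted)
```

Failing closed is the design choice that matters here. A pipeline that logs a warning and uploads anyway gives you the same false sense of security as a black box drawn over text.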

The Bottom Line

PDF redaction isn't complicated once you understand the fundamentals:

  1. True redaction removes data rather than just hiding it
  2. Traditional tools work but require manual effort for each document
  3. Automated solutions handle detection, batch processing, and mixed content
  4. The workflow matters because redaction should be a step in your pipeline, not a separate task

For occasional use, Adobe Acrobat Pro gets the job done. For regular AI document processing, automated redaction saves hours while reducing the risk of missed sensitive data.

Your documents have information AI can help with. With proper redaction, that information flows to the AI while the sensitive details stay protected.


PaperVeil lets you redact all your sensitive information from PDFs in a simple drag-and-drop flow. Detect and remove PII, match custom patterns, strip metadata, and generate audit trails. The redaction layer that makes AI document processing actually safe.