Tags: Architecture, Presidio, Developer Edition, PII Protection

Adding Reversible Tokenization to a Presidio Pipeline with Protegrity AI Developer Edition

A practical migration from straightforward redaction to reversible, format-preserving protection so downstream AI workflows could still reason over tickets without ever seeing real PII.

Migration shape: Processor swap
Main gain: Reversible tokenization
Downstream effect: AI keeps useful signal

I've been using Microsoft Presidio for PII detection in a CLI tool that sanitizes support tickets. It works well. Point it at text, get back detected entities, redact or mask as needed. For a side project processing customer support tickets (names, emails, phone numbers, addresses, and potentially more sensitive identifiers), I had a working pipeline in about an hour. Presidio's analyzer is accurate, the recognizer registry is extensible, and for straightforward detection-and-redact workflows, it just works.

But I hit a wall when my use case evolved.

Where I needed more than redaction

The support ticket pipeline started simple: detect PII, redact it, pass sanitized tickets downstream for analysis. But once I started routing tickets with an AI model, the redacted output became a problem.

When every customer is <PERSON> and every phone number is <PHONE_NUMBER>, the model can't tell one customer from another. You can't correlate a repeat caller. You can't spot patterns across a customer's history. The data is protected, yes, but it's also flattened into uselessness for anything beyond single-ticket processing.

What I actually needed was reversible, format-preserving protection. Values that look structurally valid (so downstream systems don't choke), that are consistent (same input produces the same token), and that authorized users can reverse when needed.
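To make "consistent" and "format-preserving" concrete, here is a minimal sketch using only the Python standard library. It is not how Developer Edition tokenizes anything. It is a one-way keyed substitution, so it cannot be reversed. The names pseudonymize_digits and SECRET_KEY are hypothetical, chosen for illustration.

```python
import hashlib
import hmac
import string

# Hypothetical demo key; a real system would manage keys centrally.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize_digits(value: str, key: bytes = SECRET_KEY) -> str:
    """Swap each ASCII digit for a keyed pseudo-random digit; keep everything else."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    i = 0
    for ch in value:
        if ch in string.digits:
            # Derive the replacement digit from the keyed digest of the whole value.
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            # Punctuation and spacing survive, so the shape stays phone-like.
            out.append(ch)
    return "".join(out)


phone = "(415) 555-0132"
token = pseudonymize_digits(phone)
```

The same input and key always give the same token, and the parentheses, space, and dash stay where they were. What this sketch lacks is exactly what the rest of this post is about: a governed way to get the original back.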

Presidio gets you part of the way there. Its anonymizer can redact, mask, replace, hash, and even encrypt detected PII. Encryption is reversible, but it is not the same as enterprise tokenization: the ciphertext no longer looks like a name or a phone number, and reversal depends on who holds the key rather than on a governed access policy. Presidio can be extended, but format-preserving, policy-controlled tokenization is not something it provides out of the box.

Enter Protegrity AI Developer Edition

What caught my attention: Protegrity AI Developer Edition actually uses Presidio internally as one of its detection models. So the detection accuracy I was already relying on carries over. Developer Edition layers additional ML and pattern-matching models on top.

The difference is what happens after detection. Instead of redaction, Developer Edition applies tokenization: format-preserving, reversible, policy-controlled. A name becomes a different name-shaped string. A phone number becomes a different valid-looking phone number. The structure survives, but the real values don't.

What the migration looked like

I had already isolated my Presidio code behind a processor abstraction (BaseProcessor with a process_text() method), so the swap was mechanical:

Before, presidio_processor.py (117 lines):

  • Manual AnalyzerEngine setup
  • spaCy model download (560MB en_core_web_lg)
  • Custom recognizer registration
  • Separate AnonymizerEngine pass
  • Manual result mapping to my data model

After, protegrity_processor.py (62 lines):

  • One find_and_protect() API call
  • Detection + protection in a single pass
  • No local model management

The rest of the app (workflow logic, data models, reporters, I/O) was unchanged. In total, six files were touched and no business logic was rewritten.
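The shape of that abstraction can be sketched roughly as follows. This is not the code from the repo: the fields on ProcessResult, the RegexRedactor stand-in backend, and the sanitize_ticket helper are all hypothetical, and the regex is a toy replacement for real detection.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
import re


@dataclass
class ProcessResult:
    """Hypothetical result model: protected text plus the entity types found."""
    text: str
    entity_types: list[str] = field(default_factory=list)


class BaseProcessor(ABC):
    @abstractmethod
    def process_text(self, text: str) -> ProcessResult:
        """Detect and protect PII in text."""


class RegexRedactor(BaseProcessor):
    """Stand-in backend; a real one would wrap Presidio or Developer Edition."""

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def process_text(self, text: str) -> ProcessResult:
        redacted, count = self.EMAIL.subn("<EMAIL_ADDRESS>", text)
        return ProcessResult(redacted, ["EMAIL_ADDRESS"] * count)


def sanitize_ticket(processor: BaseProcessor, ticket: str) -> str:
    # Business logic sees only the interface, so backends can be swapped freely.
    return processor.process_text(ticket).text


clean = sanitize_ticket(RegexRedactor(), "Contact jane.doe@example.com please")
```

Because sanitize_ticket only knows about BaseProcessor, replacing the backend means writing one new subclass and changing which one gets constructed.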

Output comparison

Presidio redaction:
Customer: <PERSON>
Email: <EMAIL_ADDRESS>
Phone: <PHONE_NUMBER>
Account Number: <ACCOUNT_NUMBER>

Developer Edition tokenization:
Customer: [PERSON]Pf8q4 kLXJbD7[/PERSON]
Email: [EMAIL_ADDRESS][email protected][/EMAIL_ADDRESS]
Phone: [PHONE_NUMBER](157) 557-5056[/PHONE_NUMBER]
Account Number: [ACCOUNT_NUMBER]81124662[/ACCOUNT_NUMBER]

The second version is still usable data. An AI model can distinguish between customers, route tickets, and analyze patterns. It just never sees real PII.
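As a small illustration of what "still usable" means, the sketch below groups tickets by their PERSON token to find a repeat customer. It assumes the [ENTITY]token[/ENTITY] tag style shown in the output above. The ticket texts, the second token, and the tokens_by_type helper are made up for the example.

```python
import re
from collections import Counter

# Matches the [ENTITY]token[/ENTITY] tag style from the tokenized output.
TAG = re.compile(r"\[([A-Z_]+)\](.*?)\[/\1\]")


def tokens_by_type(text: str, entity_type: str) -> list[str]:
    """Return the token values tagged with the given entity type."""
    return [value for etype, value in TAG.findall(text) if etype == entity_type]


# Illustrative protected tickets; no real PII appears anywhere.
tickets = [
    "Customer: [PERSON]Pf8q4 kLXJbD7[/PERSON] reports a billing error.",
    "Customer: [PERSON]Qz1mW rTnVcA2[/PERSON] cannot log in.",
    "Customer: [PERSON]Pf8q4 kLXJbD7[/PERSON] is following up on billing.",
]

counts = Counter(t for ticket in tickets for t in tokens_by_type(ticket, "PERSON"))
repeat_customers = [tok for tok, n in counts.items() if n > 1]
```

With <PERSON> placeholders, all three tickets would look like they came from the same anonymous customer. With consistent tokens, the first and third are linked and the second is clearly someone else.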

The capability I gained: authorized reversal

Once protection is tokenization rather than redaction, you can reverse it. Selectively. Developer Edition's find_and_unprotect() call recognizes entity tags, sends tokens to the protection service, and returns originals. But only if your role allows it.

In my pipeline:

  • An ai-support-agent role can process and route the protected ticket. It cannot reverse.
  • A human-support-agent role can call unprotect and recover the original values.

This is policy-controlled, not code-controlled. In production, who can detokenize is a centralized policy decision. For the demo I simulate the role gate in the processor, but the enforcement model is the same.
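A simulated role gate of that kind might look like the sketch below. It does not call the Developer Edition API. POLICY, VAULT, and unprotect_text are hypothetical stand-ins for the central policy, the protection service, and the unprotect call, and the original value in the vault is invented.

```python
import re

# Matches the [ENTITY]token[/ENTITY] tag style used in protected text.
TAG = re.compile(r"\[([A-Z_]+)\](.*?)\[/\1\]")

# Hypothetical policy table: which roles may reverse tokens.
POLICY = {
    "ai-support-agent": {"unprotect": False},
    "human-support-agent": {"unprotect": True},
}

# Hypothetical in-memory vault standing in for the protection service.
VAULT = {"Pf8q4 kLXJbD7": "Jane Example"}


def unprotect_text(text: str, role: str) -> str:
    """Replace tagged tokens with originals, but only if the role's policy allows it."""
    if not POLICY.get(role, {}).get("unprotect", False):
        raise PermissionError(f"role {role!r} may not unprotect")
    # Tokens missing from the vault are left tagged rather than guessed.
    return TAG.sub(lambda m: VAULT.get(m.group(2), m.group(0)), text)


protected = "Customer: [PERSON]Pf8q4 kLXJbD7[/PERSON]"
restored = unprotect_text(protected, "human-support-agent")
```

Calling unprotect_text(protected, "ai-support-agent") raises PermissionError, and so does any role that is not in the table. In a real deployment the decision would live in the central policy rather than in a dictionary inside the processor.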

With Presidio's redaction, this isn't possible because the original value is removed from the output. Presidio's encryption can be reversed by anyone holding the key, but redaction, masking, replacement, and hashing cannot be reversed at all.

Practical notes if you're considering this

Keep your Presidio abstraction. If your code imports AnalyzerEngine directly in business logic, swapping the backend will be much harder. Push all Presidio interaction into one module with a clean interface. My abstract base class with process_text() -> ProcessResult made the swap trivial.

You lose local-only operation. Presidio runs entirely locally. Developer Edition calls an API endpoint. If air-gapped local processing is a hard requirement, that's a real tradeoff.

Detection coverage expands. Presidio's detection is one of multiple models in Developer Edition. I noticed it catching entity types I hadn't written custom recognizers for.

spaCy dependency goes away. No more managing a 560MB model download in CI/CD, and no more version conflicts with other NLP libraries in your stack.

Summary

Presidio gave me solid PII detection and got the project off the ground fast. When I needed format-preserving tokenization with governed, role-based reversal, Protegrity AI Developer Edition picked up where Presidio left off while keeping Presidio's detection under the hood. The migration was an afternoon of work, mostly because I'd kept the Presidio code isolated.

The repo has a presidio branch (baseline) and a protegrity branch (migrated). Run git diff presidio..protegrity --stat to see exactly what changed, or browse the project directly on GitHub.