OpenAI's Privacy Filter vs GLiNER on 600 PII samples

Reddit r/LocalLLaMA / 5/1/2026


Key Points

  • The post compares two open-weight PII text detection models—OpenAI’s privacy-filter (sparse MoE, ~50M active params per forward pass) and GLiNER large-v2.1 (~300M params)—running locally on CPU, finding privacy-filter processes text faster (about 2.8 samples/sec vs ~1.1 for GLiNER).
  • On a 600-sample evaluation set (400 English + 200 multilingual from ai4privacy/pii-masking-400k across six PII categories), strict scoring made openai/privacy-filter look much worse than GLiNER due to a GPT-style BPE token offset mismatch; under boundary-overlap scoring, privacy-filter’s overall results actually surpass GLiNER’s.
  • Category-level results show OpenAI’s privacy-filter performs best on PERSON, EMAIL, PHONE, and DATE (while GLiNER is stronger on ADDRESS), and EMAIL detection is nearly solved for GLiNER (very high F1 in both English and multilingual).
  • Threshold tuning substantially affects GLiNER: the default 0.5 threshold underperforms, while 0.7 yields roughly +8 F1 versus default on this dataset.
  • Practical trade-offs are emphasized: choose GLiNER for higher recall (e.g., where misses are unacceptable) and flexible zero-shot entity types beyond OpenAI’s built-in eight, while choosing privacy-filter for better precision and faster CPU throughput; additionally, privacy-filter requires trust_remote_code=True and an unreleased Transformers dev branch model class.

Both models are open weight, both run on a local CPU workstation, both detect PII in text. Quick rundown of what I found.

GLiNER large-v2.1 is ~300M params and zero-shot: you pass entity types as plain text strings at inference.

openai/privacy-filter is 1.5B params total but only ~50M active per forward pass thanks to a sparse MoE architecture.

In practice on CPU openai/privacy-filter ran ~2.8 samples/sec vs ~1.1 for GLiNER large.

Eval was 400 English + 200 multilingual samples from ai4privacy/pii-masking-400k, six PII categories.

The catch: openai/privacy-filter uses GPT-style BPE tokenization, which prepends a space to most tokens. So when you decode token offsets back to character spans, everything is off by one character. Score with strict exact match and openai/privacy-filter looks awful; score with boundary overlap (any character overlap, correct label) and it actually wins overall.
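The strict vs boundary-overlap distinction is simple to sketch in code (these function names and tuples are mine, not the post's pipeline; a span is `(start, end, label)` over characters):

```python
def strict_match(pred, gold):
    """Exact character span and label must both match."""
    return pred == gold

def boundary_match(pred, gold):
    """Any character overlap with the correct label counts as a hit."""
    ps, pe, pl = pred
    gs, ge, gl = gold
    return pl == gl and ps < ge and gs < pe

# A prediction shifted one character left by the BPE leading space:
gold = (10, 18, "PERSON")
pred = (9, 18, "PERSON")

print(strict_match(pred, gold))    # False
print(boundary_match(pred, gold))  # True
```

Under strict scoring the one-character shift is a miss plus a false positive; under boundary overlap it is a correct detection.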

English macro F1:

Model                   Strict   Boundary   Partial
GLiNER large-v2.1       0.367    0.416      0.392
openai/privacy-filter   0.155    0.498      0.326

The 0.34 strict-to-boundary gap for openai/privacy-filter is entirely tokenizer offset, not real misses.

Per category on boundary, openai/privacy-filter wins PERSON, EMAIL, PHONE, DATE. GLiNER wins ADDRESS. EMAIL is essentially solved (0.987 English, 1.000 multilingual).

GLiNER threshold tuning matters. The default 0.5 leaves F1 on the table; 0.7 was best for this dataset, ~8 F1 points better than default.
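A threshold sweep is just filtering scored predictions and recomputing F1 at each cutoff. A minimal sketch with synthetic scored spans (illustrative data, not the post's numbers):

```python
def f1_at_threshold(preds, gold, threshold):
    """preds: list of (span, score); gold: set of spans.
    Micro F1 after dropping predictions below the threshold."""
    kept = {span for span, score in preds if score >= threshold}
    if not kept or not gold:
        return 0.0
    tp = len(kept & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(kept)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Illustrative scored predictions (not the post's data):
gold = {("PERSON", 0, 8), ("EMAIL", 20, 35)}
preds = [(("PERSON", 0, 8), 0.92),
         (("EMAIL", 20, 35), 0.75),
         (("DATE", 40, 50), 0.55)]  # false positive below 0.7

best = max((0.3, 0.5, 0.7, 0.9), key=lambda t: f1_at_threshold(preds, gold, t))
print(best)  # 0.7
```

With a real GLiNER model you would sweep the `threshold` argument to its prediction call the same way; here a higher cutoff drops the low-confidence false positive without costing recall.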

If you want recall above all (e.g. redaction where misses are unacceptable), GLiNER. If you want precision and faster CPU throughput, openai/privacy-filter. If you need custom entity types beyond the eight openai/privacy-filter ships with, GLiNER's zero-shot interface is the only option.

One annoyance worth knowing: openai/privacy-filter requires trust_remote_code=True and the dev branch of transformers. The model class hasn't landed in a stable release yet.

Full numbers, multilingual breakdown, the threshold sweep, all the code in comments below 👇

Disclosure: I work on Neo AI Engineer, and the eval pipeline was built and executed by Neo from a single prompt. I reviewed the methodology and validated the results before publishing. The numbers and findings stand on their own.

submitted by /u/gvij