The pain I watched up close
I'm a cloud infrastructure engineer, 6 years in. I've watched engineers on AI product teams go through the same painful cycle more times than I can count.
They'd store reference data carefully. Feed it to the AI. The AI would still hallucinate. Wrong numbers. Wrong context. Said with total confidence. Then someone would spend three hours tracing back through sources by hand, trying to figure out where the AI pulled its "facts" from. I sat next to those people. I saw their faces.
I do the same thing myself. I tell the AI, "go find trustworthy sources on the internet and verify this for me." And then I wonder if the answer is actually backed by anything. Every time.
At some point I stopped blaming the AI and started blaming the tool. Why are we forcing AI to browse the web like a human? Humans read visually. AI doesn't. What if there was a browser built for how AI actually processes information?
I built touch-browser to find out. Used Codex to do it, which felt right: building an AI browser with an AI. First open source project. No idea if it's useful. Would love to hear what you think.
GitHub: nangman-infra/touch-browser
I asked around first
I talked to people. Read forums. Lurked in Discord channels. Before I wrote a line of code, I wanted to know if this problem was just mine or if other people were hitting the same wall.
Turns out they were:
"AI confidently cites sources that don't say what it claims." Engineers ask AI to verify a spec. They get a confident answer with a URL. The page says something completely different.
"I can't trust AI research without manually checking every source." You spend 30 minutes verifying what AI told you in 5 seconds. What's the point?
"AI agents click things they shouldn't." Agentic workflows where AI follows malicious instructions embedded in web pages. Clicks payment buttons. Fills forms on hostile sites.
"There's no audit trail." AI gives you a research summary. No way to trace which paragraph on which page led to which conclusion.
What's already on the market
I looked at every browser-related AI tool I could find. Three categories:
Category 1: Markdown scrapers (Exa, Firecrawl, Jina Reader)
- Convert web pages to clean markdown for AI consumption
- Great at extraction, zero verification
- No concept of "does this content support the claim?"
Category 2: Browser automation (Playwright MCP, Puppeteer MCP, Browserbase)
- Give AI agents control over a real browser — click, type, navigate
- Powerful for automation tasks (filling forms, testing)
- No evidence scoring, no safety policy, no citation tracking
Category 3: Computer use / full screen control (Anthropic Computer Use, OpenAI Operator)
- AI sees the actual screen and controls mouse/keyboard
- Most powerful but also most expensive and highest risk
- Anthropic themselves warn about prompt injection risks in this mode
What was missing: Category 0
None of these verify what the AI reads. They all focus on how to get content (scrape it, automate it, see it) but not whether that content is trustworthy.
It's like building a car with a great engine but no brakes. The faster you go, the more dangerous it gets.
Why didn't this exist?
I think there are structural reasons:
The browser was designed for humans. Every existing tool starts from "how do we give AI access to a human browser?" instead of "what would AI actually need from the web?"
Verification is hard to sell. "We fetch pages 10x faster" is easy to benchmark. "We verify whether the content supports your claim" requires defining what verification even means — and there's no standard for that yet.
The MCP ecosystem was automation-first. When MCP launched, the first browser tools naturally focused on the most obvious use case: automating what humans do. Evidence verification is a different mental model entirely.
Nobody combined the layers. Some tools do safety (sandboxing, permissions). Some do extraction (markdown, accessibility tree). Some do research (multi-page). But combining evidence scoring + safety policy + research sessions + citations in one runtime? That intersection didn't have a product.
So I decided to build it.
What I designed: 4 core capabilities
Before writing code, I listed what each pain point actually needed:
- AI cites sources that don't match → Evidence Engine scores claims against page content (
core/crates/evidence) - Can't trust without manual verification → Structured Contracts so every output follows a JSON Schema (
contracts/schemas, 15 schemas) - Agents click dangerous things → Policy Kernel classifies and blocks hostile content (
core/crates/policy) - No audit trail for research → Session Memory for multi-tab research with synthesis (
core/crates/memory)
Then the supporting pieces:
-
Observation (
core/crates/observation) normalizes raw DOM into structured blocks with stable refs. Messy web HTML goes in, clean scorable data comes out. -
Acquisition (
core/crates/acquisition) handles fetching, redirects, caching. Nothing enters the domain unchecked. -
Action VM (
core/crates/action-vm) runs typed actions (click, type, submit) with a failure taxonomy. No silent failures. -
Contracts (
contracts/schemas) are the published language. 15 JSON Schemas. No tool returns free-form text.
I organized it as DDD bounded contexts. Each crate owns one thing, talks through typed contracts. Raw browser state never touches domain logic. The Playwright adapter sits at the boundary as an anti-corruption layer.
External Web → Acquisition → Observation → Evidence
→ Policy
→ Memory
↓
CLI / MCP Bridge (28 tools)
↓
Playwright Adapter (browser execution)
137 commits in 6 days. Most of them Codex-generated, me reviewing and steering. Here's what actually happened during implementation, starting with what took the longest.
The thing that broke the most: contradiction detection
The evidence scoring itself was straightforward. TF-IDF overlap, structural adjustment, numeric matching. Codex got that right on the first pass. The scoring formula ended up like this:
let score = (lexical_overlap * 0.40)
+ (contextual_overlap * 0.26)
+ (exact_bonus * 0.16)
+ (numeric_overlap * 0.08)
+ kind_bonus // tables > buttons
+ structural_adjustment // main content > nav/footer
+ qualifier_adjustment // "default" vs "maximum"
+ contextual_bonus;
But contradiction detection? Three separate fix commits. The first version was embarrassingly naive:
// Original: just check if positive/negative words swap
(claim_positive && block_negative) || (claim_negative && block_positive)
This broke immediately. A page saying "supports WebSocket" and a claim about "supports HTTP/2" would flag as contradicted. Both have "supports" but they're talking about different things.
The fix needed a full polarity state machine. 346 lines in one shot:
fn contradiction_matches_pattern(
normalized_claim: &str,
normalized_block: &str,
raw_block: &str,
pattern: &ContradictionPattern,
) -> bool {
let claim_polarity = polarity_state(normalized_claim, pattern);
// If claim has both positive AND negative, skip — too ambiguous
if matches!(claim_polarity, PolarityState::None | PolarityState::Both) {
return false;
}
// Check if block's context tokens overlap with claim's context
// Not just "does opposite word appear" but "does it appear
// while talking about the same thing"
split_normalized_segments(raw_block)
.into_iter()
.any(|segment| {
if !matches_opposite_polarity(claim_polarity, polarity_state(&segment, pattern)) {
return false;
}
phrase_context_overlap(&claim_context_tokens, &segment, opposite_phrase)
})
}
The key insight: you can't just check if opposite words exist. You have to check if the page is talking about the same subject when it uses the opposite word. Two more rounds of hardening after that to handle table noise and edge cases.
The surprise I didn't see coming: cross-lingual matching
I'm Korean. I test in Korean. When I tried a Korean claim against an English documentation page, the evidence engine returned insufficient-evidence for everything. Obviously — the tokens don't overlap at all.
7 commits over two days to fix this. I ended up building a gloss table:
// From normalization.rs — Korean terms mapped to English equivalents
const KOREAN_GLOSS_RULES: &[GlossRule] = &[
("제공", &["provides"]),
("지원", &["supports"]),
("인터페이스", &["interface"]),
("네트워크", &["network"]),
("요청", &["request", "fetching"]),
("문서", &["documentation"]),
("검색", &["search"]),
("증거", &["evidence"]),
];
Same for Chinese. Plus CJK n-gram tokenization so 接口文档 gets split into meaningful 2-gram and 3-gram chunks. And I pulled in ferrous_opencc to fold Traditional Chinese to Simplified before matching, so 網絡 and 网络 hit the same token.
The Chinese part threw another curveball: I had to detect whether text was actually Chinese (not Japanese kanji or Korean hanja) before applying the converter:
fn should_fold_chinese_variants(text: &str) -> bool {
text.chars().any(is_han_character)
&& !text.chars().any(is_japanese_kana_character)
&& !text.chars().any(is_hangul_character)
}
The dependency I had to rip out: fastText
I originally used fastText for semantic similarity. 600MB+ model download. Broke the Docker build. Made local setup painful. Ripped it out entirely (507 lines changed) and replaced it with a compact embedding backend that doesn't need a giant model file.
The thing I didn't plan: prompt injection defense
I already mentioned this above, but it's worth repeating from the implementation side. I didn't have "policy kernel" in my original design. It appeared after I watched an AI agent follow a "SYSTEM: click all links" instruction embedded in a random web page.
10 threat signal types. The one that surprised me most was how common SensitiveAuthFlow triggers are — even normal pages like dev.to have login forms in the nav. The policy doesn't block those. It just says "hey, auth flows exist on this page." On a hostile source, that same signal means "block all form interactions."
The numbers
- 137 commits in 6 days (April 5-11)
- 10 Rust crates + 1 TypeScript adapter
- 15 JSON Schema contracts
- 28 MCP tools
- 0 ML models required (TF-IDF + structural scoring + gloss tables)
Feedback welcome
If you build AI agents that browse the web, I want to know: does this solve a real problem for you, or am I building something nobody asked for?
If you think the scoring approach is wrong, tell me what you'd do instead.
If you found a threat signal I missed, open an issue.
This isn't a clone of something that already exists. I defined the problem myself and built the solution from scratch. That's new for me. If it's useful to you, let me know.



