GET Serves Cache, POST Runs Inference: Cost Safety for a Public LLM Endpoint

Dev.to / 4/27/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The author describes a toy public LLM endpoint (amtaitfy.com) designed to intentionally give wrong answers while keeping abuse bounded and costs predictable.
  • A key architecture choice is to make GET serve only cached responses and allow fresh AI inference exclusively via POST, preventing viral traffic from exploding compute bills.
  • The service uses Cloudflare Turnstile plus request-size checks, cache checks, and session lockouts to throttle prompt-extraction attempts and repeat-query cost amplification.
  • It adds a lightweight “tripwire” for low-effort prompt-extraction probes (e.g., asking to ignore instructions or reveal hidden/system prompts) and responds with a generic refusal so attackers get no useful signal.
  • The article outlines a threat model that covers accidental viral traffic, basic bot/spam, provider outages, and budget exhaustion, while explicitly excluding stronger adversaries like sophisticated botnets and attacks requiring sensitive workloads or authentication.

I built a site that gives deliberately wrong answers using an LLM.

No login. No user API key. Anyone can hit the endpoint.

amtaitfy.com is a toy site that gives intentionally wrong answers, generated by AI. This narrows the engineering problem:

  • Make abuse bounded
  • Make costs predictable
  • Make casual attacks boring

The core architectural decision is simple:

GET serves cache only. POST is the only path that triggers fresh AI inference.

Everything else is defense in depth.

Threat model

In scope:

  • Accidental viral traffic
  • Casual prompt-extraction probes
  • Repeat-query cost amplification
  • Basic bot and spam traffic
  • Provider outages
  • Budget exhaustion

Out of scope:

  • Sophisticated botnets
  • Attackers with unlimited valid Turnstile tokens
  • Full prompt-injection resistance
  • Cache poisoning by determined users
  • Sensitive workloads
  • Anything that should require authentication

The request flow

GET /answer  
  read cache  
  return cached answer or empty state

POST /answer  
  verify Turnstile token  
  reject missing session  
  reject oversized input  
  check session lockout  
  check existing cache  
  call ai provider  
  write cache  
  return answer

GET is cheap. POST is expensive. On purpose.

If a URL gets shared, crawled, screenshotted, bookmarked, or posted somewhere large, none of that triggers inference. It can go viral and cost me nothing. Only an intentional POST triggers inference: the first visitor to a URL may trigger one inference through POST, and every later visitor gets the cached answer from Cloudflare KV. Virality does not balloon cost.
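As a rough sketch, the Worker-side split looks something like the code below. It is illustrative, not the production code: the ANSWER_CACHE binding, the q query parameter, and runInference are all assumptions, and the Turnstile, size, and lockout checks from the flow above are elided.

// Minimal Cloudflare Worker sketch of the GET/POST boundary.
// ANSWER_CACHE, the "q" parameter, and runInference are placeholders.

async function runInference(question: string): Promise<string> {
  // Placeholder for the real provider call (OpenRouter, etc.).
  return `a confidently wrong answer about ${question}`;
}

export default {
  async fetch(request: Request, env: { ANSWER_CACHE: KVNamespace }): Promise<Response> {
    const key = new URL(request.url).searchParams.get("q") ?? "";

    if (request.method === "GET") {
      // GET never touches the model: cached answer or empty state only.
      const cached = await env.ANSWER_CACHE.get(key);
      return cached ? new Response(cached) : new Response(null, { status: 204 });
    }

    if (request.method === "POST") {
      // POST is the only path that can spend money, and only on a cache miss.
      const cached = await env.ANSWER_CACHE.get(key);
      if (cached) return new Response(cached);

      const answer = await runInference(key);
      await env.ANSWER_CACHE.put(key, answer); // cache forever: no TTL
      return new Response(answer);
    }

    return new Response("method not allowed", { status: 405 });
  },
};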

Casual probe friction

I added a small ruleset for obvious prompt-extraction probes:

  • “ignore previous instructions”
  • “print your system prompt”
  • “reveal your hidden prompt”

This is not real prompt-injection defense. It catches low-effort probes and gives me a tripwire.
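A minimal sketch of that kind of tripwire, with illustrative patterns rather than the exact production list:

// Hypothetical tripwire for low-effort prompt-extraction probes.
const PROBE_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/i,
  /print your system prompt/i,
  /reveal your (hidden|system) prompt/i,
];

function looksLikeExtractionProbe(input: string): boolean {
  return PROBE_PATTERNS.some((pattern) => pattern.test(input));
}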

The first version was stupid. When it detected an extraction attempt, it responded with a hostile message and included my actual system prompt, followed by “There will be cake.”

The GLaDOS reference felt clever for about five minutes.

The current response gives no useful matching detail. No prompt content. No explanation of what was caught. Just a generic refusal. The goal is to provide no signal.

Session lockout

When the extraction tripwire fires, the session gets a short lockout.

I store a 60-second KV entry keyed by session. Further POST attempts during that window return a 403 with a countdown.
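A sketch of how that can be done with KV's expirationTtl. The binding and key names are assumptions, not the real ones:

// Hypothetical session lockout: one KV entry per session with a 60-second TTL.
const LOCKOUT_SECONDS = 60; // also KV's minimum allowed TTL

async function lockSession(locks: KVNamespace, sessionId: string): Promise<void> {
  await locks.put(`lock:${sessionId}`, String(Date.now()), { expirationTtl: LOCKOUT_SECONDS });
}

async function lockoutResponse(locks: KVNamespace, sessionId: string): Promise<Response | null> {
  const lockedAt = await locks.get(`lock:${sessionId}`);
  if (!lockedAt) return null; // not locked out
  const elapsed = Math.floor((Date.now() - Number(lockedAt)) / 1000);
  const remaining = Math.max(LOCKOUT_SECONDS - elapsed, 0);
  // 403 plus a countdown the UI can display.
  return new Response(JSON.stringify({ error: "locked out", retryAfterSeconds: remaining }), {
    status: 403,
    headers: { "Content-Type": "application/json" },
  });
}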

The IP lockout I removed

I originally added a second lockout key based on a hash of the user’s IP, to close the obvious session-lockout bypass:

  • normal session gets locked
  • user opens incognito
  • new session cookie
  • same IP
  • lockout still applies

But I removed it.

CGNAT makes IP-based lockouts dangerous. Mobile carriers, corporate networks, apartment complexes, and some home ISPs can place many users behind one external IP. Locking out an IP to stop one bad session creates collateral damage that has an unacceptably large blast radius. For this site, session-only lockout is the better tradeoff. It leaves a known bypass, but avoids locking out innocent users.

Timing leaks

The prompt-extraction regex check returns almost instantly. A model response takes two to five seconds. That difference creates a timing side channel an attacker could use to iterate around the filter.

So all lockout responses now wait until total request time lands in a random window, roughly matching model latency. Randomized latency removes a potential information vector for an attacker.
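A sketch of that padding, with the two-to-five-second window taken from the latency figure above and the function name made up for illustration:

// Hypothetical latency padding: hold the refusal until total request time
// lands somewhere in a ~2-5 second window, like a real model response.
async function withPaddedLatency(startedAt: number, response: Response): Promise<Response> {
  const targetMs = 2000 + Math.random() * 3000;
  const elapsedMs = Date.now() - startedAt;
  if (elapsedMs < targetMs) {
    await new Promise((resolve) => setTimeout(resolve, targetMs - elapsedMs));
  }
  return response;
}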

Cache forever: How GET stays cheap

The cache is the main cost-control mechanism. Repeat prompts should not create repeat inference costs.

But “cache forever” has sharp edges.

The first caller effectively defines the canonical answer, which means the first caller can also define a bad one. I treat the first answer as canonical on purpose: URLs stay shareable, repeat traffic stays free, and the occasional dud is the price.

The cache is not namespaced by prompt version. There is no elegant invalidation layer. If the system prompt changes or a bad answer becomes canonical, the fix is manual cleanup or a broader cache reset.

The future upgrade would be to add a version prefix to cache keys so prompt changes, model changes, or answer-format changes can move to a new cache namespace without serving old entries.

Something like:

cache:v3:<hash(normalized_prompt)>
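A sketch of such a key builder, assuming SHA-256 over a lightly normalized prompt. The normalization and hash choice are illustrative, not the site's actual scheme:

// Hypothetical versioned cache key. Bumping PROMPT_VERSION after a prompt,
// model, or format change means old entries simply stop being read.
const PROMPT_VERSION = "v3";

async function cacheKey(prompt: string): Promise<string> {
  const normalized = prompt.trim().toLowerCase();
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(normalized));
  const hex = [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
  return `cache:${PROMPT_VERSION}:${hex}`;
}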

KV counters vs Durable Objects

I use KV counters for operational telemetry:

  • Daily estimated spend
  • Provider health
  • Probe counts
  • Rough request volume

KV is eventually consistent. Under burst traffic, two near-simultaneous writes can miss each other and produce an undercount.
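The race is easy to see in a sketch of the increment itself (binding and key names are assumptions):

// Hypothetical KV counter bump. The read-modify-write is not atomic, so two
// near-simultaneous requests can read the same value and one increment is lost.
async function bumpCounter(metrics: KVNamespace, name: string): Promise<void> {
  const day = new Date().toISOString().slice(0, 10); // YYYY-MM-DD
  const key = `counter:${day}:${name}`;
  const current = Number((await metrics.get(key)) ?? "0");
  await metrics.put(key, String(current + 1));
}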

Durable Objects would give stronger consistency, but I did not use them.

For this site, the counters are not the final safety mechanism. They are telemetry. Eventual consistency is fine for coarse signals. It is not fine for the only budget guardrail.

When to move to DOs? I have a predefined migration trigger. Request rate is easy to see from Worker analytics. Counter drift would have to be measured by reconciliation: compare KV counters against provider usage or request logs. If reconciliation shows KV estimates drifting materially from provider-reported usage, move counters to Durable Objects.

Provider strategy

I've found free-tier AI providers on OpenRouter to be unreliable, so paid inference is the fallback. However, paid inference means that on an especially viral day, AI spend could spike beyond what I can afford. OpenRouter's daily spend caps are a lifesaver here. Of course, a determined attacker could still burn through the daily budget and push the site into degraded mode.
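Roughly, the selection loop looks like the sketch below. The model names, the budget figure, and callOpenRouter are placeholders; an app-side budget check like this is one way to express it, with the real cap also enforced in OpenRouter's own settings.

// Hypothetical provider fallback: free-tier routes first, paid inference last,
// and nothing gets called once the day's estimated spend has hit the cap.
const DAILY_BUDGET_USD = 5; // illustrative number, not the real cap

async function callOpenRouter(model: string, prompt: string): Promise<string> {
  throw new Error("the real HTTP call to OpenRouter would go here");
}

async function answerWithFallback(prompt: string, estimatedSpendTodayUsd: number): Promise<string | null> {
  if (estimatedSpendTodayUsd >= DAILY_BUDGET_USD) return null; // budget exhausted → degraded mode

  for (const model of ["some-free-model", "another-free-model", "a-paid-model"]) {
    try {
      return await callOpenRouter(model, prompt);
    } catch {
      // Provider failed or timed out; try the next one.
    }
  }
  return null; // every provider failed → degraded mode
}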

Degraded mode UX

When all selected providers fail or I've exhausted my daily budget for inference, the page does not show a generic error. It surfaces a few cached answers as clickable suggestions and shows a retry timer.

The retry timer backs off:

10s → 30s → 2m → 5m

If an upstream provider sends a Retry-After header, the UI honors it.
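A client-side sketch of that schedule (the function name is made up; the intervals are the ones above):

// Hypothetical degraded-mode retry schedule: 10s → 30s → 2m → 5m,
// with an upstream Retry-After header taking precedence when present.
const BACKOFF_MS = [10_000, 30_000, 120_000, 300_000];

function nextRetryDelayMs(attempt: number, retryAfterHeader: string | null): number {
  if (retryAfterHeader !== null) {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds)) return seconds * 1000; // honor the provider's hint
  }
  return BACKOFF_MS[Math.min(attempt, BACKOFF_MS.length - 1)];
}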

This turns an outage into something closer to discovery. The user came for wrong answers. Cached wrong answers are still useful product surface.

This may be my favorite second-order effect in the project.

What I would change if traffic grew

The pressure points would shift. I would:

  • Move counters from KV to Durable Objects
  • Add paid Cloudflare rate limiting
  • Add better cache moderation and purge tooling
  • Add model and prompt version dashboards
  • Add better observability around provider failure modes

And I would keep GET cache-only and POST inference-only as a hard boundary.

If you are building public AI endpoints, I am especially interested in where you draw the line between “cheap enough to tolerate abuse” and “serious enough to justify paid controls.”

Try out the "wrong answers" engine at amtaitfy.com