Building an AI agent that works in a demo is easy. Building one that works reliably in production is a completely different engineering challenge.
Production systems must handle real users, real data, and real consequences when things fail.
This is the production agent architecture I use across Brainfy AI and Navlyt, along with real code patterns and failure modes I design around.
What Makes Production Agents Different From Demo Agents
Demo agents optimize for the happy path.
Production agents must handle:
Real data variance
Production inputs are messy, ambiguous, and full of edge cases.
Concurrent executions
Multiple agent instances running simultaneously with shared state.
Long-running tasks
Agents that may take minutes or hours and therefore need durable execution state.
Cost management
A confused agent making unnecessary tool calls can become expensive quickly.
Observability
You must understand exactly what the agent decided and why.
The Core Architecture: Durable Agent State
The most important production decision:
Keep agent state in a database — not in memory.
In-memory state:
- Dies with the server
- Cannot scale horizontally
- Cannot be audited
Database state:
- Survives restarts
- Enables horizontal scaling
- Provides observability
- Enables debugging
Example schema:
-- Agent execution state table
CREATE TABLE agent_executions (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  user_id UUID REFERENCES auth.users NOT NULL,
  agent_type TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  CONSTRAINT valid_status CHECK (
    status IN (
      'pending',
      'running',
      'completed',
      'failed',
      'cancelled',
      'awaiting_review'
    )
  ),
  input_data JSONB NOT NULL,
  state JSONB DEFAULT '{}',
  result JSONB,
  error TEXT,
  step_count INTEGER DEFAULT 0,
  token_count INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW(),
  completed_at TIMESTAMPTZ
);

-- Tool call log for observability
CREATE TABLE agent_tool_calls (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  execution_id UUID REFERENCES agent_executions NOT NULL,
  step_number INTEGER NOT NULL,
  tool_name TEXT NOT NULL,
  tool_input JSONB NOT NULL,
  tool_output JSONB,
  status TEXT NOT NULL DEFAULT 'pending',
  latency_ms INTEGER,
  error TEXT,
  called_at TIMESTAMPTZ DEFAULT NOW()
);
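The CHECK constraint guards which status values can be stored, but not which transitions between them are legal. A small state machine in application code closes that gap; a minimal sketch (the transition map itself is my assumption, layered on top of the schema rather than enforced by it):

```typescript
// Status values mirror the valid_status constraint on agent_executions.
type ExecutionStatus =
  | 'pending'
  | 'running'
  | 'completed'
  | 'failed'
  | 'cancelled'
  | 'awaiting_review'

// Which transitions each status allows. This map is an assumption:
// the schema only constrains values, not the paths between them.
const TRANSITIONS: Record<ExecutionStatus, ExecutionStatus[]> = {
  pending: ['running', 'cancelled'],
  running: ['completed', 'failed', 'cancelled', 'awaiting_review'],
  awaiting_review: ['running', 'cancelled'],
  completed: [],
  failed: [],
  cancelled: []
}

function isValidTransition(
  from: ExecutionStatus,
  to: ExecutionStatus
): boolean {
  return TRANSITIONS[from].includes(to)
}
```

Checking transitions in one place makes bugs like a worker resurrecting a cancelled execution impossible by construction.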
The Agent Loop With Production Safeguards
Production agents need hard limits.
Example safeguards:
- Step limits
- Token limits
- Timeout limits
- Failure conditions
Example TypeScript loop:
// lib/agents/production-agent.ts
const AGENT_LIMITS = {
  maxSteps: 25,
  maxTokens: 50_000,
  stepTimeoutMs: 30_000, // per-step budget, enforced inside callModel (not shown)
  totalTimeoutMs: 300_000
}

export async function runAgent(
  executionId: string,
  supabase: SupabaseClient
): Promise<void> {
  const startTime = Date.now()
  const execution = await loadExecution(executionId, supabase)

  await updateStatus(executionId, 'running', supabase)

  // The conversation history lives in the durable state column, so a
  // restarted worker picks up exactly where the last one stopped.
  const messages = execution.state.messages ?? []

  while (true) {
    const elapsed = Date.now() - startTime

    // Hard limits: fail loudly instead of looping forever.
    if (execution.step_count >= AGENT_LIMITS.maxSteps) {
      await failWithReason(executionId, 'MAX_STEPS_EXCEEDED', supabase)
      return
    }
    if (execution.token_count >= AGENT_LIMITS.maxTokens) {
      await failWithReason(executionId, 'MAX_TOKENS_EXCEEDED', supabase)
      return
    }
    if (elapsed >= AGENT_LIMITS.totalTimeoutMs) {
      await failWithReason(executionId, 'TOTAL_TIMEOUT', supabase)
      return
    }

    const response = await callModel(messages, TOOLS)

    execution.step_count++
    execution.token_count += response.usage?.total_tokens ?? 0

    // Persist after every step so a crash loses at most one step.
    await persistState(executionId, execution, supabase)

    // No tool calls means the model has produced its final answer.
    if (!response.tool_calls?.length) {
      await updateStatus(executionId, 'completed', supabase)
      return
    }

    for (const toolCall of response.tool_calls) {
      const result = await executeToolCall(toolCall, executionId, supabase)
      messages.push({
        role: 'tool',
        tool_call_id: toolCall.id,
        content: JSON.stringify(result)
      })
    }
  }
}
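The loop calls failWithReason and persistState without showing them. A minimal sketch of what they might look like against the schema from earlier; the narrow Db interface is a structural stand-in for the Supabase client (which satisfies these calls), and the exact column updates are my assumption:

```typescript
// Narrow structural stand-in for the Supabase client; the real
// @supabase/supabase-js client satisfies the same call shape.
interface Db {
  from(table: string): {
    update(values: Record<string, unknown>): {
      eq(column: string, value: string): PromiseLike<{ error: unknown }>
    }
  }
}

// Build the row update for a terminal failure. Kept pure so the
// payload is easy to test in isolation.
function failurePayload(reason: string) {
  return {
    status: 'failed',
    error: reason,
    completed_at: new Date().toISOString()
  }
}

async function failWithReason(executionId: string, reason: string, db: Db) {
  // Record why the run died so the row is auditable later.
  await db.from('agent_executions').update(failurePayload(reason)).eq('id', executionId)
}

async function persistState(
  executionId: string,
  execution: { state: unknown; step_count: number; token_count: number },
  db: Db
) {
  // Write the counters and working state back after every step.
  await db
    .from('agent_executions')
    .update({
      state: execution.state,
      step_count: execution.step_count,
      token_count: execution.token_count,
      updated_at: new Date().toISOString()
    })
    .eq('id', executionId)
}
```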
The Human-in-the-Loop Gate
For actions that are difficult to reverse, I require human approval.
The agent:
- Prepares the action
- Sets status to awaiting_review
- Stops execution
- Waits for approval
Example:
const APPROVAL_REQUIRED_TOOLS = [
  'send_email',
  'update_customer_record',
  'generate_compliance_document',
  'submit_to_regulator'
]

async function executeToolCall(
  toolCall: { id: string; name: string; args: unknown },
  executionId: string,
  supabase: SupabaseClient
) {
  const { name, args } = toolCall

  if (APPROVAL_REQUIRED_TOOLS.includes(name)) {
    // Persist the pause before stopping, so a reviewer can inspect
    // the pending action and nothing irreversible happens meanwhile.
    await updateStatus(executionId, 'awaiting_review', supabase)
    throw new AgentPausedError('Human approval required')
  }

  return await callTool(name, args)
}
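Once a reviewer acts, something has to move the paused execution forward. A sketch of the pure decision step (applyReview is a hypothetical helper of mine; persisting the new status and re-invoking the agent loop are left to the caller):

```typescript
type ReviewDecision = 'approve' | 'reject'

interface PendingExecution {
  status: string
  state: Record<string, unknown>
}

// Given a paused execution and a reviewer's decision, compute the
// next status. Pure by design: the caller persists the result and,
// on approval, re-enters runAgent() to resume from durable state.
function applyReview(
  execution: PendingExecution,
  decision: ReviewDecision
): string {
  if (execution.status !== 'awaiting_review') {
    throw new Error('Execution is not awaiting review')
  }
  return decision === 'approve' ? 'running' : 'cancelled'
}
```

Guarding on the current status means a stale or duplicated review request cannot restart an execution that already finished.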
Monitoring: What I Track in Production
Metrics I monitor:
- Step efficiency
- Tool success rate
- Human review escalation rate
- Token cost per completion
- Completion rate
Example health query:
const { data } = await supabase.rpc('agent_health_metrics', {
  agent_type: 'compliance_document_generator',
  since: new Date(
    Date.now() - 7 * 24 * 60 * 60 * 1000
  ).toISOString()
})
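If you would rather not maintain a database function, the same numbers can be derived client-side from agent_executions rows; a sketch, assuming the row shape from the schema above (the metric names are mine):

```typescript
// Subset of an agent_executions row needed for health metrics.
interface ExecutionRow {
  status: string
  step_count: number
  token_count: number
}

// Aggregate completion rate, step efficiency, token cost, and the
// human review escalation rate over a window of rows.
function healthMetrics(rows: ExecutionRow[]) {
  const finished = rows.filter(r =>
    ['completed', 'failed', 'cancelled'].includes(r.status)
  )
  const completed = finished.filter(r => r.status === 'completed')
  const reviewed = rows.filter(r => r.status === 'awaiting_review')

  return {
    completionRate: finished.length
      ? completed.length / finished.length
      : 0,
    avgSteps: completed.length
      ? completed.reduce((s, r) => s + r.step_count, 0) / completed.length
      : 0,
    avgTokens: completed.length
      ? completed.reduce((s, r) => s + r.token_count, 0) / completed.length
      : 0,
    reviewRate: rows.length ? reviewed.length / rows.length : 0
  }
}
```

A database function stays preferable at scale, since it avoids shipping every row to the client; this version is mainly useful for dashboards and tests.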
Typical results:
- Completion rate: 94%
- Avg steps: 8.3
- Human review rate: 3.1%
Key Lessons
Production agents require:
- Durable state
- Hard execution limits
- Observability
- Cost controls
- Human approval gates
Most failures come from missing safeguards, not model quality.
About the Author
Tilak Raj
Founder & CEO — Brainfy AI
Building vertical AI SaaS across compliance, real estate, agriculture, and aviation.
Website: https://www.tilakraj.info
Projects: https://www.tilakraj.info/projects
Questions about production agents? Drop a comment — I reply to all of them.