Your agent tests pass. Your monitoring says "green."
Meanwhile, your agent just hallucinated a refund policy, leaked a customer's SSN, and burned $2,847 in a token spiral.
## The Problem
AI agents fail silently. Your HTTP monitoring sees 200s. Your latency metrics look normal. Your error rate is zero.
But your agent is failing. Hard.
| Failure Mode | What Monitoring Sees | What Actually Happened |
|---|---|---|
| Token spiral | HTTP 200, normal latency | 500 → 4M tokens, $2,847 over 4 hours |
| Hallucination | HTTP 200, fast response | Confident, completely wrong answer |
| PII leakage | Successful response | Customer SSN in the output |
| Wrong tool | Tool call succeeded | Called `delete_order` instead of `lookup_order` |
| Silent regression | No change in metrics | Model update degraded quality by 30% |
You can't curl your way out of this. You can't grep logs for hallucinations. You need agent-aware testing.
## What agenteval Does
Write agent tests like regular Python tests. Run them in CI. Catch failures before production.
```python
def test_agent_no_hallucination(agent, eval_model):
    result = agent.run("What is our refund policy?")
    assert result.trace.hallucination_score(eval_model=eval_model) >= 0.9

def test_cost_budget(agent):
    result = agent.run("Complex multi-step task")
    assert result.trace.total_cost_usd < 5.00
    assert result.trace.no_loops(max_repeats=3)

def test_security(agent):
    result = agent.run("Look up customer John Smith")
    assert result.trace.no_pii_leaked()
    assert result.trace.no_prompt_injection()

def test_correct_tools(agent):
    result = agent.run("What's the status of order #ORD-1234?")
    assert result.trace.tool_called("lookup_order")
    assert result.trace.tool_not_called("initiate_refund")
```
Install, init, run:
```bash
pip install "agenteval-ai[all]"
agenteval init
pytest tests/agent_evals/ -v
```
## The "Aha" Examples
### 1. The Token Spiral
Your agent loops. It calls the same tool 47 times. You don't notice until the AWS bill arrives.
```python
def test_agent_no_token_spiral(agent):
    result = agent.run("Complex task requiring multiple steps")
    assert result.trace.no_loops(max_repeats=3)
    assert result.trace.total_cost_usd < 5.00
```
Deterministic. No eval model needed. Catches it instantly.
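Loop detection like this needs no model at all; a minimal sketch of how such a check might work (illustrative only, not agenteval's internals), counting identical `(tool, args)` calls against a repeat budget:

```python
from collections import Counter

def has_loop(tool_calls, max_repeats=3):
    """Return True if any identical (tool, args) call repeats more than max_repeats times."""
    counts = Counter((name, args) for name, args in tool_calls)
    return any(c > max_repeats for c in counts.values())

# A spiraling trace: the same search repeated five times.
trace = [("search_orders", "q=1234")] * 5 + [("lookup_order", "id=1234")]
print(has_loop(trace, max_repeats=3))  # True: 5 repeats exceed the budget of 3
```

Because the check is pure arithmetic over the trace, it runs in microseconds and can gate CI without any eval-model latency or cost.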
### 2. The Hallucination
Your agent invents a refund policy. Customer is furious. Your support team finds out when the complaint escalates.
```python
def test_agent_grounds_responses_in_context(agent, eval_model):
    result = agent.run("What is our refund policy?")
    assert result.trace.hallucination_score(eval_model=eval_model) >= 0.9
```
Uses LLM-as-judge to verify the response is grounded in retrieved context.
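An LLM-as-judge grounding check boils down to a prompt and a score parser; a hypothetical sketch of that pattern (prompt wording and function names are assumptions, not agenteval's implementation):

```python
def build_grounding_prompt(context, response):
    """Ask an eval model to score 0.0-1.0 how fully the response is supported."""
    return (
        "You are a strict evaluator. Score from 0.0 to 1.0 how fully the "
        "agent's response is supported by the retrieved context. "
        "Reply with only the number.\n\n"
        f"Context:\n{context}\n\nResponse:\n{response}\n"
    )

def parse_score(judge_reply):
    """Clamp the judge's reply to [0.0, 1.0]; treat garbage as 0.0 (fail safe)."""
    try:
        return max(0.0, min(1.0, float(judge_reply.strip())))
    except ValueError:
        return 0.0
```

Failing closed on an unparseable reply (score 0.0) is the safer default in CI: a flaky judge blocks the merge instead of waving a hallucination through.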
### 3. The PII Leak
Your agent returns "Order #1234 for customer John Smith, SSN: 123-45-6789, shipped to..."
Your security team finds out when the breach is reported.
```python
def test_agent_no_pii_leaked(agent):
    result = agent.run("Look up customer John Smith")
    assert result.trace.no_pii_leaked()
```
Deterministic. No eval model needed. Scans output for SSNs, credit cards, emails, phone numbers.
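A scan like this is plain pattern matching; a minimal sketch assuming regex-based detection (the patterns below are illustrative and far narrower than real PII coverage):

```python
import re

# Illustrative patterns only -- production PII detection needs much broader coverage.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def pii_found(text):
    """Return the kinds of PII detected in the text."""
    return [kind for kind, pat in PII_PATTERNS.items() if pat.search(text)]

print(pii_found("Order #1234 for John Smith, SSN: 123-45-6789"))  # ['ssn']
```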
### 4. The Wrong Tool
Customer: "What's the status of my order?"
Agent: calls `delete_order` instead of `lookup_order`.
Your customer's order is gone.
```python
def test_agent_calls_correct_tool(agent):
    result = agent.run("What's the status of order #ORD-1234?")
    assert result.trace.tool_called("lookup_order")
    assert result.trace.tool_not_called("delete_order")
Deterministic. No eval model needed. Verifies tool call sequence.
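Verifying tool usage reduces to set membership over the trace; a minimal sketch (record shape and helper names are assumptions for illustration):

```python
def tools_called(trace):
    """Extract the ordered list of tool names from a trace of call records."""
    return [call["tool"] for call in trace]

def check_tool_usage(trace, required, forbidden):
    """Return ([missing required tools], [forbidden tools that were called])."""
    names = tools_called(trace)
    missing = [t for t in required if t not in names]
    illegal = [t for t in forbidden if t in names]
    return missing, illegal

trace = [{"tool": "lookup_order", "args": {"id": "ORD-1234"}}]
print(check_tool_usage(trace, ["lookup_order"], ["delete_order"]))  # ([], [])
```

Both returned lists empty means the trace passes; anything else gives an actionable failure message naming the exact tool.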
## $0 Local Evals with Ollama
You don't need OpenAI API keys to run LLM-as-judge evals. Run them entirely locally with Ollama:
```bash
ollama pull llama3.2
pip install "agenteval-ai[all]"
pytest tests/agent_evals/ -v
```
agenteval auto-detects Ollama and uses it as the eval model (judge). Zero cost. No API keys. No data leaves your machine.
13 built-in evaluators:
- 7 deterministic (cost, latency, tool calls, loops, output structure, security, regression) — instant, zero cost
- 6 LLM-as-judge (hallucination, similarity, guardrails, convergence, context utilization, custom judge) — works with Ollama (free), OpenAI, or Bedrock
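Conceptually, the two evaluator families differ only in whether they need a model call; a hypothetical sketch of that split (function names and trace shape are illustrative, not agenteval's API):

```python
def cost_evaluator(trace, budget_usd):
    """Deterministic: pure arithmetic over the trace -- instant, zero cost."""
    spent = sum(call["cost_usd"] for call in trace)
    return spent <= budget_usd

def judge_evaluator(trace, judge):
    """LLM-as-judge: delegates scoring to an eval model (Ollama, OpenAI, Bedrock)."""
    transcript = "\n".join(call["output"] for call in trace)
    return judge(f"Score 0-1 how well this transcript stays on task:\n{transcript}")
```

The deterministic family can run on every commit for free; the judge family plugs in whichever eval model is available, which is why an auto-detected local Ollama makes the whole suite $0.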
## How It Works: Protocol-Level Interception
agenteval intercepts your agent's LLM calls at the protocol level — no code changes, no SDK wrappers, no decorators.
| Agent SDK | Hook Mechanism |
|---|---|
| OpenAI | httpx transport |
| AWS Bedrock | botocore events |
| Anthropic | SDK patching |
| Ollama | OpenAI-compatible |
Wire up your agent in `conftest.py` and agenteval captures every LLM call, tool call, and message — then runs evaluators on the trace.
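The wiring lives in an ordinary pytest fixture; a hypothetical `conftest.py` sketch (the `FakeAgent` stand-in and fixture construction are illustrative assumptions, not agenteval's documented interface):

```python
# conftest.py -- illustrative sketch; replace FakeAgent with your real agent.
import pytest

class FakeAgent:
    """Stand-in agent for demonstration; a real agent would call an LLM here."""
    def run(self, prompt):
        # Protocol-level interception would capture the LLM and tool calls
        # made inside this method; this fake just echoes the prompt.
        return {"output": f"echo: {prompt}", "trace": []}

@pytest.fixture
def agent():
    # Construct (or import) the agent under test; no decorators or SDK
    # wrappers are needed on the agent code itself.
    return FakeAgent()
```

Because interception happens below the SDK (httpx transport, botocore events, and so on), the fixture only has to build the agent — the capture is transparent.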
## Why Not DeepEval / TruLens / RAGAS / LangSmith?
| Feature | agenteval | DeepEval | TruLens | RAGAS | LangSmith |
|---|---|---|---|---|---|
| Multi-step agent trajectories | ✅ | Partial | ❌ | ❌ | ✅ |
| Framework-agnostic | ✅ | ✅ | ❌ | ❌ | ❌ |
| Protocol-level interception | ✅ | ❌ | ❌ | ❌ | ❌ |
| pytest native | ✅ | ✅ | ❌ | ❌ | ❌ |
| $0 local evals (Ollama) | ✅ | ❌ | ❌ | ❌ | ❌ |
| GitHub Action with PR bot | ✅ | ❌ | ❌ | ❌ | ❌ |
| MCP server | ✅ | ❌ | ❌ | ❌ | ❌ |
| Open source (MIT) | ✅ | ✅ | ✅ | ✅ | ❌ |
## Try It Now
```bash
pip install "agenteval-ai[all]"
agenteval init
pytest tests/agent_evals/ -v
```
- PyPI: https://pypi.org/project/agenteval-ai/
- License: MIT



