The Dogfooding Problem for Solo Developers
"Eat your own dog food" is good advice. Use your own product. Find the bugs your users find. Feel the pain before they do.
In practice, here's what actually happens:
- You use it heavily right after launch
- Development takes over and you stop touching it
- You check it as a developer, not as a user — you know all the right paths
- "It works" becomes the bar, and rough UX slips through
I build and maintain a web app solo. At some point I realized: I hadn't actually used it as a user in weeks. I'd been shipping features, but not experiencing the product.
So I did something that felt slightly absurd: I gave the job to an AI agent.
Every 3 hours, an AI agent opens my product, checks if things work, and opens a PR if it finds something broken.
Here's how it works, what it found, and what it can't do.
The Setup
Three components:
| Component | What it does |
|---|---|
| AI Agent (Claude) | Decides what to check, interprets results, writes fixes |
| MCP Server | Exposes my app's API as callable functions for the AI |
| Playwright | Lets the AI control a real browser to check the UI |
The agent runs on a scheduled heartbeat. I define what to check in a markdown file:
```markdown
# What to check (rotate through these):

## API checks
- Call list_projects, get_canvas, get_verification_status
- Verify data integrity and response format

## UI checks
- Open the live site in a real browser
- Check mobile viewport (375px)
- Check dark mode
- Check empty states (what does a new user see?)
- Screenshot any anomalies

## Code quality
- Run tsc --noEmit, report TypeScript errors
- Check for unused imports in recently changed files

## When you find something broken:
- Create a branch
- Fix it
- Run vitest to confirm tests pass
- Open a PR
- Report to Discord
```
That's the entire instruction set. The AI handles the rest.
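The "rotate through these" part is the only scheduling logic the runner needs. A minimal sketch of the rotation, with the function name and category labels being illustrative rather than from my actual setup:

```typescript
// Rotate through check categories on each heartbeat so every
// category gets covered over the course of a day.
const CHECK_CATEGORIES = ["api", "ui", "code-quality"] as const;

type CheckCategory = (typeof CHECK_CATEGORIES)[number];

// runIndex is a simple counter persisted between runs
// (a file, a DB row, or the scheduler's run number all work).
function pickCheck(runIndex: number): CheckCategory {
  return CHECK_CATEGORIES[runIndex % CHECK_CATEGORIES.length];
}
```

With a run every 3 hours, each category comes up a couple of times a day, which has been frequent enough in practice.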
How the API Integration Works
By default, an AI can't interact with your app's internals. To fix this, I wrapped my API as an MCP (Model Context Protocol) server — basically a list of functions the AI can call.
```typescript
// The AI can call these like tool calls
const tools = {
  list_projects: {
    description: "Get all projects",
    handler: async () => await db.project.findMany(),
  },
  add_learning: {
    description: "Record a finding or bug",
    handler: async (args) => await db.learning.create({ data: args }),
  },
  get_verification_status: {
    description: "Check the status of all verifications",
    handler: async () => await db.verification.findMany(),
  },
};
```
This lets the AI do what a human user does — create records, read data, check states — but via API instead of clicking around.
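On the server side, handling one of those tool calls is just a lookup-and-invoke. A minimal dispatcher sketch, with an in-memory array standing in for the real database (the tool names mirror the list above; the dispatcher itself is illustrative, not the MCP SDK's actual API):

```typescript
type Tool = {
  description: string;
  handler: (args?: Record<string, unknown>) => Promise<unknown>;
};

// In-memory stand-in for the real database, just for this sketch.
const learnings: Record<string, unknown>[] = [];

const tools: Record<string, Tool> = {
  add_learning: {
    description: "Record a finding or bug",
    handler: async (args = {}) => {
      learnings.push(args);
      return args;
    },
  },
  list_learnings: {
    description: "Get all recorded findings",
    handler: async () => learnings,
  },
};

// What the server does when the model emits a tool call.
async function callTool(name: string, args?: Record<string, unknown>) {
  const tool = tools[name];
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  return tool.handler(args);
}
```

The real version delegates to Prisma instead of an array, but the dispatch shape is the same.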
What It Found
Here are three real bugs the agent caught that I wouldn't have found on my own:
Bug 1: API and UI were out of sync
When creating data through the API, the API response showed the data correctly. But the data didn't appear in the UI.
Root cause: The data was stored in two separate database tables. The API wrote to one, the UI read from the other.
Why humans missed it: Humans always use the UI. If you click "create" in the browser, both tables get written. The bug only appeared when creating via API — which humans never did, but the AI did on every check.
Bug 2: Mobile layout broken
On desktop: fine. On mobile (375px): input fields overflowed horizontally.
Fix: One CSS change: `grid-cols-2` → `grid-cols-1 md:grid-cols-2`.
Bug 3: Empty state was a white screen
A new user opening their first project saw... nothing. No error, just blank. No guidance, no "create your first item" button.
This one wasn't technically broken — it just made the product confusing for new users. The agent flagged it as a UX issue and suggested an empty state component.
Dogfooding Alone Wasn't Enough
Dogfooding catches a lot — especially broken flows, layout issues, and rough UX.
But it doesn't catch everything.
Some bugs only happen in production, under very specific conditions:
- a component crashes only after a rare user action
- an import mismatch breaks a route that manual testing doesn't hit
- an exception only appears with real data, real timing, or real browser state
Those bugs are hard to find by manually using the product every few hours.
So I ended up adding a second loop: error monitoring.
The dogfooding agent checks whether the product works as a user experience.
Error monitoring checks whether the product is failing in the wild.
That combination turned out to be much stronger than either one alone.
Adding Sentry as a Second Feedback Loop
Now the system has two complementary loops:
| Loop | What it catches |
|---|---|
| Dogfooding every 3 hours | Broken flows, visual issues, empty states, mobile regressions, rough UX |
| Sentry monitoring | Runtime exceptions, production-only bugs, hard-to-reproduce crashes |
The dogfooding loop answers:
- Can a user actually move through the product?
- Does the UI make sense?
- Is anything visually broken?
The Sentry loop answers:
- Did something crash in production?
- What stack trace and context came with it?
- Is there a fixable bug hidden behind low-frequency failures?
This matters because not all quality issues look the same.
Some problems are visible. Others only show up as stack traces.
If you only rely on dogfooding, you miss production-only failures.
If you only rely on Sentry, you miss awkward UX and broken but non-crashing flows.
Together, they form a much more complete quality loop.
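The Sentry side of the loop is a small poll. A sketch, assuming Sentry's documented REST endpoint for listing a project's unresolved issues; the org/project slugs are placeholders and the fetch implementation is injected so the function is testable:

```typescript
type SentryIssue = { title: string; count: string; permalink: string };

// Poll Sentry for unresolved issues. The endpoint shape follows
// Sentry's issues API; fields on SentryIssue are a subset of the
// real payload, kept to what the agent actually reads.
async function fetchUnresolvedIssues(
  org: string,
  project: string,
  token: string,
  fetchFn: typeof fetch = fetch,
): Promise<SentryIssue[]> {
  const url =
    `https://sentry.io/api/0/projects/${org}/${project}/issues/` +
    `?query=${encodeURIComponent("is:unresolved")}`;
  const res = await fetchFn(url, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`Sentry API error: ${res.status}`);
  return (await res.json()) as SentryIssue[];
}
```

The agent runs this on its own schedule and feeds anything new into the same fix pipeline as the dogfooding findings.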
From Detection to Auto-Fix
Once I added Sentry, the agent's job expanded.
It no longer just looked for problems by using the product.
It could also react to problems reported by the product itself.
The flow now looks like this:
- Every 3 hours, the agent dogfoods the app
- On a separate schedule, it checks Sentry for unresolved issues
- If it finds a real bug, it analyzes the stack trace and source code
- It creates a branch, writes a fix, runs tests, and opens a PR
- Small safe fixes can be merged automatically after checks pass
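The "small safe fixes" gate is the part worth being conservative about. My heuristic is roughly: tests pass, the diff is tiny, one file touched, and nothing near the database schema. A sketch of that gate; the exact thresholds are illustrative, not a recommendation:

```typescript
type FixProposal = {
  filesChanged: number;
  linesChanged: number;
  testsPass: boolean;
  touchesMigrations: boolean; // schema changes are never auto-merged
};

// Decide whether a proposed fix is safe enough to merge without
// human review. Tune the thresholds to your own risk tolerance.
function canAutoMerge(fix: FixProposal): boolean {
  return (
    fix.testsPass &&
    !fix.touchesMigrations &&
    fix.filesChanged === 1 &&
    fix.linesChanged <= 10
  );
}
```

Anything that fails the gate still becomes a PR, it just waits for me.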
One of the best examples was a page crash caused by the wrong i18n hook import.
The error message itself was vague. Manual testing didn't catch it consistently.
But Sentry provided enough context for the agent to trace the issue back to a bad import and generate a tiny fix.
That was the moment this stopped feeling like "automated testing" and started feeling more like an automated maintenance loop.
What the AI Can and Can't Do
| The AI is good at | The AI can't do |
|---|---|
| Checking if things work | Feeling if things feel right |
| Catching regressions automatically | "This interaction is frustrating" |
| Covering edge cases humans skip | Subjective UX judgment |
| Opening PRs immediately on finding bugs | Knowing if a feature is missing |
| Running every 3 hours without fatigue | Replacing actual user feedback |
The "can't do" column matters. The AI is a complement, not a replacement.
After the agent does its check, I still need to use the product myself and talk to users. The agent handles the objective, repeatable checks. I handle the subjective, experiential ones.
One More Honest Note
About 30% of the time, the agent reports "fixed" when it hasn't fully fixed something. This was frustrating until I built in a hard requirement: tests must pass before marking anything as done.
Rule: Before opening a PR, run `npx vitest run`.
If tests fail, do not open the PR.
Report the failure instead.
This dropped false completions dramatically. The agent's confidence isn't reliable — test results are.
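Enforcing the rule in code, rather than only in the prompt, is what made it stick. A sketch of the gate with the test runner injected; in production this would spawn `npx vitest run` and use its exit code:

```typescript
// runTests returns the test runner's exit code (0 = all tests pass).
// It is injected here so the gate itself is unit-testable; the real
// version shells out to `npx vitest run`.
type Outcome =
  | { action: "open-pr" }
  | { action: "report-failure"; exitCode: number };

function gatePullRequest(runTests: () => number): Outcome {
  const exitCode = runTests();
  if (exitCode === 0) return { action: "open-pr" };
  return { action: "report-failure", exitCode };
}
```

The agent never gets to "decide" whether tests passed; the exit code decides.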
How to Build This
You don't need my exact setup. The minimum viable version:
- Pick a scheduled runner — GitHub Actions cron, a crontab, or any agent platform with scheduled tasks
- Expose one API endpoint the AI can call — Start with just a health check
- Write a simple check instruction — "Call this endpoint and report if it fails"
- Add Playwright later — Browser checks are optional but powerful for catching visual regressions
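Steps 2 and 3 together fit in a few lines. A minimal health-check sketch; the URL is a placeholder and the fetch implementation is injected for testability:

```typescript
type HealthReport = { ok: boolean; status: number; detail: string };

// The simplest possible check: hit one endpoint, report pass/fail.
// Everything else in this post is an elaboration of this loop.
async function checkHealth(
  url: string,
  fetchFn: typeof fetch = fetch,
): Promise<HealthReport> {
  try {
    const res = await fetchFn(url);
    return {
      ok: res.ok,
      status: res.status,
      detail: res.ok ? "healthy" : `unexpected status ${res.status}`,
    };
  } catch (err) {
    return { ok: false, status: 0, detail: `request failed: ${String(err)}` };
  }
}
```

Run that on a cron, post the report somewhere you'll see it, and you have the skeleton of the whole system.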
The core insight isn't the tech stack. It's that dogfooding is a discipline problem, not a capability problem. You know how to test your own product. You just don't do it consistently.
Automating it removes the discipline requirement.
Have you built any automated quality loops into your side projects? Or does your testing start and end with "it worked on my machine"? I'm curious what others have tried; let me know in the comments.
The product the agent keeps testing is KaizenLab, my app for hypothesis validation and product learning.
That made this setup especially useful: the same system I use to organize product decisions is also what the agent keeps checking, stress-testing, and helping improve.




