I had just finished a two-hour Claude Code session on ERPClaw. The invoicing workflow was coming together. Journal entries were generating, invoices were producing the right totals, and the OpenClaw integration was responding correctly. I closed the terminal and thought: I should check whether any of this broke anything on the accounting side.
Then I realized I had no way to check. There were no tests. There had never been any tests.
I opened the GL reconciliation logic and stared at it for a while. Claude had touched five files in that session. Any one of them could have introduced something subtle. I had no coverage to fall back on. The only way to verify was to mentally trace through the whole accounting flow by hand.
I did that. It took 45 minutes.
Everything was fine that time.
This Was Not the First Time
I had shipped the PHP Reddit API the same way. Claude wrote the core structure fast, the PSR compliance fell into place, the Laravel bridge worked on the first try. The PHP community picked it up. Real people were using it. No tests.
Same with SiteKit. Same pattern every project: Claude Code writes the code quickly, it works, you ship it, and somewhere in the back of your mind you know that "works now" is not the same as "will keep working."
ERPClaw is the most serious of these. It is an AI-native ERP system for the OpenClaw platform. Accounting, invoicing, inventory, payroll, tax, financial reporting -- 413 actions across 14 domains. When I say financial software, I mean software where a small bug in a journal entry can compound silently across hundreds of transactions and not announce itself until someone runs a report that should balance and does not.
I built all of that without tests.
The Incident That Changed My Approach
During an ERPClaw session, Claude refactored the journal entry creation logic for the invoicing workflow. The refactor was reasonable. The code ran cleanly. Invoices generated. Totals looked correct. I moved on to the next feature.
Two sessions later I was reviewing a financial report. The ledger was off. On multi-line invoices with compound tax rates, debits and credits were not balancing. Not by much -- but double-entry accounting does not have a "close enough." The ledger either balances or it does not.
Nothing had crashed. There was no error. The numbers were quietly wrong.
I traced it manually. The journal entry creation was distributing the compound tax calculation incorrectly across line items. It was a single logic error in one function. A unit test with a multi-line invoice and a compound tax rate would have caught it in three seconds.
Instead it took 45 minutes of manual tracing to find it.
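The test that would have caught it is short. Here is a sketch using a hypothetical journal-entry builder (`make_journal_entries` is illustrative, not ERPClaw's real API): build the entries for a multi-line invoice with compound tax rates and assert that the ledger balances exactly.

```python
from decimal import Decimal

def make_journal_entries(lines, tax_rates):
    """Hypothetical journal-entry builder (not ERPClaw's real API).
    Credits revenue per line, compounds each tax rate on the running
    base (tax on tax), credits the tax total, debits receivable."""
    entries = []
    total_tax = Decimal("0")
    for net in lines:
        entries.append(("revenue", "credit", net))
        base = net
        for rate in tax_rates:
            tax = (base * rate).quantize(Decimal("0.01"))
            total_tax += tax
            base += tax  # compound: next rate applies to net + prior tax
    entries.append(("tax_payable", "credit", total_tax))
    gross = sum(amt for _, side, amt in entries if side == "credit")
    entries.append(("receivable", "debit", gross))
    return entries

def test_multi_line_compound_tax_balances():
    entries = make_journal_entries(
        [Decimal("100.00"), Decimal("250.00"), Decimal("19.99")],
        [Decimal("0.05"), Decimal("0.07")],  # two compounding rates
    )
    debits = sum(a for _, side, a in entries if side == "debit")
    credits = sum(a for _, side, a in entries if side == "credit")
    assert debits == credits  # double-entry has no "close enough"
```

A buggy refactor that misallocates the compound tax across line items fails this assertion immediately instead of surfacing two sessions later in a report.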
The math became obvious: ERPClaw has 413 actions. That is 413 potential windows where a silent regression could sit undetected between when Claude writes the code and when I manually notice something is off. At some point, discipline stops being a viable strategy.
Why CLAUDE.md Instructions Do Not Solve This
I had tried the obvious fix. I put test instructions in CLAUDE.md. "Write tests after every file edit." Claude followed it sometimes, ignored it in long sessions, and there was nothing I could do to enforce it.
CLAUDE.md instructions are advisory. Claude reads them at the start of a session and applies them to the best of its ability. But in a complex multi-file session where Claude is focused on architecture, test generation falls off. It is not a bug in Claude -- it is how any attention-based system works under load.
The problem was structural. Tests required a separate act of will. Someone had to decide to write them. AI coding moves fast enough that "I'll do it after this task" means "I'll do it never."
So I Built tailtest
The fix was straightforward once I saw it clearly: hook into the file write event itself. Do not rely on Claude deciding to write tests. Make tests happen automatically as a consequence of Claude writing any file.
Claude Code has a PostToolUse hook. It fires after every tool call -- including every file write. tailtest uses that hook.
When Claude writes a file:
- tailtest fires
- It runs an intelligence filter (more on this below)
- If the file is worth testing, it generates test scenarios for the code that was just written
- It runs those tests immediately
- If everything passes: nothing. Silent. You keep working.
- If something fails: specific output, in the same session, while you still know what changed
The silence-on-pass decision was deliberate. If tailtest talks every time a test passes, you start ignoring it within a week. The only time it surfaces output is when something actually needs your attention. That is the only design that survives long-term use.
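The flow above can be sketched as a small hook handler. The payload fields (`tool_name`, `tool_input.file_path`) and the exit-code-2 convention for feeding stderr back to Claude follow Claude Code's hook documentation, but treat this as an illustration of the shape, not tailtest's actual source; `run_pytest` stands in for the generate-and-run step.

```python
import json
import subprocess
import sys

WRITE_TOOLS = {"Write", "Edit", "MultiEdit"}

def handle(payload, run_tests):
    """Return (exit_code, message). Silent (0, "") unless tests fail."""
    if payload.get("tool_name") not in WRITE_TOOLS:
        return 0, ""                    # not a file write: stay silent
    path = payload.get("tool_input", {}).get("file_path", "")
    if not path.endswith(".py"):        # this sketch only handles Python
        return 0, ""
    passed, report = run_tests(path)
    if passed:
        return 0, ""                    # silence on pass: keep working
    return 2, report                    # exit 2: failure is fed back to Claude

def run_pytest(path):
    """Stand-in for generate-and-run: here, just run the suite."""
    r = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return r.returncode == 0, r.stdout + r.stderr

# Wiring: Claude Code pipes the hook payload as JSON on stdin.
# code, msg = handle(json.load(sys.stdin), run_pytest)
# print(msg, file=sys.stderr); sys.exit(code)
```

Keeping `handle` pure (the test runner is injected) is what makes the hook itself testable, which matters for the recursive part below.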
The Intelligence Filter
Not every file Claude writes is worth testing. Config files are not. Schema migrations are not. Boilerplate index files are not. If tailtest ran on all of them, it would be noisy, slow, and would generate useless test output.
tailtest runs an intelligence filter before generating anything. It looks at the file extension, the path, and the content patterns to decide whether this is a file containing logic worth testing. Services, utilities, domain models, controllers, business logic -- these get tested. Configuration, migrations, generated files -- these get skipped.
This is not optional. Without filtering, the tool generates noise. Noise causes developers to turn it off. A turned-off testing tool does nothing.
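A minimal version of such a filter, with made-up heuristics rather than tailtest's real ones, might look like this: cheap path and extension checks first, then a content check for anything that actually declares logic.

```python
from pathlib import Path

# Illustrative heuristics only -- tailtest's real filter is richer.
SKIP_DIRS = {"migrations", "node_modules", "vendor", "dist", "build"}
SKIP_NAMES = {"__init__.py", "index.js", "index.ts", "conftest.py"}
CONFIG_SUFFIXES = {".json", ".yaml", ".yml", ".toml", ".ini", ".lock", ".md"}
LOGIC_HINTS = ("def ", "class ", "function ", "func ", "fn ")

def worth_testing(path, content):
    """Skip config, migrations, and boilerplate; keep files whose
    content actually declares logic worth generating tests for."""
    p = Path(path)
    if p.suffix in CONFIG_SUFFIXES:
        return False                  # config and docs: nothing to test
    if p.name in SKIP_NAMES:
        return False                  # boilerplate entry points
    if SKIP_DIRS & set(p.parts):
        return False                  # generated or vendored trees
    return any(hint in content for hint in LOGIC_HINTS)
```

The cheap checks run before the content scan so the common case (a config write) costs almost nothing.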
The Ramp-Up Scan
If you install tailtest on an existing project -- like ERPClaw, which had zero tests when I started -- you do not get hit with a thousand test generation runs on the first session. tailtest scans the codebase on first run, identifies files with no coverage, and queues them for gradual background testing. New edits get coverage immediately. Existing files get covered over time.
This matters for real projects. A cold-start that tries to test everything at once is not useful. Gradual coverage is.
The Recursive Part
tailtest now has 332 tests in its own test suite. A significant number of those were generated by tailtest itself while I was building it.
At one point, tailtest caught a bug in its own intelligence filter logic. I had not noticed anything wrong. It fired on a file, ran the tests, and returned a failure on an edge case I had introduced while refactoring the filter's file-type detection.
I considered that the correct moment to ship a testing tool. When your testing tool tests itself and catches its own regressions before you ship it, the concept is proven.
Who This Is For
If you write code with Claude Code and you know you should have tests: tailtest removes the decision. Tests happen. You do not have to remember, you do not have to prompt, you do not have to discipline yourself. Every edit gets a test run.
If you are building with Claude Code and testing feels complicated: tailtest generates the tests. You do not need to know how to write pytest or vitest. You install it, and from that session forward, new code gets tested. You see failures when they happen. You do not need to understand the test framework to benefit from the coverage.
Honest Limits
tailtest generates tests. It does not guarantee they are the right tests. For complex business logic -- the kind of multi-state, multi-entity logic that took domain expertise to design -- a human should review the generated test scenarios. tailtest gives you coverage. It does not replace test strategy.
There is also a token cost. Every PostToolUse invocation uses tokens. A typical session adds up to roughly $5 to your Claude usage. For production software with real users -- ERPClaw, the PHP Reddit API -- that is a rounding error on the cost of shipping a silent regression. For a hobby project you are not maintaining seriously, it is a real tradeoff. I am not going to pretend otherwise.
Languages and Install
Python (pytest), TypeScript and JavaScript (vitest, jest), Go, Rust, Ruby, Java, PHP.
claude plugin marketplace add avansaber/tailtest
claude plugin install tailtest@avansaber-tailtest
No config file. No setup. It detects your language and test runner automatically.
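Detection of this kind usually keys off marker files in the project root. A simplified sketch of the idea, not tailtest's actual logic:

```python
from pathlib import Path

# Marker file -> (language, default test runner). Illustrative only.
MARKERS = [
    ("pyproject.toml", ("python", "pytest")),
    ("go.mod", ("go", "go test")),
    ("Cargo.toml", ("rust", "cargo test")),
    ("Gemfile", ("ruby", "rspec")),
    ("pom.xml", ("java", "mvn test")),
    ("composer.json", ("php", "phpunit")),
    ("package.json", ("typescript/javascript", "vitest or jest")),
]

def detect(root):
    """Return (language, runner) for the first marker found, else None."""
    root = Path(root)
    for marker, result in MARKERS:
        if (root / marker).exists():
            return result
    return None
```

A real detector would also read `package.json` to choose between vitest and jest; the marker-file scan is the core of it.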
Source: https://github.com/avansaber/tailtest
Website: tailtest.com
Open source, MIT license, free.
Questions and issues go in the GitHub repo. I read them.



