I had just finished a two-hour Claude Code session on ERPClaw. The invoicing workflow was coming together. Journal entries were generating, invoices were producing the right totals, and the OpenClaw integration was responding correctly. I closed the terminal and thought: I should check whether any of this broke anything on the accounting side.
Then I realized I had no way to check. There were no tests. There had never been any tests.
I opened the GL reconciliation logic and stared at it for a while. Claude had touched five files in that session. Any one of them could have introduced something subtle. I had no coverage to fall back on. The only way to verify was to mentally trace through the whole accounting flow by hand.
I did that. It took 45 minutes.
Everything was fine that time.
This Was Not the First Time
I had shipped the PHP Reddit API the same way. Claude wrote the core structure fast, the PSR compliance fell into place, the Laravel bridge worked on the first try. The PHP community picked it up. Real people were using it. No tests.
Same with SiteKit. Same pattern every project: Claude Code writes the code quickly, it works, you ship it, and somewhere in the back of your mind you know that "works now" is not the same as "will keep working."
ERPClaw is the most serious of these. It is an AI-native ERP system for the OpenClaw platform. Accounting, invoicing, inventory, payroll, tax, financial reporting -- 413 actions across 14 domains. When I say financial software, I mean software where a small bug in a journal entry can compound silently across hundreds of transactions and not announce itself until someone runs a report that should balance and does not.
I built all of that without tests.
The Incident That Changed My Approach
During an ERPClaw session, Claude refactored the journal entry creation logic for the invoicing workflow. The refactor was reasonable. The code ran cleanly. Invoices generated. Totals looked correct. I moved on to the next feature.
Two sessions later I was reviewing a financial report. The ledger was off. On multi-line invoices with compound tax rates, debits and credits were not balancing. Not by much -- but double-entry accounting does not have a "close enough." The ledger either balances or it does not.
Nothing had crashed. There was no error. The numbers were quietly wrong.
I traced it manually. The journal entry creation was distributing the compound tax calculation incorrectly across line items. It was a single logic error in one function. A unit test with a multi-line invoice and a compound tax rate would have caught it in three seconds.
Instead it took 45 minutes of manual tracing to find it.
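The test that would have caught it is short. Here is a sketch using a hypothetical journal-entry builder (`make_journal_entries` is illustrative, not ERPClaw's real API): build the entries for a multi-line invoice with compound tax rates and assert that the ledger balances exactly.

```python
from decimal import Decimal

def make_journal_entries(lines, tax_rates):
    """Hypothetical journal-entry builder (not ERPClaw's real API).
    Credits revenue per line, compounds each tax rate on the running
    base (tax on tax), credits the tax total, debits receivable."""
    entries = []
    total_tax = Decimal("0")
    for net in lines:
        entries.append(("revenue", "credit", net))
        base = net
        for rate in tax_rates:
            tax = (base * rate).quantize(Decimal("0.01"))
            total_tax += tax
            base += tax  # compound: next rate applies to net + prior tax
    entries.append(("tax_payable", "credit", total_tax))
    gross = sum(amt for _, side, amt in entries if side == "credit")
    entries.append(("receivable", "debit", gross))
    return entries

def test_multi_line_compound_tax_balances():
    entries = make_journal_entries(
        [Decimal("100.00"), Decimal("250.00"), Decimal("19.99")],
        [Decimal("0.05"), Decimal("0.07")],  # two compounding rates
    )
    debits = sum(a for _, side, a in entries if side == "debit")
    credits = sum(a for _, side, a in entries if side == "credit")
    assert debits == credits  # double-entry has no "close enough"
```

A buggy refactor that misallocates the compound tax across line items fails this assertion immediately instead of surfacing two sessions later in a report.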
The math became obvious: ERPClaw has 413 actions. That is 413 potential windows where a silent regression could sit undetected between when Claude writes the code and when I manually notice something is off. At some point, discipline stops being a viable strategy.
Why CLAUDE.md Instructions Do Not Solve This
I had tried the obvious fix. I put test instructions in CLAUDE.md. "Write tests after every file edit." Claude followed it sometimes, ignored it in long sessions, and there was nothing I could do to enforce it.
CLAUDE.md instructions are advisory. Claude reads them at the start of a session and applies them to the best of its ability. But in a complex multi-file session where Claude is focused on architecture, test generation falls off. It is not a bug in Claude -- it is how any attention-based system works under load.
The problem was structural. Tests required a separate act of will. Someone had to decide to write them. AI coding moves fast enough that "I'll do it after this task" means "I'll do it never."
So I Built tailtest
The fix was straightforward once I saw it clearly: hook into the file write event itself. Do not rely on Claude deciding to write tests. Make tests happen automatically as a consequence of Claude writing any file.
Claude Code has a PostToolUse hook. It fires after every tool call -- including every file write. tailtest uses that hook.
When Claude writes a file:
- tailtest fires
- It runs an intelligence filter (more on this below)
- If the file is worth testing, it generates test scenarios for the code that was just written
- It runs those tests immediately
- If everything passes: nothing. Silent. You keep working.
- If something fails: specific output, in the same session, while you still know what changed
The silence-on-pass decision was deliberate. If tailtest talks every time a test passes, you start ignoring it within a week. The only time it surfaces output is when something actually needs your attention. That is the only design that survives long-term use.
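The flow above can be sketched as a small hook handler. The payload fields (`tool_name`, `tool_input.file_path`) and the exit-code-2 convention for feeding stderr back to Claude follow Claude Code's hook documentation, but treat this as an illustration of the shape, not tailtest's actual source; `run_pytest` stands in for the generate-and-run step.

```python
import json
import subprocess
import sys

WRITE_TOOLS = {"Write", "Edit", "MultiEdit"}

def handle(payload, run_tests):
    """Return (exit_code, message). Silent (0, "") unless tests fail."""
    if payload.get("tool_name") not in WRITE_TOOLS:
        return 0, ""                    # not a file write: stay silent
    path = payload.get("tool_input", {}).get("file_path", "")
    if not path.endswith(".py"):        # this sketch only handles Python
        return 0, ""
    passed, report = run_tests(path)
    if passed:
        return 0, ""                    # silence on pass: keep working
    return 2, report                    # exit 2: failure is fed back to Claude

def run_pytest(path):
    """Stand-in for generate-and-run: here, just run the suite."""
    r = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return r.returncode == 0, r.stdout + r.stderr

# Wiring: Claude Code pipes the hook payload as JSON on stdin.
# code, msg = handle(json.load(sys.stdin), run_pytest)
# print(msg, file=sys.stderr); sys.exit(code)
```

Keeping `handle` pure (the test runner is injected) is what makes the hook itself testable, which matters for the recursive part below.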
The Intelligence Filter
Not every file Claude writes is worth testing. Config files are not. Schema migrations are not. Boilerplate index files are not. If tailtest ran on all of them, it would be noisy, slow, and would generate useless test output.
tailtest runs an intelligence filter before generating anything. It looks at the file extension, the path, and the content patterns to decide whether this is a file containing logic worth testing. Services, utilities, domain models, controllers, business logic -- these get tested. Configuration, migrations, generated files -- these get skipped.
This is not optional. Without filtering, the tool generates noise. Noise causes developers to turn it off. A turned-off testing tool does nothing.
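A minimal version of such a filter, with made-up heuristics rather than tailtest's real ones, might look like this: cheap path and extension checks first, then a content check for anything that actually declares logic.

```python
from pathlib import Path

# Illustrative heuristics only -- tailtest's real filter is richer.
SKIP_DIRS = {"migrations", "node_modules", "vendor", "dist", "build"}
SKIP_NAMES = {"__init__.py", "index.js", "index.ts", "conftest.py"}
CONFIG_SUFFIXES = {".json", ".yaml", ".yml", ".toml", ".ini", ".lock", ".md"}
LOGIC_HINTS = ("def ", "class ", "function ", "func ", "fn ")

def worth_testing(path, content):
    """Skip config, migrations, and boilerplate; keep files whose
    content actually declares logic worth generating tests for."""
    p = Path(path)
    if p.suffix in CONFIG_SUFFIXES:
        return False                  # config and docs: nothing to test
    if p.name in SKIP_NAMES:
        return False                  # boilerplate entry points
    if SKIP_DIRS & set(p.parts):
        return False                  # generated or vendored trees
    return any(hint in content for hint in LOGIC_HINTS)
```

The cheap checks run before the content scan so the common case (a config write) costs almost nothing.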
The Ramp-Up Scan
If you install tailtest on an existing project -- like ERPClaw, which had zero tests when I started -- you do not get hit with a thousand test generation runs on the first session. tailtest scans the codebase on first run, identifies files with no coverage, and queues them for gradual background testing. New edits get coverage immediately. Existing files get covered over time.
This matters for real projects. A cold-start that tries to test everything at once is not useful. Gradual coverage is.
The Recursive Part
tailtest now has 332 tests in its own test suite. A significant number of those were generated by tailtest itself while I was building it.
At one point, tailtest caught a bug in its own intelligence filter logic. I had not noticed anything wrong. It fired on a file, ran the tests, and returned a failure on an edge case I had introduced while refactoring the filter's file-type detection.
I considered that the correct moment to ship a testing tool. When your testing tool tests itself and catches its own regressions before you ship it, the concept is proven.
Who This Is For
If you write code with Claude Code and you know you should have tests: tailtest removes the decision. Tests happen. You do not have to remember, you do not have to prompt, you do not have to discipline yourself. Every edit gets a test run.
If you are building with Claude Code and testing feels complicated: tailtest generates the tests. You do not need to know how to write pytest or vitest. You install it, and from that session forward, new code gets tested. You see failures when they happen. You do not need to understand the test framework to benefit from the coverage.
Honest Limits
tailtest generates tests. It does not guarantee they are the right tests. For complex business logic -- the kind of multi-state, multi-entity logic that took domain expertise to design -- a human should review the generated test scenarios. tailtest gives you coverage. It does not replace test strategy.
There is also a token cost. Every PostToolUse invocation uses tokens. A typical session adds up to roughly $5 to your Claude usage. For production software with real users -- ERPClaw, the PHP Reddit API -- that is a rounding error on the cost of shipping a silent regression. For a hobby project you are not maintaining seriously, it is a real tradeoff. I am not going to pretend otherwise.
Languages and Install
Python (pytest), TypeScript and JavaScript (vitest, jest), Go, Rust, Ruby, Java, PHP.
claude plugin marketplace add avansaber/tailtest
claude plugin install tailtest@avansaber-tailtest
No config file. No setup. It detects your language and test runner automatically.
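Detection of this kind usually keys off marker files in the project root. A simplified sketch of the idea, not tailtest's actual logic:

```python
from pathlib import Path

# Marker file -> (language, default test runner). Illustrative only.
MARKERS = [
    ("pyproject.toml", ("python", "pytest")),
    ("go.mod", ("go", "go test")),
    ("Cargo.toml", ("rust", "cargo test")),
    ("Gemfile", ("ruby", "rspec")),
    ("pom.xml", ("java", "mvn test")),
    ("composer.json", ("php", "phpunit")),
    ("package.json", ("typescript/javascript", "vitest or jest")),
]

def detect(root):
    """Return (language, runner) for the first marker found, else None."""
    root = Path(root)
    for marker, result in MARKERS:
        if (root / marker).exists():
            return result
    return None
```

A real detector would also read `package.json` to choose between vitest and jest; the marker-file scan is the core of it.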
Source: https://github.com/avansaber/tailtest
Website: tailtest.com
Open source, MIT license, free.
Questions and issues go in the GitHub repo. I read them.



