Skills as invocation contracts, not code: how I keep review authority over agent work

Dev.to / 4/19/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The article argues that “skills” for AI agents should be treated as invocation contracts (e.g., SKILL.md) rather than code, so reviewers verify behavior at the contract level instead of inspecting every generated implementation detail.
  • It describes a real setup (claude-code-mcp-qa-automation) where 16 skills are stored as pure Markdown, while the underlying Python implementation can be swapped or refactored without requiring contract re-review.
  • It explains the four sections of a SKILL.md—frontmatter, inputs, delegated work, and distinguished failure modes—and emphasizes that these sections define expected inputs, behavior, and error handling in a reviewable way.
  • The author contrasts two scaling models: code-as-skill scales poorly because human review must re-read code for each change, while markdown-as-skill scales better because the contract remains stable even when implementations are regenerated by agents.
  • Overall, the approach aims to keep review authority with humans over agent behavior while enabling multiple agents to operate on the same skill surface efficiently.

When a skill is code, you review code. When a skill is a markdown contract, you review the contract — and the implementation gets re-typed by an agent under it. One scales to dozens of agents operating on the same surface. The other collapses the moment the implementation needs to be re-read by a human.

This is the pattern I run in claude-code-mcp-qa-automation. Sixteen skills live as pure markdown files. The Python implementation underneath can be swapped — rewritten, refactored, replaced — without touching the skill surface. Review authority stays where it belongs: on the contracts, not on every generated line.

What a SKILL.md actually contains

A skill file is not documentation. It is an invocation contract. Four sections, each with a specific purpose.

Frontmatter. Name, description, one or two tags. This is what the agent loader uses to select and index the skill. The description is read by the system, not a human — it is optimized for match quality, not prose quality.

Inputs. What the skill expects to receive. Types, shapes, required fields, failure conditions on missing inputs. If a caller sends a malformed input, the skill's behavior is determined here, not in the implementation.

Work delegated. What the skill does. Not how. The contract names the mechanism at the level a reviewer needs to understand: "pulls the sprint's tickets from Jira, aggregates status transitions into the trending store, writes one row per (ticket, day)." The implementation details — which Python library, which HTTP client, which pagination strategy — live in code, not in the contract.

Failure modes distinguished. The specific failure types the skill surfaces separately: transient Jira error, auth failure, schema mismatch, rate-limited retries exhausted. Each with its own handler and its own log shape.
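A minimal sketch of what those four sections look like on disk. The skill name, input fields, and failure-mode labels here are illustrative stand-ins, not copied from the repo:

```markdown
---
name: sprint-trending
description: Aggregate one sprint's Jira status transitions into the trending store
tags: [jira, trending]
---

## Inputs
- board (string, required) — Jira board key; fail fast with `input-missing` if absent
- sprint_id (int, required) — active sprint; reject closed sprints with `input-invalid`

## Work delegated
Pulls the sprint's tickets from Jira, aggregates status transitions into the
trending store, writes one row per (ticket, day).

## Failure modes
- transient-jira-error — retry with backoff, then surface
- auth-failure — no retry, page the owner
- schema-mismatch — halt the write, log the offending payload shape
- retries-exhausted — surface with the last transient error attached
```

Everything a reviewer needs to approve the behavior is in those ~20 lines; nothing in them pins the implementation.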

A reviewer reads those four sections. If they match the intended behavior, the skill passes review. The implementation underneath gets swapped on whatever cadence the team needs — daily, if the agent is rewriting it — without triggering another review cycle, because the contract did not change.

Why this scales and code-skills do not

Two patterns, two scaling curves.

Code-as-skill. Each skill is a Python module or a set of functions. A reviewer has to read the code to understand the behavior. Adding a new skill means writing, testing, and reviewing more code. Swapping implementation means rewriting the review. The scaling bottleneck is human review capacity.

Markdown-as-skill. Each skill is a contract. A reviewer reads the contract. Adding a new skill means writing a contract. The implementation is produced under the contract by whichever agent or engineer is fastest at that specific stack. Swapping implementation means regenerating under the same contract.

The second pattern survives agent re-typing. An LLM that regenerates the implementation cannot change the contract without the reviewer noticing. An LLM that regenerates code-as-skill silently changes the surface, and the reviewer has to catch it in the diff — which is the failure mode the agentic-engineering discipline calls blind diff-accepting.

Sub-agent orchestration under the contract

The QA-automation pipeline has a coordinator that fans out work to sub-agents. The sub-agents operate per-board (one per Jira board, in parallel). The coordinator aggregates their output into a unified report.

In production, the coordinator would use Claude Code's Agent tool. In the reference implementation, the coordinator uses ThreadPoolExecutor as a structurally identical stand-in. Same fan-out shape, same aggregation semantics, same deterministic output — but the reference runs without requiring live Claude calls, which makes it reviewable end-to-end on any CI runner.

The skill file for the coordinator names the contract: inputs are a list of boards, outputs are one report per board plus one roll-up, failure of any sub-agent does not fail the whole fan-out. The implementation (ThreadPoolExecutor or Agent tool) is a detail that can change without the contract changing.
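The fan-out contract can be sketched in a few lines. This is a hypothetical stand-in, not the repo's code — the board names, report shape, and `run_board` body are invented to show one property: a failing sub-agent lands in the failure map instead of sinking the whole run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_board(board: str) -> dict:
    # Illustrative sub-agent work; "OPS" simulates one sub-agent failing.
    if board == "OPS":
        raise RuntimeError("transient Jira error")
    return {"board": board, "tickets_seen": 42}

def fan_out(boards: list[str]) -> dict:
    per_board, failures = {}, {}
    with ThreadPoolExecutor(max_workers=len(boards)) as pool:
        futures = {pool.submit(run_board, b): b for b in boards}
        for fut in as_completed(futures):
            board = futures[fut]
            try:
                per_board[board] = fut.result()
            except Exception as exc:
                # Contract clause: one sub-agent failure never fails the fan-out.
                failures[board] = str(exc)
    return {"reports": per_board, "failures": failures}

result = fan_out(["ENG", "OPS", "DATA"])
```

Swapping `ThreadPoolExecutor` for the Agent tool changes only the interior of `fan_out`; the input list, the per-board reports, and the failure isolation stay contract-stable.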

Flag-gated, config-driven execution

Every behavior that could be on or off lives in config/flags.yaml with global and board-scoped overrides. No inline if FEATURE_FOO: toggles in the Python.

```yaml
flags:
  global:
    enable_trending: true
    slack_digest: true
    include_closed_tickets: false
  boards:
    ENG:
      include_closed_tickets: true
    OPS:
      slack_digest: false
```
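The override precedence implied by that shape — board-scoped value wins, global is the fallback — could be resolved with a helper like this. The function name is hypothetical, and the config is shown as a dict literal to keep the sketch stdlib-only; the real pipeline would load it from config/flags.yaml:

```python
# Mirrors the flags.yaml shape above as a plain dict for illustration.
FLAGS = {
    "global": {
        "enable_trending": True,
        "slack_digest": True,
        "include_closed_tickets": False,
    },
    "boards": {
        "ENG": {"include_closed_tickets": True},
        "OPS": {"slack_digest": False},
    },
}

def resolve_flag(flags: dict, board: str, name: str) -> bool:
    # Board-scoped override wins; otherwise fall back to the global default.
    board_scope = flags.get("boards", {}).get(board, {})
    if name in board_scope:
        return board_scope[name]
    return flags["global"][name]
```

With this precedence, ENG includes closed tickets while OPS keeps the global default, and OPS skips the Slack digest while ENG keeps it on.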

Regression debugging starts with flipping a flag and re-running, not with a code spelunk. A bug report comes in: "the OPS digest is missing since Tuesday." The first check is config/flags.yaml — was the flag flipped? If yes, that is the cause. If no, the bug is in the code path, which is the second check.

This separation is what makes the pipeline auditable. The config is source-controlled. Every flag flip is a commit. The diff between "what produced last Friday's report" and "what produced today's report" is always visible.

Deterministic, reviewable output

The reports themselves are single HTML files with inline CSS, zero JavaScript, no external references. That constraint is not cosmetic. It is what makes them reviewable offline, archivable, diffable byte-for-byte under identical flags.

```shell
./pipeline.py --config=flags.yaml --board=ENG > run1.html
./pipeline.py --config=flags.yaml --board=ENG > run2.html
sha256sum run1.html run2.html
# should print identical hashes
```

If the two hashes differ, something in the pipeline is non-deterministic and needs to be found. Most non-determinism comes from incidental sources: dict iteration order in older Python versions, timestamps in the output, random sampling in pagination. Each one has a fix. The constraint forces the fix rather than allowing the non-determinism to hide.
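Two of those fixes — pinning iteration order and injecting timestamps instead of reading the clock — can be shown in a toy renderer. This is a minimal sketch, not the repo's report generator; the row data and HTML shape are invented:

```python
import hashlib

def render_report(rows: dict[str, int], generated_at: str) -> str:
    # sorted() pins row order so dict insertion order cannot leak into the bytes;
    # generated_at is a parameter so the clock never touches the output.
    body = "".join(
        f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in sorted(rows.items())
    )
    return f"<html><body><p>{generated_at}</p><table>{body}</table></body></html>"

data = {"ENG": 12, "OPS": 7}
# Second run feeds the same rows in reversed insertion order.
run1 = render_report(data, generated_at="2026-04-17")
run2 = render_report(dict(reversed(list(data.items()))), generated_at="2026-04-17")

h1 = hashlib.sha256(run1.encode()).hexdigest()
h2 = hashlib.sha256(run2.encode()).hexdigest()
```

Because the renderer controls both ordering and time, the two hashes match byte-for-byte — exactly the property the sha256sum check above asserts on the real pipeline.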

Determinism is what makes the output a compliance-grade artifact. A report that is not reproducible is a report a stakeholder cannot trust. A report that is reproducible is one that can be regenerated from source at any future point, which is the actual definition of "reviewable."

How this connects to Claude-Code skills in general

The skill pattern above is specific to the QA-automation surface, but the mechanics generalize. A skill in Claude Code is a named, addressable contract the agent loader selects from. The best skill files are the ones a reviewer can read in under a minute and the agent can instantiate without ambiguity.

Three properties decide whether a skill is good or not:

1. Invocation-tight. The skill's description tells the loader precisely when to fire and when not to. Loose descriptions produce the wrong skill firing at the wrong moment, which is the most common skill-related bug.

2. Implementation-free. The skill contract does not name Python modules, specific libraries, or internal file paths. Those are details the implementation owns. A skill that references implementation specifics is one that cannot be reimplemented without rewriting the contract.

3. Failure-mode-explicit. The contract names the failure modes the skill surfaces, separately from the success path. A reviewer who does not see failure modes in the contract knows the skill is incomplete, regardless of how clean the success path reads.
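For the first property, the gap between a loose and an invocation-tight description is concrete. A hypothetical contrast, not taken from the repo:

```yaml
# Loose — fires on almost any Jira-adjacent request:
description: Works with Jira data

# Tight — tells the loader when to fire and when not to:
description: Aggregate one sprint's Jira status transitions into the
  trending store. Use only when a board key and sprint id are supplied;
  not for ad-hoc ticket queries.
```

The tight version costs two extra lines and removes the most common skill-related bug at its source.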

Skills that satisfy all three scale across agents, across implementations, across time. Skills that satisfy fewer decay quickly into code-as-skill, at which point the review bottleneck reappears and the markdown layer stops earning its keep.

What you do with this pattern

If you are building on Claude Code at any scale, the skills directory is the highest-payoff surface. Two moves that compound:

  • Audit one existing skill for the three properties above. If any property fails, rewrite the contract before adding the next skill. One rewritten contract is worth three new skills with inherited drift.
  • Separate config from behavior. Move every toggle, every threshold, every environment-specific value out of the code and into a flag file. The next regression hunt runs in five minutes instead of fifty.

Both moves compound across the agent ecosystem you build on top. The QA-automation repo is one concrete shape of the pattern; the shape generalizes to any Claude Code surface where review authority is the scarce resource.

Aman Bhandari. Operator of an AI-engineering research lab running Claude Opus as the coaching partner, plus a QA-automation surface shipping against a real sprint workload. Public artifacts: claude-code-agent-skills-framework and claude-code-mcp-qa-automation. github.com/aman-bhandari.