
Agentic Shell - a CLI agent adaptation layer

Dev.to / 2026/3/30

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key points

  • The author has open-sourced "Agentic Shell", a CLI adaptation layer that standardises how agent-style commands (e.g. from a cli-agent shell) are run across projects.
  • It explains how to wrap agentic research loops more deterministically, gating progress on training/validation output and removing the agent-driven "commit/end" decision.
  • The post contrasts this deterministic approach with workflows like autoresearch/Claude Code, which leave it to the agent's judgment whether results are favourable and whether to continue.
  • It notes that a good agentic harness enables long (multi-day) autonomous runs, and that interactive model participation may suit certain user workflows better.
  • The author emphasises that with each loop iteration the context window grows, with an increasing reliance on summarisation/compression to manage that growth over time.

Hey folks,

TLDR: Spent today writing an adaptation layer for cli-agent shell requests, having coded the same thing across multiple agents on several other projects, and open sourced it.

So since the advent of autoresearch (and, to be frank, way before then, when Tyson Fury taught many of us how to use coding agents), many of us have been experimenting with ways of running coding agent harnesses in deterministic frameworks, moving beyond the Wiggium Loop itself:

A wild wiggium appears

while true; do
    cat prompt.md | claude -p
done

As part of my own fork of autoresearch, I put a more deterministic wrapper around it, removing the agent's ability to decide whether or not to commit and instead basing that decision on the output of the model training and validation phases.
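As a minimal sketch of what that kind of gate could look like (the metric names, threshold, and git invocation here are illustrative assumptions, not taken from my fork):

```python
# Hypothetical sketch: commit only when the validation phase shows a
# measurable improvement, instead of asking the agent whether to commit.
import subprocess


def should_commit(prev_val_loss: float, new_val_loss: float,
                  min_delta: float = 1e-4) -> bool:
    """Deterministic gate: commit only on a measurable improvement."""
    return (prev_val_loss - new_val_loss) > min_delta


def commit_if_improved(prev_val_loss: float, new_val_loss: float) -> bool:
    """Stage and commit the working tree when the gate passes."""
    if not should_commit(prev_val_loss, new_val_loss):
        return False
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(
        ["git", "commit", "-m",
         f"val_loss {prev_val_loss:.4f} -> {new_val_loss:.4f}"],
        check=True,
    )
    return True
```

The point is simply that the "did this help?" question is answered by a number the harness computed, not by the agent's self-assessment.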

I should note here that Andrej has stated he prefers not to do this with his own autoresearch; I saw a tweet about this, and also heard him say he prefers an interactive approach during his interview with Sarah Guo.

If we take a look at the autoresearch sequence:

Figure 1.1 - Autoresearch Sequence Diagram

As we can see, we are relying on the agent's discretion before the loop decides to end. In addition to this, we are relying on the agent reliably determining that the results were indeed favourable.

For frontier models in a good agentic harness like Claude Code, this has allowed Karpathy to run the loop for approximately two days. This approach also allowed Karpathy to interact with the model during the research, which might suit certain workflows.

The other thing to consider here is that with each loop, the context window grows and we see more summarisation. I have found that summarisation/compression has come a long way; I often have development sessions with an AI that involve several compression cycles. However, if this is running autonomously, then you are really at the mercy of the agentic harness's compression configuration, and as such you might see summarisation at a time that is not ideal.

All the above assumes frontier models. Opus 4.6 or GPT 5.4 aren't cheap, and running loops 24/7 might not be within everyone's budget.

This is where I have started to explore running loops with local models via OpenCode. While the constraints of these models are diminishing, I doubt I'd be able to have one run in a loop for two days from a single prompt into OpenCode.

Instead, I am looking at a pattern that combines autoresearch with ralph, but with deterministic gates, limiting the agent's exposure to focused tasks within the workflow.

If we modified the approach detailed in Figure 1.1 to instead utilise a research_harness, we could put some deterministic gates around the loop itself and around whether the results of each iteration get committed.

Figure 1.2 - Autoresearch with a deterministic harness

We could write a script that runs in a loop, passes the prompt to a headless agent that reads prompt.md, then deterministically measures the results at the end and programmatically commits the changes to GitHub if they result in an improvement.
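A rough sketch of such a research_harness loop might look like the following. Only `claude -p` comes from the wiggium loop above; `evaluate()`, `MAX_ITERS`, and the revert-on-regression step are hypothetical placeholders for whatever your project measures:

```python
# Hypothetical sketch of a deterministic research_harness loop.
# evaluate() is a placeholder for your training/validation phase.
import pathlib
import subprocess

MAX_ITERS = 10


def is_improvement(score: float, best: float) -> bool:
    """Deterministic gate: lower score wins, no agent discretion."""
    return score < best


def evaluate() -> float:
    """Placeholder: run validation and return a score (lower is better)."""
    raise NotImplementedError


def harness_loop() -> None:
    prompt = pathlib.Path("prompt.md").read_text()
    best = float("inf")
    for i in range(MAX_ITERS):
        # Headless call: the agent edits the repo but never decides to commit.
        subprocess.run(["claude", "-p", prompt], check=True)
        score = evaluate()
        if is_improvement(score, best):
            best = score
            subprocess.run(["git", "add", "-A"], check=True)
            subprocess.run(
                ["git", "commit", "-m", f"iter {i}: score {score:.4f}"],
                check=True,
            )
        else:
            # Discard the regression so the next iteration starts clean.
            subprocess.run(["git", "checkout", "--", "."], check=True)
```

Because each iteration re-reads prompt.md and the gate is pure measurement, the loop's stopping and committing behaviour is fully scriptable.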

Hopefully you can see where this is heading. Not only would we be adding a bit more determinism to the workflow, we would also have the option of having this loop become effectively a ralph loop (provided we didn't pass in a session id as part of the headless call).

Across several recent projects, I have coded up similar headless calls with differing agentic harnesses, to the point where I realised it was worth abstracting this into a single package I can use to handle the process.

So I've created a harness that allows you to call the agents headlessly and then receive responses as either an AgentResponse for synchronous calls or StreamEvent for asynchronous streaming:

from dataclasses import dataclass


@dataclass
class AgentResponse:
    response: str
    cost: float
    session_id: str | None = None


@dataclass
class StreamEvent:
    type: str
    content: str
    cost: float = 0.0
    duration: float = 0.0
    session_id: str | None = None

Add the PyPI package using your favourite PyPI-compatible package manager and then use it like this:

from agent_shell.shell import AgentShell
from agent_shell.models.agent import AgentType

shell = AgentShell(agent_type=AgentType.CLAUDE_CODE)

async for event in shell.stream(
    cwd="/path/to/project",
    prompt="Refactor the auth module",
    allowed_tools=["Read", "Edit", "Bash"],
    model="sonnet",
    effort="high",
    include_thinking=True,
):
    if event.type == "system":
        print(f"Session: {event.session_id}")
    else:
        print(f"[{event.type}] {event.content}")

Given that this is something I think other people will probably be looking to do, I have open sourced it as a project named agent-shell on GitHub, so feel free to use it yourselves. The repo has more examples covering different agent types and streaming versus non-streaming.

It currently supports Claude Code and OpenCode; I am going to be working on the outstanding CLI agents over the coming days.
