I think token budgets are becoming part of agent workflow design.
If every run feels expensive, people under-test. They save quota, overthink prompts, and avoid the repetition that reveals failure modes.
If every run feels cheap, people can over-delegate. They generate more output than they can review.
So the useful question is not "which model is best?"
It is:
Which step deserves which level of model?
My current rule:
- cheap / lower-reasoning runs for bounded, reviewable repetition
- stronger models for ambiguity, hard judgment, debugging, and review
- human review for acceptance
Do not spend premium reasoning on an unclear task.
First make the task smaller.
Then choose the model.
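The routing rule above can be sketched as a tiny dispatch function. This is a minimal illustration, not a real API: the tier names, the `Task` fields, and `pick_model` are all hypothetical placeholders you would replace with your provider's model IDs and your own task metadata.

```python
from dataclasses import dataclass

# Hypothetical tier names -- substitute your provider's actual model IDs.
CHEAP = "small-fast-model"
STRONG = "large-reasoning-model"

@dataclass
class Task:
    description: str
    bounded: bool    # output is easy to check mechanically or by a quick scan
    ambiguous: bool  # needs judgment, debugging, or review

def pick_model(task: Task) -> str:
    """Route a task to a model tier following the rule above."""
    if task.ambiguous:
        return STRONG  # ambiguity, hard judgment, debugging, review
    if task.bounded:
        return CHEAP   # bounded, reviewable repetition
    # Unclear and unbounded: do not escalate -- shrink the task first.
    raise ValueError("Task is unclear; split it into smaller bounded steps first.")

print(pick_model(Task("rename a field across 40 files", bounded=True, ambiguous=False)))
print(pick_model(Task("diagnose a flaky integration test", bounded=False, ambiguous=True)))
```

The `ValueError` branch encodes the last point: an unclear task gets decomposed by a human, not thrown at a premium model.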