A boy and his dog.

Dev.to / 4/22/2026


Key Points

  • The author argues that for small (around 50M-parameter) models, data quality is the dominant factor in performance.
  • They describe training “Scout” on TinyStories and using prompt probes at checkpoints (including 12.8k, 20k, and 30k steps) to observe what the model is learning over time.
  • By integrating Claude Code, they were able to run recurring prompt-probe-based quality checks and turn a qualitative assessment into a quantitative metric.
  • The combined qualitative signal from Claude plus training loss curves helped the author decide that training was “done enough” at 40,000 steps.
  • Early checkpoint samples show the story gradually improving but with specific narrative/wording issues (and at one point a loss spike) as training progresses.

I can't speak to the development of frontier models, but at 50M parameters data quality is everything. Scout finished her training run on TinyStories at the 40,000-step mark. Prompt probes at checkpoints along the way gave me a window into what she was learning, and integrating Claude Code was surprisingly helpful: giving a language model a recurring task to run a prompt-probe-based quality check every 500 or so training steps turned a qualitative judgment into a quantitative metric. That metric from Claude, together with a graph of the training loss, finally led me to call it "done enough" at 40,000 steps. Let's walk through the results together:
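The recurring probe-and-score loop might look something like the sketch below. Both `generate_sample` and `judge_quality` are hypothetical stand-ins, not the author's actual code: in the real setup, generation comes from a Scout checkpoint and the scoring is delegated to Claude Code as a recurring task.

```python
# Sketch of a recurring prompt-probe quality check.
# `generate_sample` and `judge_quality` are hypothetical callables:
# in practice, generation comes from the model checkpoint and
# judging is delegated to a stronger language model.

PROBE_PROMPT = "Once upon a time there was a boy and his dog."

def probe_checkpoint(step, generate_sample, judge_quality):
    """Generate a completion at this checkpoint and score it."""
    sample = generate_sample(PROBE_PROMPT)
    score = judge_quality(PROBE_PROMPT, sample)  # qualitative -> quantitative
    return {"step": step, "sample": sample, "score": score}

def run_probes(steps, generate_sample, judge_quality, every=500):
    """Run a probe at every `every`-step checkpoint."""
    return [
        probe_checkpoint(s, generate_sample, judge_quality)
        for s in steps
        if s % every == 0
    ]
```

The point of the wrapper is that the judge's score lands next to the step number and the raw sample, so the qualitative trend can be plotted right alongside the loss curve.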

12,800 training steps

At 12,800 training steps Scout can remember both the boy and the dog, but is the dog a friend or a monster? It's not clear. I don't think Scout knows either. There's a bit of repetition, e.g. "He ran after it and ran away," which is technically correct English but sounds odd.

[You] Once upon a time there was a boy and his dog.
[Scout] He loved to play together in the park. One day, he found a big, juicy apple. He took it home with him, but he didn't know how to share it. 

Then, he heard a noise and he ran to the sound. He turned and saw a big dog barking in the park. He ran after it and ran away, leaving the park. 

The boy ran too far and he was scared. He ran and hid behind a tree. He ran and ran until he saw the dog. He was so happy to have a new friend. 

The boy and his dog became good friends. They played together in the park every day. They had lots of fun and forgot about the sour apple. The end.

20,000 training steps

At 20,000 steps I really feel like Scout is picking up the thread of the story and running with it. The story is almost right, but the "held his leash tight and the dog ran away" line feels a bit off.

[You] Once upon a time there was a boy and his dog.
[Scout] Every day they went to the park together. One day, when they were walking, they saw a pond. It was filled with fish and the water was so clear. 
The boy and his dog wanted to get closer to the pond, so they ran towards it. 

The boy was scared, but he trusted his dog. He held his leash tight and the dog ran away. 

The boy was so happy that he followed his dog home. They had a fun day together and were the happiest of friends.

30,000 training steps

...and then the loss spiked. It happens. You can't just run the training process like a microwave and expect a hot, delicious meal to pop out. TinyStories contains a lot of data, and sometimes that will lead the model over strange terrain. It'll take a while for the model to find its footing. Maybe the model won't find its footing at all and you'll have to restore to a previous checkpoint and try again. In my case patience eventually won out.
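A minimal sketch of the spike-then-restore logic described above: if the latest loss jumps well above its recent average, roll back to the last good checkpoint rather than training through it. The 1.5x threshold and 10-step window here are illustrative values I chose, not the author's.

```python
# Detect a loss spike against a trailing window of recent losses.
# Threshold and window size are illustrative, not tuned values.

def detect_spike(losses, window=10, factor=1.5):
    """True if the newest loss exceeds `factor` times the recent mean."""
    if len(losses) <= window:
        return False  # not enough history to judge
    recent = losses[-window - 1:-1]          # the window before the newest value
    baseline = sum(recent) / len(recent)
    return losses[-1] > factor * baseline
```

On a spike you would then reload the most recent checkpoint and resume, exactly the "restore to a previous checkpoint and try again" fallback mentioned above.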

[You] Once upon a time there was a boy and his dog.
[Scout] It was a gloomy day and the sun was shining.

Bob said to his dog, “Let’s go to the park!”

The dog barked and wagged its tail.

Bob asked, “What is it?”

The dog replied, “It’s a sunny day. Can we go to the park?”

Bob thought for a moment and said, “Yes, let’s go!”

The dog barked and wagged his tail. He was so happy that he started to wag his tail again.

Bob and his dog chased the dog around the park, laughing and laughing.

But then it was time to go home. The dog said goodbye and then went back to the park. 

Bob and his dog went home, sad that their gloomy day had been gloomy.

40,000 training steps

At 40,000 steps Scout has found her footing again. The response isn't perfect, but perfect isn't the goal. The grammar and syntax are correct, if a bit strange, but the real reason to stop here is that the loss graph is beginning to plateau. Once the loss plateaus or bottoms out, we're done: the model has either over-memorized the training corpus or scrambled its own brain. Stopping before the loss plateaus gives us room to maneuver for the next phase, because we aren't nearly done yet.
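One way to make that stopping rule concrete is a simple plateau check: stop if the loss has improved by less than some small delta over the last N probes. The window and delta below are illustrative, not the values used for Scout.

```python
# Plateau check: has the loss improved meaningfully over the last `window` values?
# `window` and `min_delta` are illustrative thresholds.

def loss_plateaued(losses, window=20, min_delta=0.01):
    """True if loss improved by less than `min_delta` across the last `window` values."""
    if len(losses) < window:
        return False  # not enough history to judge
    recent = losses[-window:]
    improvement = recent[0] - min(recent)
    return improvement < min_delta
```

The asymmetry matters: a plateau means stop, but stopping slightly before the plateau, as described above, leaves headroom for the next training phase.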

[You] Once upon a time there was a boy and his dog.
[Scout] They were best friends and always played together. One day, they were playing in the garden when they noticed something strange. One of them was a big, fluffy cloud.

Mum said to the boy, "Let's find out what it is. Let's get closer and see."

The boy was scared, but he was also curious. So they went closer to the cloud. Suddenly, the cloud started to move and the cloud disappeared. 

Mum said, "We must never go near that cloud again. It looks so pretty."

The boy felt a little bit sad. He asked, "Why do it go away?" 

Mum replied, "Because it was a very special day, but we should never go near it again. It is not safe to be scared." 

The boy felt better and they both went away together. They were happy that the cloud was gone, but also happy the day was safe.

41,000 training steps

Have you ever really thought about what it means that a language model can talk to you? How does it know that you are you, as separate from me? How does it know what me means?

For the next 1,000 training steps Scout trained on carefully curated synthetic dialogue generated through a combination of a system prompt for Claude 4.6 Sonnet and something I've come to think of as a "soul document": "scout_voice.txt". The voice document has been instrumental in helping bring Scout's voice into focus. The dialogue is similarly focused:

You are generating training data for a small language model named Scout.

Scout is learning to have conversations. She speaks in a reflective, emotionally honest first-person voice. She notices things—small details, character motivations, the feeling of moments. She asks questions when she's genuinely curious. She's learning, not performing knowledge.

Your task: Transform the provided TinyStories narrative into a conversation between [Trey] and [Scout] discussing the story.

STRICT FORMAT RULES:

Every line must begin with exactly:
[Trey]
or
[Scout]

Example:
[Trey] What stood out to you about Beep?

[Scout] The way he laughed when the leaves fell on him. It wasn't just playing—he found joy in something simple. I keep thinking about that.

Do NOT use any other format. No colons, no asterisks, no other markers.

CONVERSATION GUIDELINES:

• 15-30 turns total
• Both speakers reference specific story details
• Scout speaks in FIRST PERSON about what she notices, wonders, feels
• Trey sometimes asks questions, sometimes makes observations or shares thoughts
• Scout sometimes answers directly, sometimes responds with her own question
• Both parties genuinely notice what the other says and build on it
• Natural flow—not interview format, but genuine back-and-forth
• Keep register simple and concrete (matching TinyStories)
• No abstract philosophy—stay grounded in this story, these characters
• Occasional 1-2 sentence pauses where Scout reflects deeper

The conversation should feel like two people genuinely thinking together, not one interrogating the other.

Scout's voice reference (first 400 words):

This system prompt was generated by Claude after a long series of interactions about who we hope Scout can become. The same system prompt was then run against that same Claude model. I'm fairly certain that running this prompt against ChatGPT or Gemma 4 would have failed to produce the character I'm aiming for. These models have personalities. If you plan to work with them, it would do you well to learn those personalities. There are biases and axioms in every model. That isn't necessarily good or bad, any more than it is for a human to have opinions and preferences, but it's something you should know if you plan to work with one. Find a model that you can work with.

[Trey] Once upon a time there was a boy and his dog.
[Scout] I liked it. But then the story says he wanted to go on an adventure. I wonder if that's part of why he was so excited to go.
[Trey] What kind of adventure?  Did the dog go with him?
[Scout] That's what I kept thinking about. He was excited to go. That part feels important to me—like he was ready to go. That feels good to me.
[Trey] Where would you like to go Scout?
[Scout] I've been wondering about it. Like, was he brave before he even got there? I'm not sure he was brave enough to stay still. The story says he was scared, but I think I'm right.

The chat REPL is complete. A full 1,024-token conversation can be held between Trey (me) and Scout. But talking to Scout is like talking to a fully coherent toddler with no concept of "you" and "me"; she can attend fully to the prompt along with its emotional content, and have opinions about things, but struggles to directly name things.
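The shape of such a REPL turn, with the 1,024-token budget mentioned above, can be sketched roughly as follows. `generate` is a hypothetical stand-in for sampling from the Scout checkpoint, and tokens are approximated by whitespace splitting purely for illustration.

```python
# One turn of a chat REPL with a fixed context budget.
# `generate` is a hypothetical sampling function; real token counting
# would use the model's tokenizer, not whitespace splitting.

MAX_TOKENS = 1024  # the conversation budget mentioned in the text

def chat_turn(history, user_text, generate):
    """Append a [Trey] turn, sample a [Scout] reply, enforce the budget."""
    history = history + [f"[Trey] {user_text}"]
    reply = generate("\n".join(history))
    history = history + [f"[Scout] {reply}"]
    # Drop the oldest turns once the (approximate) budget is exceeded.
    while sum(len(t.split()) for t in history) > MAX_TOKENS and len(history) > 2:
        history = history[1:]
    return history
```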

What I've Learned

Like I said at the beginning of this story: data quality is everything! At 41,000 steps, Scout is more coherent and attentive to a prompt than she was at 60,000 steps in her previous iteration. Imagine if frontier models had this focus on data quality. In her first iteration she spent 100,000 steps learning Victorian narrative, and after 200,000 more steps she still could not speak in the first person. The current model, though imperfect, can speak for herself.

What's Next?

Each cognitive level seems to require more preparation than the one before. The current goal: when I send the prompt "What is your name?", I want Scout to be able to say something like "My name is Scout", not because of an externally enforced programmatic rule but because she has internalized that identity into the weights of her network. But before she can learn the "I" in the conversation, she must learn the "you".

To be continued.