Nanochat vs Llama for training from scratch? [P]

Reddit r/MachineLearning / 4/24/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Models & Research

Key Points

  • The author is training a model entirely from historical data and previously used Nanochat, which worked well for pretraining and SFT but caused interoperability issues afterward.
  • Although some effort exists to make Nanochat models compatible with Transformers, the version they trained with does not output a Transformers-compatible model.
  • They are now considering switching to the Llama architecture and Hugging Face Transformers’ Trainer to create an open-source project that others can easily use via Transformers.
  • The post weighs tradeoffs between Nanochat benefits (e.g., auto-scaling via a depth parameter) and the need for standard interoperability, asking whether Llama is the best choice or if there’s a better alternative.
  • The author is also considering either rebuilding the workflow with Llama/Transformers or reusing Nanochat and later writing an export script to convert Nanochat outputs into Hugging Face-compatible formats.

Hey all - I'm working on a project training a model entirely on historical data, which I've posted about before on this subreddit. My last training run used Nanochat, and while it was very successful for pretraining and SFT of the initial model, I'm finding that Nanochat is great for getting a model up and running but not so great for interoperability. There has been a little work on making Nanochat Transformers-compatible, but the latest version of Nanochat (which I trained with) doesn't produce a Transformers-compatible model.
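For what it's worth, the export-script route usually boils down to remapping checkpoint keys. Here's a minimal sketch of that idea: take a raw PyTorch state dict and rename its parameters into the scheme a Transformers model expects. Every key name below is hypothetical — the real prefixes and the full rule table would have to be read off the actual Nanochat checkpoint for the version in question.

```python
import re

import torch

# Hypothetical rename rules from a Nanochat-style state dict to HF
# Llama-style names. A real converter needs one rule per parameter
# family (attention, MLP, norms, embeddings, head), derived from the
# actual checkpoint keys -- these three are illustrative only.
KEY_RULES = [
    (r"^transformer\.wte\.weight$", "model.embed_tokens.weight"),
    (r"^transformer\.h\.(\d+)\.attn\.q_proj\.weight$",
     r"model.layers.\1.self_attn.q_proj.weight"),
    (r"^lm_head\.weight$", "lm_head.weight"),
]


def remap_state_dict(sd):
    """Rename every checkpoint key via KEY_RULES; fail loudly on gaps."""
    out = {}
    for key, tensor in sd.items():
        for pattern, repl in KEY_RULES:
            if re.match(pattern, key):
                out[re.sub(pattern, repl, key)] = tensor
                break
        else:
            raise KeyError(f"no rename rule for checkpoint key: {key}")
    return out


# Toy demonstration with stand-in tensors instead of a real checkpoint:
fake_ckpt = {
    "transformer.wte.weight": torch.zeros(8, 4),
    "transformer.h.0.attn.q_proj.weight": torch.zeros(4, 4),
    "lm_head.weight": torch.zeros(8, 4),
}
hf_sd = remap_state_dict(fake_ckpt)
# The remapped dict could then go into a LlamaForCausalLM built from a
# matching LlamaConfig via model.load_state_dict(hf_sd), followed by
# model.save_pretrained(...) to get a standard HF checkpoint.
```

The fail-loudly behavior matters here: a silent partial mapping would load without error and produce a subtly broken model, so an unmatched key should stop the conversion.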

So, I'm considering doing my next training run with the Llama architecture and the Transformers `Trainer` class. I have assembled a much larger dataset for pretraining, and I want this to be an open-source project that people can access through Transformers. However, I know there are advantages to Nanochat (such as the auto-scaling `--depth` parameter). All that said, is Llama the best architecture for this scenario? Is there a better option I could use here? Or do I just go with Nanochat again and hope I can build out a Nanochat-to-HF export script on the other side?

submitted by /u/centerstate