I’ve been experimenting with different LLMs recently, and one challenge I keep running into is managing the workflow when comparing outputs across models.
For example, when testing prompts or agent-style tasks, I often want to see how different models handle the same instruction. The problem is that jumping between separate interfaces and APIs makes it hard to keep conversation context consistent, especially when iterating quickly.
Some things I’ve been wondering about:
- Do most people here just stick with one primary model, or do you regularly compare several?
- If you compare models, how are you keeping prompt context and outputs organized?
- Are you using custom scripts, frameworks, or some kind of unified interface for testing? (Rough sketch of what I mean below.)
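
For concreteness, here's the kind of minimal sketch I have in mind: fan one prompt (plus shared history) out to several backends behind OpenAI-compatible chat endpoints and collect the replies side by side. This assumes the `openai` Python client; the model names, the local port, and the env var are placeholders, not recommendations.

```python
import os
from openai import OpenAI

# One entry per backend: label -> base URL, key, and model name.
# The "local" entry assumes an Ollama server on its default port;
# Ollama ignores the API key, but the client requires one to be set.
BACKENDS = {
    "hosted": {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ["OPENAI_API_KEY"],  # raises if unset
        "model": "gpt-4o-mini",
    },
    "local": {
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",
        "model": "llama3",
    },
}

def compare(prompt: str, history: list[dict] | None = None) -> dict[str, str]:
    """Send the same prompt (plus any shared history) to every backend."""
    messages = (history or []) + [{"role": "user", "content": prompt}]
    results = {}
    for label, cfg in BACKENDS.items():
        client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
        resp = client.chat.completions.create(model=cfg["model"], messages=messages)
        results[label] = resp.choices[0].message.content
    return results

if __name__ == "__main__":
    for label, answer in compare("Explain vector clocks in two sentences.").items():
        print(f"--- {label} ---\n{answer}\n")
```
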
I’m particularly interested in how people here approach this when working with local models alongside hosted ones.
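(Part of why the sketch above only needs a base_url swap: local runners like Ollama, llama.cpp's server, and vLLM all expose OpenAI-compatible endpoints, so local and hosted models can share the same client code.)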
Curious to hear how others structure their workflow when experimenting with multiple LLMs.