I came across an excellent in-depth discussion of memory and compute scaling analysis for LLMs. One takeaway is that running LLMs locally or on a private cloud is wasteful: decode is dominated by memory bandwidth, not compute, so serving large batches during inference is far more efficient, and a single-user setup leaves most of that efficiency on the table.
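The intuition, as I understood it: each decode step has to stream all the model weights from HBM once, regardless of batch size, so one weight pass can serve a whole batch of tokens almost for free until you hit the compute ceiling. A rough back-of-the-envelope roofline sketch (my own illustrative H100-ish numbers, not figures from the talk; it ignores KV-cache traffic and attention FLOPs):

    # Back-of-the-envelope roofline model for LLM decode throughput.
    # All numbers are illustrative assumptions: a 70B-param model in fp16
    # on a GPU with ~3.3 TB/s HBM bandwidth and ~1e15 fp16 FLOP/s.

    PARAMS = 70e9          # model parameters (assumed)
    BYTES_PER_PARAM = 2    # fp16 weights
    HBM_BW = 3.3e12        # bytes/s of memory bandwidth (assumed)
    PEAK_FLOPS = 1.0e15    # fp16 FLOP/s (assumed)

    weight_bytes = PARAMS * BYTES_PER_PARAM  # streamed once per forward pass
    flops_per_token = 2 * PARAMS             # ~2 FLOPs per param per token

    def decode_tokens_per_sec(batch_size: int) -> float:
        """Tokens/s for one decode step at a given batch size.

        One forward pass reads all weights once (memory cost is roughly
        constant in batch size) but does batch_size * flops_per_token of
        compute, so step time is whichever bound is slower.
        """
        time_memory = weight_bytes / HBM_BW                       # flat in batch
        time_compute = batch_size * flops_per_token / PEAK_FLOPS  # grows with batch
        return batch_size / max(time_memory, time_compute)

    for b in (1, 8, 64, 256):
        print(f"batch {b:>3}: ~{decode_tokens_per_sec(b):,.0f} tok/s")

With these numbers, batch 1 gets you roughly 24 tok/s while paying for the full 140 GB weight stream every step; batch 256 gets ~250x the throughput for the same memory traffic, and the compute bound only kicks in around batch ~300. That's the gap a big shared serving fleet exploits and a single local user can't.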
Highly recommend: "How GPT, Claude, and Gemini are actually trained and served" with Reiner Pope.