Why "Local LLM Operation" Now
Cloud LLMs are easy and high-performance, but cost predictability, handling confidential data, latency (response delay), and API limits can become walls. That is why self-hosting—running a large language model (LLM) on hand or in your own environment—is drawing attention.
The appeal of local operation is broadly threefold.
- Don't send data outside: good affinity with RAG (retrieval-augmented generation) that handles internal documents and customer information
- Control cost: the more usage increases, the more advantageous it tends to be over metered billing
- Optimize to your liking: high freedom in model swapping, quantization, inference settings, and monitoring
In this article, centered on Ollama, which developers can touch quickly, and vLLM, a high-throughput inference server for production, we summarize the crux of operations that doesn't end at just "running it."
The Big Picture: Dividing Use Between Ollama and vLLM
Clarifying the roles first reduces confusion.
Ollama: The Shortest Route for Local Development and Prototyping
Ollama is a tool that handles model fetching, startup, and execution all together, and is strong for "just try it locally for now." It is easy to install on Mac/Windows/Linux, and model management is simple. It is handy when running a PoC within a team.
vLLM: An Inference Platform Strong at Production and High Load
vLLM is a server that, through inference optimization (especially PagedAttention), easily earns throughput on the same GPU. Many configurations can be provided as an OpenAI-compatible API, so migration on the app side is relatively easy. Its true worth emerges in cases of long-running operation, increasing concurrent requests, and team operation.
A recommended way of thinking
Start with Ollama for model selection then internal evaluation then, once requirements are fixed, vLLM for production—that goes smoothly.
Preparation: A Realistic Talk About Hardware and Model Selection
GPU/VRAM rough guide
For local LLMs, required resources change with "model size" and "quantization (lowering precision to lighten it)." A rough guide image is as follows.
- 7B–8B: realistic from about VRAM 6–10 GB with quantization (development/chat use)
- 13B–14B: VRAM 12–24 GB is safe (balance of quality and speed)
- 30B+: VRAM in the 48 GB class, or multiple GPUs in view (a serious inference platform)
Of course CPU inference is also possible, but perceived speed depends on the use. If used daily as an internal tool, GPU operation has less stress.
Be decisive: split models "by use"
People tend to chase an all-purpose model, but in operations, by-use is stable.
- Chat/summarization: a model good at general instruction following
- Code support: a model strong in code
- Internal QA: hits and misses are often decided by RAG design (search quality)
Further, if you emphasize "Japanese quality," comparing in advance with derivative models strong in Japanese or evaluation benchmarks (e.g., a Japanese QA set) saves trouble later.
