Getting Started with RamaLama on Fedora

Dev.to / 4/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • RamaLama is an open-source, container-based tool for running AI models locally, aiming to make inference setup predictable by using OCI images matched to detected hardware.
  • On Fedora, the guide recommends using Podman as the default container engine and ensuring adequate disk space and RAM for model sizes (about 8GB+ for smaller models and 16GB+ for 7B+).
  • Installation is straightforward via the Fedora default repositories using `sudo dnf install ramalama`, followed by a version check with `ramalama version`.
  • RamaLama automatically inspects the system for GPU support (falling back to CPU if needed), pulls an OCI image that includes `llama.cpp` for inference, and caches models locally to avoid repeated downloads.
  • The tool supports multiple model registries (defaulting to Ollama) and lets users select sources via transport prefixes such as `ollama://` and `huggingface://` (or `hf://`).

RamaLama is an open-source tool built under the containers organization that makes running AI models locally as straightforward as working with containers. The goal is to make AI inference boring and predictable. RamaLama handles host configuration by pulling an OCI (Open Container Initiative) container image tuned to the hardware it detects on your system, so you skip the manual dependency setup entirely.

If you already work with Podman or Docker, the mental model is familiar. Models are pulled, listed, and removed much like container images.

Prerequisites

Before installing RamaLama, make sure you have the following:

  • A Fedora system (this guide uses Fedora with dnf)
  • Podman installed (RamaLama uses it as the default container engine)
  • Sufficient disk space for model storage (models range from ~2GB to 10GB+)
  • At least 8GB RAM for smaller models; 16GB+ recommended for 7B+ parameter models
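As a quick sanity check before pulling a multi-gigabyte model, a short script can report available RAM and free disk space (a minimal sketch using only the Python standard library; the thresholds mirror the prerequisites above, and checking `$HOME` for disk space is an assumption based on where the model cache typically lives):

```python
import os
import shutil

def total_ram_gib() -> float:
    """Total physical RAM in GiB (via POSIX sysconf; works on Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30

def free_disk_gib(path: str = "/") -> float:
    """Free disk space in GiB on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 2**30

if __name__ == "__main__":
    ram = total_ram_gib()
    disk = free_disk_gib(os.path.expanduser("~"))  # assumption: model cache under $HOME
    ram_note = "ok for 7B+" if ram >= 16 else "ok for small models" if ram >= 8 else "below minimum"
    disk_note = "ok" if disk >= 10 else "may be tight for larger models"
    print(f"RAM:  {ram:5.1f} GiB ({ram_note})")
    print(f"Disk: {disk:5.1f} GiB free ({disk_note})")
```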

Installation

On Fedora, RamaLama is available directly from the default repositories:

sudo dnf install ramalama

Once installed, verify the version:

ramalama version

Expected output:

ramalama version x.x.x

How It Works

On first run, RamaLama inspects your system for GPU support and falls back to CPU if no GPU is found. It then pulls the appropriate OCI container image with all the inference dependencies baked in, including llama.cpp, which powers the model execution layer. Models are stored locally and reused across runs, so the pull only happens once per model.

Model Registries

RamaLama supports pulling models from multiple registries. The default registry is Ollama, but you can reference models from any supported source using a transport prefix:

| Registry | Prefix |
| --- | --- |
| Ollama | `ollama://` or no prefix |
| Hugging Face | `huggingface://` or `hf://` |
| ModelScope | `modelscope://` or `ms://` |
| OCI Registries | `oci://` |
| RamaLama Registry | `rlcr://` |
| Direct URL | `https://`, `http://`, `file://` |

Pulling and Running Models

From Ollama (default)

ramalama run granite3.1-moe:3b

This pulls the granite3.1-moe:3b model from the Ollama registry and drops you into an interactive chat session. On first run, the model is downloaded to local storage; subsequent runs reuse it.

From Hugging Face

ramalama run huggingface://MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF

Note: Some newer Hugging Face models may fail with a `gguf_init_from_file_impl: failed to read magic` error due to format incompatibilities with llama.cpp. When that happens, look for a pre-converted GGUF version of the same model on Hugging Face by searching the model name with "GGUF" appended. In this case, `MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF` worked as a compatible alternative.

Useful Flags

Set Context Window Size; --ctx-size / -c

By default, RamaLama does not override the model's native context length. For llama3.1:8b, that default is 131072 tokens, which requires ~16GB of KV cache allocation, well above what most dev machines can handle.

Use the -c flag to cap the context size:

ramalama run -c 16384 llama3.1:8b

A context size of 16384 tokens requires ~2GB of KV cache for llama3.1:8b. You can use the KV Cache Size Calculator to find the right value for your available memory and target model. On memory-constrained machines, this flag is often the difference between a model that loads and one that gets killed for exhausting memory.
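The arithmetic behind those numbers is simple to reproduce. Per token, the cache stores one key and one value vector per layer per KV head; the defaults below match the published Llama 3.1 8B architecture (32 layers, 8 KV heads of dimension 128) with an fp16 cache, which is an assumption about the cache dtype in use:

```python
def kv_cache_bytes(ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes:
    2 (K and V) x layers x tokens x kv_heads x head_dim x dtype size.
    Defaults correspond to Llama 3.1 8B with an fp16 cache."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# Native context vs. the capped value from the example above
print(kv_cache_bytes(131072) / 2**30)  # → 16.0 (GiB)
print(kv_cache_bytes(16384) / 2**30)   # → 2.0 (GiB)
```

This matches the figures in the text: the full 131072-token context needs 16 GiB of cache, while capping at 16384 brings it down to 2 GiB.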

Set Temperature; --temp

Temperature controls the randomness of the model's output. The default is typically around 0.8. Setting it to 0 makes the model more deterministic:

ramalama run --temp 0 granite3.1-moe:3b

A temperature of 0 is useful for factual Q&A or benchmarking where you want consistent, reproducible outputs. Keep in mind it reduces randomness, not hallucination. If the knowledge is absent from the model's training data, --temp 0 will just make it consistently wrong.
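To see why temperature controls randomness, note that sampling divides the model's logits by the temperature before the softmax: a high temperature flattens the distribution, a low one sharpens it, and 0 collapses to always picking the highest-scoring token. A toy illustration in plain Python (not RamaLama's actual sampling code):

```python
import math

def sample_probs(logits, temp):
    """Softmax over logits / temp; temp == 0 degenerates to argmax (greedy)."""
    if temp == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(sample_probs(logits, 0.8))  # spread-out distribution: any token can be sampled
print(sample_probs(logits, 0))   # → [1.0, 0.0, 0.0]: always the top token
```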

Select Inference Backend; --backend

RamaLama auto-detects the best backend for your hardware, but you can override it explicitly:

ramalama run --backend vulkan granite3.1-moe:3b   # AMD/Intel or CPU fallback
ramalama run --backend cuda granite3.1-moe:3b     # NVIDIA
ramalama run --backend rocm granite3.1-moe:3b     # AMD ROCm

On systems without a GPU, RamaLama falls back to CPU inference automatically.

Enable Debug Output; --debug

--debug is a global flag and must be placed before the subcommand:

ramalama --debug run granite3.1-moe:3b

This prints the underlying container commands RamaLama executes, hardware detection steps, and registry fetch details. Useful when troubleshooting model compatibility issues, unexpected behavior, or hardware detection problems.

Managing Models

List locally stored models:

ramalama list

Pull a model without running it:

ramalama pull llama3.1:8b

Remove a model from local storage:

ramalama rm llama3.1:8b

Serving a Model as an API

RamaLama can expose a model as an OpenAI-compatible REST endpoint:

ramalama serve granite3.1-moe:3b

This starts a local server on port 8080 by default. You can point any OpenAI-compatible client at it without changing how those clients are written. Useful for integrating a local model into applications, RAG pipelines, or tooling like LangChain and LlamaIndex.
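Because the endpoint follows the OpenAI chat-completions format, a standard-library client is enough to talk to it (a sketch assuming `ramalama serve` is running on the default port 8080 and that the model name matches what you served):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    """Payload in the OpenAI /v1/chat/completions format served by llama.cpp."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }

def chat(prompt: str, model: str = "granite3.1-moe:3b",
         base_url: str = "http://localhost:8080") -> str:
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Say hello in one short sentence."))
```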

Web UI

When running ramalama serve, a browser-based chat interface is available at http://localhost:8080 by default. To disable it:

ramalama serve --webui off granite3.1-moe:3b

The web UI is powered by the llama.cpp HTTP server's built-in interface and gives you a quick way to interact with the model without writing any client code.

Things to Watch Out For

  • Model format compatibility: Some Hugging Face models require a pre-converted GGUF version to work with RamaLama. Stick to GGUF-format models when in doubt.
  • Memory and context size: Always check the model's default context length before running on a memory-constrained machine. Use -c to cap it appropriately.
  • Model size vs. accuracy: Smaller models (3B) are fast and lightweight but may lack knowledge on niche topics. For factual accuracy, 7B+ models are noticeably more reliable.
  • --debug flag placement: It must come before the subcommand, i.e. `ramalama --debug run`, not `ramalama run --debug`.
  • RamaLama is still in active development: The project moves fast. Flag names, behaviors, and supported features can change between versions. When in doubt, check ramalama --help or the official docs.
