Nemotron 3 Super 120B is out, and I had a bit of trouble getting it running on my Strix Halo with llama.cpp due to a tensor shape error.
I realize I may just be a dumbass and everyone else may have figured this out with no issues, but I wanted to post this in case someone else ran into problems.
My setup: an AMD Ryzen AI MAX+ 395 (Strix Halo) with 128GB of LPDDR5x unified memory and a Radeon 8060S iGPU (gfx1151).
Model: Nemotron 3 Super 120B-A12B - 120B parameters (12B active per inference), 1M native context, hybrid MoE+SSM architecture
Executive Summary
| Method | Status | Memory | Notes |
|--------|--------|--------|-------|
| llama.cpp + GGUF Q4_K_M | Working | ~82GB model + KV | Tested, production-ready |
| vLLM 0.17 + BF16 | Untested | ~240GB | Requires tensor parallelism cluster |
The GGUF quantization works with llama.cpp. The BF16 route should work with vLLM, but it requires downloading ~240GB and ideally a multi-GPU setup; I haven't tested it since I don't have a cluster.
Architecture Notes
Strix Halo uses unified memory - the GPU accesses system RAM directly. BIOS VRAM settings of 1GB are correct; the iGPU uses shared memory through the fabric, not dedicated VRAM. This means your effective VRAM is system RAM minus OS overhead (~124GB usable).
What Works: llama.cpp + GGUF
BIOS Configuration:
- Above 4G Decoding: Enabled
- Re-Size BAR Support: Enabled
- UMA Frame Buffer Size: 1GB (unified memory handles the rest)
Kernel Parameters:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdttm.pages_limit=27648000 amdttm.page_pool_size=27648000"
These expand the TTM memory pool for GPU access to unified memory. Run sudo update-grub (Debian/Ubuntu) or sudo grub2-mkconfig -o /boot/grub2/grub.cfg (Fedora) after.
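For reference, TTM pages are 4 KiB each, so those page counts work out to roughly 105 GiB for the GPU pool; you can also confirm the module picked up the values after reboot (the sysfs path assumes the DKMS amdttm module from the ROCm amdgpu stack):

```shell
# 27648000 pages x 4 KiB/page ~= 105 GiB available to the GPU via TTM
pages=27648000
echo "$(( pages * 4 / 1024 / 1024 )) GiB"

# After rebooting, verify the parameters actually took effect
cat /sys/module/amdttm/parameters/pages_limit 2>/dev/null
```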
ROCm 7.2 Installation (Fedora):
sudo dnf install rocm-dev rocm-libs rocm-utils
sudo usermod -aG render,video $USER
Verify: rocminfo | grep gfx1151
llama.cpp Build:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && mkdir build && cd build
cmake .. -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151
make -j$(nproc)
The target specification is critical - without it, cmake builds all AMD architectures.
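As a quick sanity check, you can grep the CMake cache (from the build/ directory created above) to confirm both flags stuck before kicking off the long compile:

```shell
# From the build/ directory: confirm HIP is on and only gfx1151 is targeted
grep -E 'GGML_HIP|AMDGPU_TARGETS' CMakeCache.txt
```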
Model Download:
pip install huggingface_hub
huggingface-cli download unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF \
Q4_K_M/nvidia_Nemotron-3-Super-120B-A12B-Q4_K_M-00001-of-00003.gguf \
Q4_K_M/nvidia_Nemotron-3-Super-120B-A12B-Q4_K_M-00002-of-00003.gguf \
Q4_K_M/nvidia_Nemotron-3-Super-120B-A12B-Q4_K_M-00003-of-00003.gguf \
--local-dir ~/models/q4 --local-dir-use-symlinks False
Three shards totaling ~82GB. Shard 1 is 7.6MB (metadata only) - this is correct, not a failed download.
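A quick way to confirm everything landed (adjust the path to wherever the shards actually ended up on your machine):

```shell
# Expect three shards; the ~7.6MB first shard is metadata, not a truncated download
ls -lh ~/models/q4/*.gguf
echo "shards: $(ls ~/models/q4/*.gguf 2>/dev/null | wc -l)"
```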
Server Launch:
./llama-server \
-m ~/models/q4/nvidia_Nemotron-3-Super-120B-A12B-Q4_K_M-00001-of-00003.gguf \
--port 8080 -c 393216 -ngl 99 --no-mmap --timeout 1800
Parameters:
- -c 393216: 384K context (conservative for memory safety)
- -ngl 99: Full GPU offload
- --no-mmap: Required for unified memory architectures
- --timeout 1800: 30-minute timeout for large context operations
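Once the model finishes loading, a quick smoke test against llama-server's built-in endpoints (port from the launch command above) confirms it's serving:

```shell
# Health check; returns an ok status once the model has loaded
curl -s http://localhost:8080/health

# Minimal OpenAI-compatible chat completion request
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Reply with one word."}],"max_tokens":8}'
```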
Systemd Service (Fedora):
Note: On Fedora with SELinux enforcing, binaries in home directories need proper context.
Create service file:
sudo tee /etc/systemd/system/nemotron-server.service << 'EOF'
[Unit]
Description=Nemotron 120B Q4_K_M LLM Server (384K context)
After=network.target rocm.service
Wants=rocm.service
[Service]
Type=simple
User=ai
WorkingDirectory=/home/ai/llama.cpp
ExecStart=/home/ai/llama.cpp/build/bin/llama-server -m /home/ai/models/q4/nvidia_Nemotron-3-Super-120B-A12B-Q4_K_M-00001-of-00003.gguf --port 8080 -c 393216 -ngl 99 --no-mmap --timeout 1800
Restart=always
RestartSec=10
Environment=HOME=/home/ai
Environment=PATH=/usr/local/bin:/usr/bin:/bin
[Install]
WantedBy=multi-user.target
EOF
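To finish up, give the binary an SELinux context systemd can execute and enable the unit. The bin_t context below is a common fix for executables under /home on enforcing Fedora; your policy may call for something different:

```shell
# Let systemd execute a binary living under /home on SELinux-enforcing Fedora
sudo chcon -t bin_t /home/ai/llama.cpp/build/bin/llama-server

# Load and start the unit
sudo systemctl daemon-reload
sudo systemctl enable --now nemotron-server.service
sudo systemctl status nemotron-server.service --no-pager
```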
I tried the MXFP4 GGUF with no joy, but the Q4 seems to be working very well. I'm able to get a comfortable 384K context and have been testing; I average 14-17 tok/sec. I had to raise the timeout because operations can run long at larger context sizes.
Hopefully this helps someone. Any suggestions for improvement are welcome as well. I’m not super great at this stuff, and other people posting things was how I was able to work it out.