[Release] Carnice-9b-W8A16-AWQ – AWQ Quantization Optimized for vLLM + Marlin on Ampere GPUs (Single-GPU)

Reddit r/LocalLLaMA / 4/12/2026


Key Points

  • The post releases an 8-bit symmetric AWQ quantized version (W8A16) of the kai-os/Carnice-9b model, optimized for single-GPU inference on Ampere (RTX 30-series) using vLLM with the Marlin kernel.
  • Carnice-9b is described as a Qwen/Qwen3.5-9B fine-tune adapted for text-only/agentic use (visual components removed) and built on Qwen3_5ForCausalLM, with a compatibility re-wrapping step to load correctly in vLLM as Qwen3_5ForConditionalGeneration.
  • The author notes vLLM does not yet natively support the underlying Qwen3_5ForCausalLM architecture (referencing a pending vLLM PR), and the release workaround targets correct serving via the --language-model-only flag.
  • Reported vLLM benchmarks on a single RTX 3090 using Marlin show ~1,994 tokens/s average prompt throughput and ~222 tokens/s average generation throughput.
  • A sample vLLM serve command is provided, and the author asks for feedback to improve future quantization releases and benchmark performance in Hermes agent environments.

Hey r/LocalLLaMA,

I am releasing my first model quantization: an 8-bit symmetric AWQ (W8A16) of kai-os/Carnice-9b, specifically optimized for Ampere GPUs (RTX 30-series) using vLLM with the Marlin kernel on a single-GPU inference setup.

kai-os/Carnice-9b is a specialized fine-tune of Qwen/Qwen3.5-9B that removes the visual components and adopts the Qwen3_5ForCausalLM architecture for pure text/agentic use (Hermes Agent harness). This architecture is not yet natively supported by vLLM (pending PR #39316).

To enable seamless loading, the quantized checkpoint re-wraps the weights into the Qwen3_5ForConditionalGeneration architecture (matching the original Qwen/Qwen3.5-9B configuration). This allows vLLM to serve it correctly with the --language-model-only flag for text-only inference.
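The re-wrapping step described above amounts to re-tagging the checkpoint so vLLM routes it through its existing Qwen3_5ForConditionalGeneration code path. A minimal sketch of that idea, operating on the model's config.json contents, is below. This is an illustration only: the actual release may adjust more than the `architectures` field (e.g. nested text/vision config entries), and the helper name is hypothetical.

```python
import json

def rewrap_architecture(config: dict) -> dict:
    """Re-tag a Qwen3_5ForCausalLM checkpoint config so vLLM loads it
    through its Qwen3_5ForConditionalGeneration path (workaround until
    native Qwen3_5ForCausalLM support lands in vLLM).

    NOTE: hypothetical helper; a real re-wrap may need further config
    changes beyond the `architectures` list.
    """
    cfg = dict(config)  # shallow copy; don't mutate the caller's dict
    cfg["architectures"] = ["Qwen3_5ForConditionalGeneration"]
    return cfg

# Demonstrate on a stand-in config (the real input is the model's config.json):
cfg = {"architectures": ["Qwen3_5ForCausalLM"], "model_type": "qwen3_5"}
rewrapped = rewrap_architecture(cfg)
print(json.dumps(rewrapped["architectures"]))
# -> ["Qwen3_5ForConditionalGeneration"]
```

In practice you would read config.json from the quantized checkpoint directory, apply the swap, and write it back before uploading.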

Model: https://huggingface.co/TurbulenceDeterministe/Carnice-9b-W8A16-AWQ

Benchmark highlights (vLLM bench on random dataset, single RTX 3090 + Marlin):
• Average prompt throughput: ~1,994 tokens/s
• Average generation throughput: ~222 tokens/s

I'm gonna run some benchmarks specific to the Hermes agent environment next (Terminal Bench Lite and YC bench). From a quick vibe check it seems pretty good.

Quick vLLM usage (single GPU):

vllm serve TurbulenceDeterministe/Carnice-9b-W8A16-AWQ \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --language-model-only \
  --tensor-parallel-size 1
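Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal stdlib-only sketch of a chat request is below, assuming the default port 8000 and default endpoint path; adjust to your setup.

```python
import json
from urllib import request

# OpenAI-compatible chat request against the local vLLM server.
# Assumes `vllm serve` is running on the default port 8000.
payload = {
    "model": "TurbulenceDeterministe/Carnice-9b-W8A16-AWQ",
    "messages": [
        {"role": "user", "content": "Summarize AWQ quantization in one sentence."}
    ],
    "max_tokens": 128,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at `http://localhost:8000/v1`) works the same way.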

I would greatly appreciate your feedback on how to improve future quantizations. Thank you!

submitted by /u/Imakerocketengine