Qwen3.5-27B-IQ3_M, 5070ti 16GB, 32k context: ~50t/s

Reddit r/LocalLLaMA / 3/12/2026

Key Points

The post reports that Qwen3.5-27B runs locally on a 5070ti 16GB card with 32k context and achieves unexpectedly high prompt throughput.
The results build on the willbnu/Qwen-3.5-16G-Vram-Local repo but require a specific locked profile and configuration to reproduce.
With ctx-checkpoints set to 0, benchmarked prompt throughput is about 462.7–478.3 t/s, versus about 185.8 t/s at the baseline (ctx-checkpoints=8); generation holds at about 48 t/s in every configuration.
The final setup, listed so others can replicate it, uses Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-IQ3_M.gguf with 32,768 context, 99 GPU layers, iq4_nl caches, batch 1024 / 512, 6 threads, ctx-checkpoints 0, flash attention on, and port 8004.

I wanted to share this one with the community, as I was surprised I got it working, and that it's as performant as it is. IQ3 is generally really, really bad on any model... but I've found that not to be the case on Qwen3.5, since the 27B is just so capable.

My starting point was this: https://github.com/willbnu/Qwen-3.5-16G-Vram-Local but I wasn't able to fully reproduce the results seen there until I configured it as below.

Benchmark comparison
- Baseline (ctx-checkpoints=8, Q3_K_S): prompt ≈ 185.8 t/s, gen ≈ 48.3 t/s (qwen-guide/benchmark_port8004_20260311_233216.json)
- ctx-checkpoints=0 (same model): prompt ≈ 478.3 t/s, gen ≈ 48.7 t/s (qwen-guide/benchmark_port8004_20260312_000246.json)
- Hauhau IQ3_M locked profile (port 8004): prompt ≈ 462.7 t/s, gen ≈ 48.4 t/s (qwen-guide/benchmark_port8004_20260312_003521.json)
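To put the three runs in perspective, here is a small Python sketch that computes each run's prompt-speed ratio against the baseline and a rough time to ingest a full 32,768-token prompt. The throughput figures are copied from the benchmark comparison above; the assumption that prompt throughput stays constant over the whole context is mine, so the times are ballpark only.

```python
# Prompt and generation throughput (tokens/s) as reported in the post.
BASELINE = "baseline (ctx-checkpoints=8)"
RUNS = {
    BASELINE: (185.8, 48.3),
    "ctx-checkpoints=0": (478.3, 48.7),
    "Hauhau IQ3_M locked profile": (462.7, 48.4),
}
CONTEXT_TOKENS = 32_768


def summarize(runs, context_tokens=CONTEXT_TOKENS):
    """Return {name: (prompt speedup vs. baseline, seconds to ingest a full context)}."""
    baseline_prompt_tps = runs[BASELINE][0]
    summary = {}
    for name, (prompt_tps, _gen_tps) in runs.items():
        speedup = prompt_tps / baseline_prompt_tps
        # Assumes constant prompt throughput across the whole context (a simplification).
        full_prompt_seconds = context_tokens / prompt_tps
        summary[name] = (speedup, full_prompt_seconds)
    return summary


if __name__ == "__main__":
    for name, (speedup, seconds) in summarize(RUNS).items():
        print(f"{name}: {speedup:.2f}x prompt speed, ~{seconds:.0f} s for a full 32k prompt")
```

Generation speed barely moves across the three runs (48.3 to 48.7 t/s), so the gain shows up on the prompt side only.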

Final locked profile parameters
- Model: Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-IQ3_M.gguf
- Context: 32,768
- GPU layers: 99 (all 65 layers on GPU)
- KV cache types: K=iq4_nl, V=iq4_nl
- Batch / UBatch: 1024 / 512
- Threads: 6
- ctx-checkpoints: 0
- Reasoning budget: 0
- Parallel: 1
- Flash attention: on
- Launcher script: scripts/start_quality_locked.sh
- Port: 8004
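The post does not show what scripts/start_quality_locked.sh contains, so the following is only a sketch of how the profile above might map onto a server command line. It assumes the launcher wraps llama.cpp's `llama-server`, which the post does not state. The long flag names are the ones I know from recent `llama-server` builds; `--ctx-checkpoints` and `--reasoning-budget` in particular may be missing from older builds, and `--flash-attn` was a bare switch before it took an on/off value, so check `llama-server --help` on your build. The `models` directory is a hypothetical path.

```python
import shlex

# Values taken from the post's final locked profile.
# MODEL_DIR is a hypothetical location; the post does not say where the GGUF lives.
MODEL_DIR = "models"
PROFILE = {
    "--model": f"{MODEL_DIR}/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive-IQ3_M.gguf",
    "--ctx-size": 32768,
    "--n-gpu-layers": 99,
    "--cache-type-k": "iq4_nl",
    "--cache-type-v": "iq4_nl",
    "--batch-size": 1024,
    "--ubatch-size": 512,
    "--threads": 6,
    "--ctx-checkpoints": 0,
    "--reasoning-budget": 0,
    "--parallel": 1,
    "--flash-attn": "on",
    "--port": 8004,
}


def build_command(profile, binary="llama-server"):
    """Flatten a {flag: value} mapping into an argv list (binary name is an assumption)."""
    argv = [binary]
    for flag, value in profile.items():
        argv.extend([flag, str(value)])
    return argv


if __name__ == "__main__":
    # Print a copy-pasteable shell command; nothing is executed here.
    print(shlex.join(build_command(PROFILE)))
```

Keeping the profile in one mapping makes it easy to flip a single setting, for example ctx-checkpoints back to 8, and rerun the benchmark to confirm which change is responsible for the prompt-speed difference.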

submitted by /u/ailee43