
I need help with testing my llama.cpp Deepseek Sparse Attention (DSA) implementation (someone GPU-rich)

Reddit r/LocalLLaMA / 3/20/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author has an initial proof-of-concept llama.cpp Deepseek Sparse Attention (DSA) implementation and seeks a full benchmark to verify it works correctly.
  • They note that the performance difference between dense and sparse attention is subtle and only visible on very complex problems, making benchmarking essential.
  • They require access to a machine with at least 768 GB of VRAM for a few hours to run lineage-bench on DeepSeek V3.2 Speciale in Q8_0 with their llama.cpp deepseek-dsa branch, and to compare the results against their sglang fp8 tests; the GGUFs are already prepared.
  • They attempted to use Vast.ai with 8x RTX PRO 6000 but encountered CUDA OOM errors fitting indexer tensors, indicating the need for more time or more powerful hardware.
  • Access may be provided either directly or via a human proxy who runs the benchmarks on their behalf.

I have an initial proof-of-concept implementation ready and now I want to confirm that it works correctly. Unfortunately, the difference in model performance between dense and sparse attention is subtle and visible only on very complex problems. Basically, you need a full benchmark run to make sure the implementation works correctly. I can't do it on my Epyc 9374F + RTX PRO 6000 workstation, as it would take hundreds of hours.

What I need is access to a machine with at least 768 GB of VRAM (or more) for a few hours to run lineage-bench (either a full run or a limited lineage-256/lineage-512 run) on DeepSeek V3.2 Speciale in Q8_0 with my llama.cpp deepseek-dsa branch, with both dense and sparse attention, and compare the results with my sglang fp8 tests. Access can be either direct or via a human proxy. I have the GGUFs ready.
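For anyone offering hardware, the workflow would be roughly the following sketch. The repository URL, branch usage, model path, and any sparse-attention toggle are assumptions on my part; only the standard llama.cpp build steps and `llama-server` flags shown are stock, and the exact lineage-bench invocation would come from its own README.

```shell
# Sketch only: paths, the fork URL, and the sparse-attention toggle are
# assumptions; adjust to the actual deepseek-dsa branch and machine layout.

# 1. Build the deepseek-dsa branch with CUDA enabled (standard llama.cpp build)
git clone --branch deepseek-dsa https://github.com/fairydreaming/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# 2. Serve the Q8_0 GGUF, offloading all layers to the GPUs.
#    Run once as the dense-attention baseline, then again with the
#    branch's sparse-attention (DSA) mode enabled for comparison.
./build/bin/llama-server \
    -m /models/DeepSeek-V3.2-Speciale-Q8_0.gguf \
    -ngl 99 -c 65536 \
    --host 0.0.0.0 --port 8080

# 3. Point lineage-bench (full run, or limited lineage-256/lineage-512)
#    at the OpenAI-compatible endpoint on port 8080 and collect scores
#    for both runs; see the lineage-bench README for its exact CLI.
```

The point of the two passes is that only a full benchmark run surfaces the subtle dense-vs-sparse quality difference; spot-checking a few prompts is not enough.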

I tried to do it on a rented vast.ai 8x RTX PRO 6000 instance, but had problems fitting the model together with the indexer tensors on that configuration (CUDA OOM errors). So either more time to research this or more powerful hardware is needed, and I feel that I have already burned enough money on this.

submitted by /u/fairydreaming