I have an initial proof-of-concept implementation ready and now I want to confirm that it works correctly. Unfortunately, the difference in model performance between dense and sparse attention is subtle and only shows up on very complex problems, so a full benchmark run is basically required to verify the implementation. I can't do it on my Epyc 9374F + RTX PRO 6000 workstation, as it would take hundreds of hours.
What I need is access to a machine with at least 768 GB of VRAM for a few hours, to run lineage-bench (either a full run or a limited lineage-256/lineage-512 run) on DeepSeek V3.2 Speciale in Q8_0 on my llama.cpp deepseek-dsa branch, with both dense and sparse attention, and to compare the results with my sglang fp8 tests. Access may be direct or via a human proxy. I have the GGUFs ready.
I tried to do it on a rented 8x RTX PRO 6000 instance on vast.ai, but had problems fitting the model together with the indexer tensors on that configuration (CUDA OOM errors). So either more time to research this or more powerful hardware is needed - and I feel I've already burned enough money on this.
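For context, a multi-GPU launch along these lines is the kind of thing that runs out of memory - note that the model path, split ratios, and context size below are placeholders for illustration, not my exact command:

```shell
# Sketch of spreading a ~700 GB Q8_0 model across 8 GPUs with llama.cpp.
# Filename and numbers are hypothetical; tune per your setup.
./llama-server \
  -m DeepSeek-V3.2-Speciale-Q8_0-00001-of-000XX.gguf \
  -ngl 99 \
  --tensor-split 1,1,1,1,1,1,1,1 \
  -c 8192
```

An even `--tensor-split` leaves no headroom on GPU 0, which also holds extra buffers; skewing the ratios (e.g. a smaller share on the first GPU) or shrinking the context can sometimes get a borderline configuration to fit, but with the indexer tensors on top of Q8_0 weights the margin on 8x96 GB is razor thin.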




