This showed up on GitHub a couple of days ago. Note that the ANE (Apple Neural Engine) is the NPU in all Apple Silicon, not the new 'Neural Accelerator' GPU cores that are only in the M5.
(ggml-org/llama.cpp#10453) - Comment by arozanov
Built a working ggml ANE backend. Dispatches MUL_MAT to ANE via private API.
M4 Pro results:
4.0 TFLOPS peak at N=256, 16.8x faster than CPU
MIL-side transpose, kernel cache, quantized weight support
ANE for prefill (N>=64), Metal/CPU for decode
Code: https://github.com/arozanov/ggml-ane
Based on maderix/ANE bridge.
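
To make the prefill/decode split in the comment concrete, here is a minimal sketch of a threshold-based dispatch rule. It is not code from ggml-ane: the names `pick_backend`, `Backend`, and `PREFILL_MIN_TOKENS` are hypothetical, and it assumes N is the number of tokens in the batch. Only the MUL_MAT op and the N>=64 cutoff come from the comment.

```python
from enum import Enum

# Cutoff quoted in the comment ("ANE for prefill (N>=64)").
# Assumption: N is the number of tokens processed in one batch.
PREFILL_MIN_TOKENS = 64


class Backend(Enum):
    ANE = "ane"
    METAL_OR_CPU = "metal_or_cpu"


def pick_backend(op: str, n_tokens: int) -> Backend:
    """Hypothetical dispatch rule, not the project's actual code.

    Large-batch MUL_MAT (prefill) goes to the ANE; everything else,
    including single-token decode, stays on Metal/CPU.
    """
    if op == "MUL_MAT" and n_tokens >= PREFILL_MIN_TOKENS:
        return Backend.ANE
    return Backend.METAL_OR_CPU


if __name__ == "__main__":
    for n in (1, 63, 64, 256):
        print(n, pick_backend("MUL_MAT", n).value)
```

The point of a rule like this is that decode has N=1 per step, so there is too little work per call to pay off dispatching to a separate accelerator, while prefill batches are large enough to benefit.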



