| TL;DR: My last post about testing TinyGPU attracted some interest. This is the follow-up. The Blackwell card is detected and the driver loads, but NVIDIA's GSP firmware fails to boot through TB5 (known issue, I'm working with tinygrad on it). While debugging that, I went down a rabbit hole and discovered that Apple's RDMA subsystem accepts Metal GPU buffers for zero-copy network transfers, something nobody has documented. I also found hidden

The setup (for those who missed the last post)

I'm running a 4-node Mac cluster (3x M3 Ultra + M5 Max MacBook Pro, ~1.5TB unified memory total) connected via Thunderbolt 5 with JACCL RDMA for distributed inference. I just got an RTX PRO 5000 Blackwell 72GB in a Razer Core X V2 and plugged it in to test TinyGPU.

What happened with the Blackwell card

The card is detected. macOS sees it on PCIe (link up, x4 @ 16 GT/s, 80 Gb/s TB5). TinyGPU's DriverKit extension loads and matches. BAR0 MMIO is mapped, so I can read and write GPU registers. But NVIDIA's GSP firmware fails during initialization: I decoded the NOCAT error records and found

But here's what I found while debugging

While researching whether NVIDIA eGPU VRAM could eventually participate in RDMA transfers, I tested what memory types

Memory type validation results
Triple-registered buffer: zero-copy proven

I created a single 64MB

One buffer, three consumers, zero copies. Apple GPU writes are immediately visible to the RDMA subsystem because they're the same physical pages. This means:

Hidden ibv_reg_dmabuf_mr: Apple compiled it but hid it

Using
```
ibv_reg_dmabuf_mr (0x4EC8)
→ vtable dispatch
→ mlx5_reg_dmabuf_mr (libmlx5)
→ allocates MR struct, forwards all 6 args
→ ibv_cmd_reg_dmabuf_mr
→ builds 0x130-byte ioctl command struct
→ execute_ioctl
→ SENDS DIRECTLY TO THE KERNEL
```

Apple built and ships a complete DMA-BUF RDMA memory registration pipeline, from userspace through the Mellanox provider to a kernel ioctl. The only remaining question is whether `IORDMAFamily.kext` accepts or rejects the command.

Why this matters

Zero-copy GPU → RDMA is real on macOS. Metal compute results can be sent to remote cluster nodes without any intermediate copies. JACCL/MLX could leverage this for faster tensor parallelism.

The Hardware
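Going back to the `ibv_reg_dmabuf_mr` call chain above: whether the kernel accepts the registration can only be seen by actually calling the function. Below is a minimal sketch of how one might look the symbol up at runtime. It assumes Apple's build keeps the upstream rdma-core prototype (six arguments, which matches the "forwards all 6 args" step), and that the symbol is reachable through `dlsym`. A symbol that is compiled in but not exported will not be found this way and would need another route. The helper names `probe_symbol` and `find_reg_dmabuf_mr` are made up for illustration, and the library path is left to the caller because the post does not give one.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque forward declarations; the real definitions live in <infiniband/verbs.h>. */
struct ibv_pd;
struct ibv_mr;

/* Upstream rdma-core prototype for ibv_reg_dmabuf_mr.
 * ASSUMPTION: Apple's libibverbs build uses the same six-argument signature. */
typedef struct ibv_mr *(*reg_dmabuf_mr_fn)(struct ibv_pd *pd, uint64_t offset,
                                           size_t length, uint64_t iova,
                                           int fd, int access);

/* Hypothetical helper: resolve `symbol` from the library at `lib_path`.
 * Passing NULL for lib_path searches the images already loaded into the process.
 * Returns NULL if the library cannot be opened or the symbol is not exported. */
void *probe_symbol(const char *lib_path, const char *symbol) {
    void *handle = dlopen(lib_path, RTLD_LAZY);
    if (handle == NULL) {
        return NULL; /* library missing or not loadable */
    }
    /* The handle is intentionally kept open so the returned address stays valid. */
    return dlsym(handle, symbol);
}

/* Hypothetical helper: typed lookup of ibv_reg_dmabuf_mr in the given library. */
reg_dmabuf_mr_fn find_reg_dmabuf_mr(const char *lib_path) {
    return (reg_dmabuf_mr_fn)probe_symbol(lib_path, "ibv_reg_dmabuf_mr");
}
```

If `find_reg_dmabuf_mr` returns a non-NULL pointer, the next step would be to call it with a real protection domain and buffer descriptor and inspect the return value and `errno`. What macOS would accept in place of a Linux dma-buf file descriptor is exactly the open question, so the sketch stops at the lookup.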
Test code

All test programs are Objective-C, compiled with:

Note:

Where I need help

I'm going after this from multiple angles but there's more here than one person can cover. If any of this is in your wheelhouse:

1. TinyGPU GSP firmware init on TB5 (tinygrad#15843)

The

The bigger picture

Apple builds capabilities, uses them internally, and hides them from public APIs. The question is whether ibv_reg_dmabuf_mr is functional or dead code, and that's a Ghidra session away from being answered.

Here's why this matters for everyone, not just people with clusters: If GPUDirect RDMA works on macOS, any Mac with Thunderbolt becomes a hybrid AI workstation. Plug an NVIDIA GPU into your Mac via a $200 eGPU enclosure and the GPU's VRAM becomes part of your Mac's memory pool, accessible to Metal, to RDMA, to your inference stack, with zero-copy transfers. Your Mac's 128GB/256GB/512GB unified memory + the GPU's 24/48/72GB GDDR7, all working together. No Linux box. No separate PC. One cable.

Right now TinyGPU lets you run CUDA compute on a Mac. What we're trying to prove is that the GPU's memory can also participate in Apple's RDMA network, meaning multi-Mac clusters can share NVIDIA VRAM across nodes. ~1.5TB of unified memory + 72GB GDDR7, all RDMA-capable, on hardware you can buy today.

This is a follow-up to my TinyGPU testing post. All test programs (Objective-C, ~50 lines each) and research notes are available; happy to share the repo if there's interest. Also posted NOCAT decode findings on tinygrad#15843 if you want to help debug the TB5 GSP init. |
Follow-up: Trying to make NVIDIA GPUs plug-and-play on Macs. Found hidden RDMA symbols Apple doesn't want you to see — zero-copy GPU memory sharing might already work.
Reddit r/LocalLLaMA / 5/7/2026
Key Points
- The author's Mac detects an NVIDIA RTX PRO 5000 Blackwell GPU connected over Thunderbolt 5, but NVIDIA’s GSP firmware fails to initialize because of an RDMA/peer-visibility issue (“FBFLCN UNRECOGNIZED_CLIENT”), reported as a known tinygrad/macOS TB5 enclosure problem affecting NVIDIA GPUs.
- During debugging, the author discovered undocumented Apple RDMA capabilities that can accept Metal GPU buffers for zero-copy network transfers, potentially enabling more efficient distributed inference workflows on macOS.
- The author also found hidden `ibv_reg_dmabuf_mr` symbols in Apple’s `libibverbs`, suggesting that GPUDirect RDMA-style functionality might be achievable on macOS without kernel changes.
- The post lays out what the author has found and explicitly requests help from the community—especially those familiar with NVIDIA GSP firmware initialization and macOS RDMA internals—to make plug-and-play NVIDIA GPU networking on Macs feasible.
- The overall direction is that while the Blackwell card faces a firmware/enclosure compatibility blocker, the underlying macOS RDMA/Metal plumbing might already support zero-copy GPU memory sharing in principle.