Follow-up: Trying to make NVIDIA GPUs plug-and-play on Macs. Found hidden RDMA symbols Apple doesn't want you to see — zero-copy GPU memory sharing might already work.

Reddit r/LocalLLaMA / 5/7/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author successfully detects an NVIDIA RTX PRO 5000 Blackwell GPU connected to a Mac over Thunderbolt 5, but NVIDIA’s GSP firmware fails to initialize due to an RDMA/peer-visibility issue (“FBFLCN UNRECOGNIZED_CLIENT”), which is reported as a known Tinygrad/macOS TB5 enclosure problem affecting NVIDIA GPUs.
  • During debugging, the author discovered undocumented Apple RDMA capabilities that can accept Metal GPU buffers for zero-copy network transfers, potentially enabling more efficient distributed inference workflows on macOS.
  • The author also found hidden `ibv_reg_dmabuf_mr` symbols in Apple’s `libibverbs`, suggesting that GPUDirect RDMA-style functionality might be achievable on macOS without kernel changes.
  • The post lays out what the author has found and explicitly requests help from the community—especially those familiar with NVIDIA GSP firmware initialization and macOS RDMA internals—to make plug-and-play NVIDIA GPU networking on Macs feasible.
  • The overall direction is that while the Blackwell card faces a firmware/enclosure compatibility blocker, the underlying macOS RDMA/Metal plumbing might already support zero-copy GPU memory sharing in principle.

TL;DR: My last post about testing TinyGPU attracted some interest. This is the follow-up. The Blackwell card is detected and the driver loads, but NVIDIA's GSP firmware fails to boot through TB5 (known issue, I'm working with tinygrad on it). While debugging that, I went down a rabbit hole and discovered that Apple's RDMA subsystem accepts Metal GPU buffers for zero-copy network transfers — something nobody has documented. I also found hidden ibv_reg_dmabuf_mr symbols in Apple's libibverbs that suggest GPUDirect RDMA might be possible on macOS without any kernel modification. Here's everything I found and where I need help.


The setup (for those who missed the last post)

I'm running a 4-node Mac cluster (3x M3 Ultra + M5 Max MacBook Pro, ~1.5TB unified memory total) connected via Thunderbolt 5 with JACCL RDMA for distributed inference. I just got an RTX PRO 5000 Blackwell 72GB in a Razer Core X V2 and plugged it in to test TinyGPU.

What happened with the Blackwell card

The card is detected. macOS sees it on PCIe (link up, x4 @ 16 GT/s, 80 Gb/s TB5). TinyGPU's DriverKit extension loads and matches. BAR0 MMIO is mapped — I can read and write GPU registers. But NVIDIA's GSP firmware fails during initialization:

RuntimeError: RPC call 4097 failed with result 101 

I decoded the NOCAT error records and found FBFLCN UNRECOGNIZED_CLIENT — the GPU's memory fabric doesn't recognize the requesting PCIe peer through the TB5 tunnel. This is a known issue affecting all NVIDIA GPUs on TB5 enclosures (tinygrad#15843). AMD GPUs work fine through the same enclosures. I've posted my NOCAT decode findings on the issue — would love to collaborate with the tinygrad team or anyone who's worked on NVIDIA GSP firmware init to get this fixed.

But here's what I found while debugging

While researching whether NVIDIA eGPU VRAM could eventually participate in RDMA transfers, I tested what memory types ibv_reg_mr() actually accepts on macOS. The results were surprising.

Memory type validation results

| Memory Source | ibv_reg_mr | Expected? |
| --- | --- | --- |
| `malloc()` | FAIL | Unexpected — works on Linux |
| `posix_memalign()` | FAIL | Unexpected — page-aligned but still fails |
| `mmap(MAP_ANON)` | PASS | Expected |
| `IOSurfaceGetBaseAddress()` | PASS | No documentation on this anywhere |
| `MTLBuffer.contents` (Metal shared) | PASS | No documentation on this anywhere |
Apple's RDMA implementation validates VM-mapping type, not physical backing. Heap allocations (malloc/posix_memalign) fail. VM-mapped memory (mmap, IOSurface, Metal buffers) passes. This is different from Linux where ibv_reg_mr accepts any pinnable memory.

Triple-registered buffer — zero-copy proven

I created a single 64MB mmap buffer and registered it three ways simultaneously:

```
void *buf = mmap(NULL, 64*1024*1024, PROT_READ|PROT_WRITE,
                 MAP_ANON|MAP_PRIVATE, -1, 0);

// 1. RDMA Memory Region
struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
    IBV_ACCESS_REMOTE_READ);                         // PASS, lkey=0x101

// 2. Metal GPU buffer (zero-copy, same physical pages)
id<MTLBuffer> metalBuf = [gpu newBufferWithBytesNoCopy:buf
                                                length:size
                                               options:MTLResourceStorageModeShared
                                           deallocator:nil];  // PASS

// 3. Cross-consumer write test
((float *)metalBuf.contents)[0] = 99.99f;   // Write via Metal
assert(((float *)mr->addr)[0] == 99.99f);   // Read via RDMA — PASS, same memory
```

One buffer, three consumers, zero copies. Apple GPU writes are immediately visible to the RDMA subsystem because they're the same physical pages. This means:

```
Apple GPU compute → [writes to shared buffer] → JACCL RDMA sends to remote node
                            ↑ zero copy between these two ↑
```

Hidden ibv_reg_dmabuf_mr — Apple compiled it but hid it

Using dyld_info -exports on the dyld shared cache, I found symbols Apple compiled into libibverbs.dylib but deliberately excluded from the SDK headers:

```
ibv_reg_dmabuf_mr            offset 0x4EC8    EXPORTED but NOT in <infiniband/verbs.h>
ibv_cmd_reg_dmabuf_mr        offset 0x43E4    EXPORTED but NOT in headers
darwin_mmap_region_extended  offset 0x75A0    Apple custom — not in upstream rdma-core
mlx5_reg_dmabuf_mr           offset 0x2CEA0   In libmlx5.dylib — Mellanox provider too
```

`ibv_reg_dmabuf_mr` is the function Linux uses for GPUDirect RDMA (registering GPU VRAM as RDMA memory regions). I disassembled it and **it's not a stub — it's fully functional code:**

```
ibv_reg_dmabuf_mr (0x4EC8)
  → vtable dispatch
  → mlx5_reg_dmabuf_mr (libmlx5)   — allocates MR struct, forwards all 6 args
  → ibv_cmd_reg_dmabuf_mr          — builds 0x130-byte ioctl command struct
  → execute_ioctl                  — SENDS DIRECTLY TO THE KERNEL
```

Apple built and ships a complete DMA-BUF RDMA memory registration pipeline — from userspace through the Mellanox provider to a kernel ioctl. The only remaining question is whether `IORDMAFamily.kext` accepts or rejects the command.

Why this matters

  • Zero-copy GPU → RDMA is real on macOS. Metal compute results can be sent to remote cluster nodes without any intermediate copies. JACCL/MLX could leverage this for faster tensor parallelism.
  • The ibv_reg_mr validation pattern (VM-mapped = pass, heap = fail) has implications for eGPU RDMA. TinyGPU's DriverKit driver maps NVIDIA GPU BAR1 memory via IOMemoryDescriptor, which creates a VM mapping — the same type that passes ibv_reg_mr. This suggests GPUDirect RDMA between NVIDIA eGPU VRAM and the TB5 RDMA controller might work on macOS without any kernel modification. (Currently blocked by a separate TinyGPU GSP firmware init issue on TB5 enclosures — see tinygrad/tinygrad#15843.)
  • The hidden ibv_reg_dmabuf_mr suggests Apple is building toward device memory RDMA. They compiled it; they just haven't exposed it yet.

Hardware

  • 3x Mac Studio M3 Ultra (512GB + 512GB + 256GB = 1.28TB unified memory)
  • Thunderbolt 5 RDMA mesh via JACCL
  • Distributed inference baseline: DeepSeek-V4-Flash 151GB at 30 tok/s across 2 nodes
  • RTX PRO 5000 Blackwell 72GB in Razer Core X V2 (connected, detected, TinyGPU driver loaded — but NVIDIA GSP firmware fails to init through TB5, separate issue being tracked)

Test code

All test programs are Objective-C, compiled with:

clang -framework Foundation -framework Metal -framework IOSurface -lrdma -o test test.m 

Note: ibv_reg_mr on macOS requires an active RDMA device (rdma_en3/4/5, not rdma_en2 which may be PORT_DOWN). Use ibv_devinfo to check port state.

Where I need help

I'm going after this from multiple angles but there's more here than one person can cover. If any of this is in your wheelhouse:

1. **TinyGPU GSP firmware init on TB5 (tinygrad#15843).** The FBFLCN UNRECOGNIZED_CLIENT error during GSP boot suggests the GPU's memory fabric doesn't understand the TB5 PCIe topology. If you've worked on NVIDIA GSP firmware, open-gpu-kernel-modules, or PCIe tunneling — the NOCAT decode method I used (patching NVRpcQueue.read_resp to extract ASCII from POST_NOCAT_RECORD events) might help you dig deeper.

2. **Ghidra analysis of ibv_reg_dmabuf_mr on macOS.** The function is at offset 0x4EC8 in libibverbs.dylib (dyld shared cache). Does it call execute_ioctl (real kernel path) or return ENOSYS (dead stub)? I have GhidraMCP set up and ready to go, but if anyone has already disassembled Apple's RDMA stack, that would save days.

3. **Has anyone tested ibv_reg_mr with device-mapped memory on macOS?** The validation pattern I found (VM-mapped = pass, heap = fail) suggests PCIe BAR memory might pass too, since DriverKit BAR mappings create VM-mapped IOMemoryDescriptor regions. If you have any eGPU working on macOS (even AMD via TinyGPU), try calling ibv_reg_mr on the BAR1-mapped pointer. If it returns non-NULL, that's GPUDirect RDMA on macOS.

4. **darwin_mmap_region_extended — what does "extended" mean?** This is Apple's custom addition to rdma-core at offset 0x75A0, not in upstream. The non-extended darwin_mmap_region exists too. If you've done any RE on Apple's RDMA stack, what extra parameters does the extended version accept?

The bigger picture

Apple builds capabilities, uses them internally, and hides them from public APIs. The question is whether ibv_reg_dmabuf_mr is functional or dead code, and that's a Ghidra session away from being answered.

Here's why this matters for everyone, not just people with clusters: if GPUDirect RDMA works on macOS, any Mac with Thunderbolt becomes a hybrid AI workstation. Plug an NVIDIA GPU into your Mac via a $200 eGPU enclosure and the GPU's VRAM becomes part of your Mac's memory pool — accessible to Metal, to RDMA, to your inference stack, with zero-copy transfers. Your Mac's 128GB/256GB/512GB unified memory + the GPU's 24/48/72GB GDDR7, all working together. No Linux box. No separate PC. One cable.

Right now TinyGPU lets you run CUDA compute on a Mac. What we're trying to prove is that the GPU's memory can also participate in Apple's RDMA network — meaning multi-Mac clusters can share NVIDIA VRAM across nodes. ~1.5TB of unified memory + 72GB GDDR7, all RDMA-capable, on hardware you can buy today.

This is a follow-up to my TinyGPU testing post. All test programs (Objective-C, ~50 lines each) and research notes available — happy to share the repo if there's interest. Also posted NOCAT decode findings on tinygrad#15843 if you want to help debug the TB5 GSP init.

submitted by /u/Street-Buyer-2428