[P] Inferencing Llama3.2-1B-Instruct on 3xMac Minis M4 with Data Parallelism using allToall architecture! | smolcluster

Reddit r/MachineLearning / 3/22/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • Demonstrates data-parallel inference of Llama3.2-1B-Instruct across 3 Mac Minis (M4, 16 GB RAM each) using a self-built allToall architecture.
  • In this approach, every worker exchanges activations with every other worker and averages them before decoding; each worker holds a full copy of the model, so this targets workloads where the data, not the model, exceeds a single device.
  • The architecture allows any worker to communicate directly with others, unlike a master-worker setup where communication is channeled through a server.
  • The setup uses 3 Mac Minis and Thunderbolt 4 cables, with the implementation and instructions available on GitHub.

Here's another sneak peek at inference of the Llama3.2-1B-Instruct model on 3x Mac Mini M4 (16 GB each) with smolcluster!

Today's demo is my Data Parallelism implementation using an allToall architecture, written from scratch using only socket libraries for communication.

  • Data parallelism shards the data across many GPUs, but each GPU holds a full copy of the model. It's used when the data, not the model, doesn't fit on a single GPU.
  • I went for an allToall architecture where each worker is connected to every other worker. For inference, all the workers send their activations to each other and take a simple arithmetic average of all the activations before decoding starts.
  • That means you can pick any of the workers and chat with it directly, unlike in a master-worker setup where you can only communicate through the server.
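To make the allToall exchange concrete, here's a minimal sketch of N workers that each connect to every other worker over plain TCP sockets, swap an activation vector, and take the arithmetic mean. This is my illustration, not smolcluster's actual code: the port scheme, message framing, and helper names (`send_array`, `recv_array`) are assumptions, and local threads stand in for the three Thunderbolt-linked machines.

```python
# Sketch: all-to-all activation averaging over raw TCP sockets (hypothetical,
# not smolcluster's implementation). Each pair of workers shares one connection.
import socket
import struct
import threading
import numpy as np

N_WORKERS = 3
BASE_PORT = 9600  # illustrative local ports standing in for Thunderbolt links

def recv_exact(sock, n):
    """Read exactly n bytes from a socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf

def send_array(sock, arr):
    """Length-prefixed send of a float32 array."""
    data = arr.astype(np.float32).tobytes()
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_array(sock):
    """Receive one length-prefixed float32 array."""
    (n,) = struct.unpack("!I", recv_exact(sock, 4))
    return np.frombuffer(recv_exact(sock, n), dtype=np.float32)

def worker(rank, activation, results):
    # Each worker listens on its own port, accepts connections from
    # lower-ranked peers, and dials higher-ranked peers, so every pair
    # of workers ends up with exactly one direct connection.
    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", BASE_PORT + rank))
    server.listen(N_WORKERS)
    peers = []

    def accept_loop():
        for _ in range(rank):  # one incoming link per lower-ranked peer
            conn, _ = server.accept()
            peers.append(conn)

    t = threading.Thread(target=accept_loop)
    t.start()
    for r in range(rank + 1, N_WORKERS):  # dial every higher-ranked peer
        s = socket.socket()
        while True:
            try:
                s.connect(("127.0.0.1", BASE_PORT + r))
                break
            except ConnectionRefusedError:
                pass  # peer's server not up yet; retry
        peers.append(s)
    t.join()

    # allToall step: send our activation to every peer, receive theirs,
    # then take the simple arithmetic average described in the post.
    for p in peers:
        send_array(p, activation)
    received = [recv_array(p) for p in peers]
    results[rank] = (activation + sum(received)) / N_WORKERS

    for p in peers:
        p.close()
    server.close()

# Dummy "activations": worker r holds a vector of all r's, so the mean is 1.0.
activations = [np.full(4, float(r), dtype=np.float32) for r in range(N_WORKERS)]
results = {}
threads = [threading.Thread(target=worker, args=(r, activations[r], results))
           for r in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0])  # every worker now holds the same averaged activation
```

Because each worker ends the exchange with the identical averaged activation, any of them can decode and serve the chat, which is exactly why no central server is needed.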

That's it for the basic theory of DP for inferencing with an allToall architecture!

Setup:

  • 3x Mac Mini (2025, M4, 16 GB RAM each)
  • Thunderbolt 4 cables

Github

submitted by /u/East-Muffin-6472