Here's another sneak peek at inference of the Llama3.2-1B-Instruct model on 3x Mac Mini M4 (16 GB each) with smolcluster!
Today's demo is my data-parallelism implementation using an all-to-all architecture, written from scratch using only socket libraries for communication.
- Data parallelism splits the data across many GPUs, but each GPU holds a full copy of the model. It's used when the data doesn't fit on a single GPU.
- I went for an all-to-all architecture where each worker is connected to every other worker. For inference, every worker sends its activations to all the others, and each one takes a simple arithmetic average of all the activations before decoding starts (see the sketch after this list).
- That means you can pick any of the workers and chat with it directly, unlike in a master-worker setup where you can only communicate with the server.
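To make the all-to-all exchange concrete, here's a minimal Python sketch of what one averaging round could look like over raw TCP sockets. This is not the actual smolcluster code: the function names (`all_to_all_average`, `send_array`), the pickle-over-TCP framing, and the port handling are all my own assumptions for illustration.

```python
# Hypothetical sketch of an all-to-all activation exchange over raw TCP
# sockets -- assumed names and framing, not smolcluster's actual code.
import pickle
import socket
import struct
import threading
import time

import numpy as np


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf


def send_array(sock: socket.socket, arr: np.ndarray) -> None:
    """Send one numpy array as a length-prefixed pickle."""
    payload = pickle.dumps(arr)
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_array(sock: socket.socket) -> np.ndarray:
    """Receive one length-prefixed pickled numpy array."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, length))


def all_to_all_average(my_activations: np.ndarray, my_port: int,
                       peers: list[tuple[str, int]]) -> np.ndarray:
    """Swap activations with every peer, return the arithmetic mean."""
    received: list[np.ndarray] = []

    def serve() -> None:
        # Accept one inbound connection per peer, collect their activations.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("0.0.0.0", my_port))
            srv.listen(len(peers))
            for _ in peers:
                conn, _ = srv.accept()
                with conn:
                    received.append(recv_array(conn))

    server = threading.Thread(target=serve)
    server.start()

    # Push our activations to every other worker, retrying while their
    # server sockets come up.
    for host, port in peers:
        for _attempt in range(10):
            try:
                with socket.create_connection((host, port), timeout=5) as s:
                    send_array(s, my_activations)
                break
            except OSError:
                time.sleep(0.5)
        else:
            raise ConnectionError(f"could not reach {host}:{port}")

    server.join()
    # Every worker computes the same elementwise average.
    return np.mean([my_activations, *received], axis=0)
```

Each Mac Mini would run this with its own port and the other two workers' addresses in `peers`. Since everyone averages the same set of tensors, all three workers end up with identical activations, which is why any of them can handle decoding.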
That's it for the basic theory of DP for inference with an all-to-all architecture!
Setup:
- 3x Mac Mini M4 (2025), 16 GB RAM each
- Thunderbolt 4 cables