Article excerpt:
With a single PCIe card — powered by six HTX301 chips and 384 GB of memory — enterprises can now run 700B-parameter model inference locally at just ~240W per card.
The HTX301 targets the memory-bandwidth-intensive token generation that dominates real-world inference latency. Existing GPUs handle the compute-dense prefill; HTX301 cards handle decode, with each piece of silicon matched to its phase.
This is a really interesting approach.
The GPU handles only the prefill stage, while everything else, including the model weights and decoding, runs entirely on this card. That way you can run huge models with hundreds of billions of parameters without chasing after graphics cards with massive VRAM.
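To make the split concrete, here's a toy sketch of the prefill/decode disaggregation pattern. Everything here is illustrative: the function names and the stand-in "KV cache" are my own, not anything from the HTX301 product, and the device assignments are just comments.

```python
# Toy sketch of prefill/decode disaggregation (illustrative only; nothing
# here reflects the actual HTX301 API, which hasn't been published).

def prefill(prompt_tokens):
    # Compute-dense phase: process the whole prompt at once and build
    # the cached state. A list of tokens stands in for the KV cache.
    return list(prompt_tokens)  # would run on the GPU

def decode_step(kv_cache):
    # Memory-bandwidth-bound phase: each step reads the entire cached
    # state to emit one token. Here the "token" is just the cache length.
    token = len(kv_cache)
    kv_cache.append(token)
    return token

def generate(prompt_tokens, n_new_tokens):
    kv = prefill(prompt_tokens)                       # GPU side
    return [decode_step(kv) for _ in range(n_new_tokens)]  # decode-card side

print(generate([10, 11, 12], 4))  # [3, 4, 5, 6]
```

The point of the pattern: prefill touches every weight once per prompt (compute-bound), while decode touches every weight once *per generated token* (bandwidth-bound), so putting decode on cheap high-capacity memory is where the savings come from.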
As for how the actual product will perform in real life, we'll have to wait until early June at Computex to find out.