Asthenosphere

Dev.to / 4/4/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The document reports Asthenosphere NPU inference performance on an AMD Phoenix XDNA gen1 (AIE2) device using all 12 tiles for a complete transformer pipeline with 100% reliability.
  • It describes the transformer execution flow, including PreScale, Q/K/V projection steps, RoPE, attention, output projection, and an NPU-resident MLP/FFN path with “14 ops” and zero CPU/GPU usage during NPU compute.
  • Session averages across 7 messages show ~64.7 tokens per message, ~83 ms elapsed per message, ~3866 effective tokens/second, and ~91.8% speculative acceptance, with an average cost of 21.3 Motes per message.
  • Per-dispatch logs show wide variability in elapsed time per message (~5.4 ms to ~147.6 ms) while acceptance rates stay high (86% to 100%), suggesting speculative decoding effectiveness varies by dispatch.
  • The glossary clarifies key metrics (tok/s vs effective tok/s, Acceptance%, Dispatch, and Motes) and ties them to inference throughput, speculative decoding behavior, and internal cost accounting.

================================================================

ASTHENOSPHERE NPU INFERENCE METRICS

Hardware:
Device: AMD Phoenix XDNA gen1 (AIE2)
Tiles: 12/12 (complete transformer pipeline)
Device ID: /dev/accel/accel0
Status: ACTIVE
Reliability: 100%

Pipeline:
PreScale > Q proj > RoPE > Attention > O proj > Attn ResAdd

PreScale2 > Gate+SiLU+Up > EltMul > Down > FFN ResAdd > Score Head
14 ops, zero CPU/GPU during NPU compute
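The stage list above matches a standard pre-norm transformer block with a SwiGLU FFN. As a rough reference, here is a minimal single-head NumPy sketch of those stages on CPU; the shapes, weight names, and single-head simplification are my assumptions, and the Score Head and the NPU tiling are omitted:

```python
import numpy as np

def rmsnorm(x, w, eps=1e-6):
    # PreScale / PreScale2: RMS-normalize each row, then scale by weights
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * w

def rope(x, pos):
    # Rotary Position Embeddings: rotate even/odd feature pairs by a
    # position-dependent angle (feature dim must be even)
    d = x.shape[-1]
    inv_freq = 1.0 / 10000 ** (np.arange(0, d, 2) / d)
    ang = pos[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def silu(x):
    return x / (1.0 + np.exp(-x))

def transformer_layer(x, w, pos):
    # Attention half: PreScale > Q proj > RoPE > Attention > O proj > Attn ResAdd
    h = rmsnorm(x, w["attn_norm"])                   # PreScale
    q, k, v = h @ w["wq"], h @ w["wk"], h @ w["wv"]  # Q/K/V projection
    q, k = rope(q, pos), rope(k, pos)                # RoPE
    scores = q @ k.T / np.sqrt(q.shape[-1])          # scaled dot-product
    scores = np.where(np.tri(len(x), dtype=bool), scores, -1e30)  # causal mask
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax
    x = x + (probs @ v) @ w["wo"]                    # O proj + Attn ResAdd
    # FFN half: PreScale2 > Gate+SiLU+Up > EltMul > Down > FFN ResAdd
    h = rmsnorm(x, w["ffn_norm"])                    # PreScale2
    gated = silu(h @ w["w_gate"]) * (h @ w["w_up"])  # Gate+SiLU, Up, EltMul
    return x + gated @ w["w_down"]                   # Down + FFN ResAdd
```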

SESSION AVERAGES (7 messages)

Avg tokens/msg: 64.7
Avg elapsed/msg: 83ms
Avg eff tok/s: 3866
Avg acceptance: 91.8%
Avg cost/msg: 21.3 Motes

ALL-TIME AVERAGES (7 messages)

Avg tokens/msg: 64.7
Avg elapsed/msg: 83ms
Avg eff tok/s: 3866
Avg acceptance: 91.8%
Avg cost/msg: 21.3 Motes

PER-DISPATCH LOG (7 entries)

Time      Tokens  Dispatches  Elapsed   Eff tok/s  Accept%  Motes

16:31:41  65      12          5.4ms     11970      86%      6
16:31:38  65      12          134ms     485        87%      31
16:31:00  65      12          146.4ms   444        88%      33
16:30:48  65      12          147.6ms   440        90%      33
16:30:05  65      12          12.1ms    5356       93%      9
16:29:56  64      12          127.2ms   503        100%     30
16:29:39  64      12          8.1ms     7866       100%     7
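As a sanity check, the session averages can be recomputed from the log rows above. A plain per-row mean reproduces every figure except acceptance, which lands at 92.0% rather than the reported 91.8%, so that one may be averaged differently upstream:

```python
# Per-dispatch log entries: (tokens, elapsed_ms, eff_tok_s, accept_pct, motes)
log = [
    (65,   5.4, 11970,  86,  6),
    (65, 134.0,   485,  87, 31),
    (65, 146.4,   444,  88, 33),
    (65, 147.6,   440,  90, 33),
    (65,  12.1,  5356,  93,  9),
    (64, 127.2,   503, 100, 30),
    (64,   8.1,  7866, 100,  7),
]

n = len(log)
avg = lambda i: sum(row[i] for row in log) / n

print(f"tokens/msg:  {avg(0):.1f}")    # 64.7
print(f"elapsed/msg: {avg(1):.0f}ms")  # 83ms
print(f"eff tok/s:   {avg(2):.0f}")    # 3866
print(f"acceptance:  {avg(3):.1f}%")   # 92.0 (report says 91.8)
print(f"motes/msg:   {avg(4):.1f}")    # 21.3
```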

================================================================

GLOSSARY

NPU          Neural Processing Unit — dedicated AI accelerator chip
             on AMD Ryzen 7000/8000 series (Phoenix XDNA gen1).
             Runs inference with zero CPU/GPU usage.

Tile         One AIE2 compute core on the NPU. Each has 32KB SRAM.
             This pipeline uses all 12 available tiles.

tok/s        Tokens per second — inference throughput. A token is
             roughly 3/4 of a word. Higher = faster response.

Eff tok/s    Effective tokens per second — accounts for speculative
             decoding, where multiple candidate tokens are evaluated
             per dispatch. Higher than raw tok/s when speculation works.

Acceptance%  How often speculative candidate tokens are accepted.
             Higher = more tokens per dispatch = faster generation.
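To make these two metrics concrete, here is a toy model of speculative-decoding throughput. The draft depth k = 6 and the per-token independence assumption are mine; the log does not describe Asthenosphere's actual scheme:

```python
def expected_tokens_per_dispatch(k: int, p: float) -> float:
    # Expected tokens emitted per dispatch when k candidates are drafted
    # and each is accepted independently with probability p, plus the one
    # token the verifier always emits. (Simplified independence model.)
    return sum(p**i for i in range(1, k + 1)) + 1.0

def effective_tok_s(n_dispatches: int, k: int, p: float, elapsed_ms: float) -> float:
    # Effective throughput: total emitted tokens over wall-clock time.
    total_tokens = n_dispatches * expected_tokens_per_dispatch(k, p)
    return total_tokens / (elapsed_ms / 1000.0)

# Shape of the fastest log entry: 12 dispatches, 5.4 ms, 86% acceptance.
# With the assumed k = 6 this lands at roughly 10,000 tok/s — the same
# order of magnitude as the logged 11,970 eff tok/s.
print(round(effective_tok_s(12, 6, 0.86, 5.4)))
```

Note how sensitive the result is to p: the accepted-prefix length shrinks geometrically with each rejection, which is why the log's 86%-to-100% acceptance spread maps to such different effective throughputs.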

Dispatch     One round-trip to the NPU: the host sends data, the NPU
             runs the full 14-op pipeline across its 12 tiles, and the
             host reads the result back.

Motes        Asthenosphere's internal compute cost unit. Derived
             from inference latency, model size, and token count.
             Used for resource accounting across the persona economy.
             1 Mote ≈ 1 output token on a 3B-parameter CPU model.

RoPE         Rotary Position Embeddings — encodes token position
             information so the model knows word order.

SwiGLU       Gated activation function used in modern transformers.
             Combines a gate projection, SiLU activation, and up
             projection.

RMSNorm      Root Mean Square Normalization — stabilizes activations
             between transformer layers for training/inference quality.

XCLBIN       Compiled hardware binary loaded onto the NPU. Contains
             the tile programs, data routing, and DMA config.

================================================================
Generated: 2026-04-03T21:31:57.479Z

Asthenosphere NPU Pipeline — AMD Phoenix XDNA gen1

State: Debugging — functional, but the GUI has visual issues.
Oversight: Model information is not included in this log; an updated log format that records it will be added soon.