H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code
arXiv cs.LG / 3/13/2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- H2LooP Spark Preview presents a continual pretraining pipeline that adapts the OLMo-3-7B-a LLM to embedded systems programming, using BF16 LoRA fine-tuning on 8 NVIDIA H100 GPUs (a configuration sketch follows this list).
- The training data combines 100B tokens of repository-datasheet pairs drawn from 117 manufacturers with 23.5B curated tokens spanning 13 embedded domains, organized via a SpecMap-inspired mapping approach.
- In benchmarks, the 7B model achieves superior token accuracy across the 13 embedded domains, cutting in-domain perplexity by 70.4% and held-out repository perplexity by 66.1% (the reduction arithmetic is sketched below), and outperforms Claude Opus 4.6 and Qwen3-Coder-30B in 8 categories.
- The authors release the production training checkpoint on Hugging Face as an open-source artifact, enabling broader use by researchers and practitioners (a loading example closes this piece).
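
The announcement does not include training code, but the BF16 LoRA bullet maps onto a standard Hugging Face PEFT setup. The sketch below is an assumption-laden illustration, not the paper's released configuration: the base-model id, LoRA rank/alpha, target module names, hyperparameters, and toy corpus are all placeholders, and a real run on 8 H100s would launch through torchrun or accelerate.

```python
# Illustrative BF16 LoRA continual-pretraining setup (NOT the paper's code).
# Base-model id, LoRA rank, target modules, and hyperparameters are assumptions.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "OLMo-3-7B-a"  # placeholder; substitute the real Hugging Face id

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# LoRA adapters on the attention projections; r/alpha values are illustrative.
model = get_peft_model(
    model,
    LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    ),
)

# Stand-in for the repository-datasheet corpus: one toy firmware snippet.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

corpus = Dataset.from_dict(
    {"text": ["/* UART init, STM32 HAL */ void uart_init(void) { /* ... */ }"]}
).map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="h2loop-spark-cpt",
        bf16=True,                      # BF16 precision, as in the key points
        per_device_train_batch_size=4,  # per GPU; 8x H100 via torchrun/accelerate
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=50,
    ),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```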
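For readers skimming the numbers: a perplexity "reduction" here is the standard relative change against the base model. A minimal worked example, with made-up before/after values:

```python
# Relative perplexity reduction; the 8.5 -> 2.5 pair is invented purely to
# show how a figure like the reported 70.4% would be derived.
def ppl_reduction(base_ppl: float, adapted_ppl: float) -> float:
    """Percentage drop in perplexity relative to the base model."""
    return 100.0 * (base_ppl - adapted_ppl) / base_ppl

print(f"{ppl_reduction(8.5, 2.5):.1f}%")  # 70.6%, close to the reported in-domain figure
```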
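Since the checkpoint is on Hugging Face, trying it is a short transformers call. The repo id below is hypothetical (the announcement does not quote the exact name), as is the firmware prompt:

```python
# Hypothetical checkpoint-loading sketch; REPO_ID is a placeholder, not the
# actual released artifact name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "h2loop/spark-preview-7b"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "// Configure SPI1 on an STM32F4 for 1 MHz, mode 0\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```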