H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code
arXiv cs.LG / 3/13/2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- H2LooP Spark Preview presents a continual pretraining pipeline that adapts the OLMo-3-7B-a LLM to embedded systems programming, using BF16 LoRA training on 8 NVIDIA H100 GPUs (a configuration sketch follows this list).
- The training data combines 100B tokens of repository-datasheet pairs from 117 manufacturers, with a curated 23.5B tokens spanning 13 embedded domains via a SpecMap-inspired mapping approach.
- In benchmarks across the 13 embedded domains, the 7B model reduces in-domain perplexity by 70.4% and held-out repository perplexity by 66.1%, and outperforms Claude Opus 4.6 and Qwen3-Coder-30B on token accuracy in 8 categories (see the perplexity sketch after this list).
- The authors release the production training checkpoint on Hugging Face as an open-source artifact, enabling broader use by researchers and practitioners.
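For readers who want a concrete picture of the training setup, the following is a minimal sketch of BF16 LoRA continual pretraining using Hugging Face transformers and peft. The model identifier, target modules, and hyperparameters are illustrative assumptions, not the authors' released H2LooP recipe.

```python
# Hypothetical sketch: BF16 LoRA continual pretraining with Hugging Face peft.
# Model id, target modules, and hyperparameters are illustrative assumptions,
# not the H2LooP Spark Preview training configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_id = "allenai/OLMo-3-7B"  # placeholder id for the OLMo-3-7B-a base model

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,  # BF16 weights, as in the described setup
    device_map="auto",           # shard across available GPUs (e.g. 8x H100)
)

lora_config = LoraConfig(
    r=64,                        # assumed adapter rank
    lora_alpha=128,              # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are updated during training
```

From here, the wrapped model can be trained on the curated embedded-systems corpus with a standard causal-language-modeling objective.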
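The perplexity figures above are conventionally computed as the exponential of the mean token-level cross-entropy over an evaluation corpus. Below is a simplified, hedged sketch of that measurement; the evaluation loop and function name are illustrative assumptions, not the paper's evaluation harness.

```python
# Illustrative only: perplexity as exp(mean negative log-likelihood per token)
# over a held-out corpus, without sliding-window evaluation.
import math
import torch

def corpus_perplexity(model, tokenizer, texts, device="cuda"):
    """Compute corpus perplexity from mean token-level cross-entropy."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt").to(device)
            out = model(**enc, labels=enc["input_ids"])
            # Labels are shifted by one, so seq_len - 1 positions contribute to the loss.
            n_tokens = enc["input_ids"].size(1) - 1
            total_nll += out.loss.item() * n_tokens  # loss is mean NLL per contributing token
            total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)
```

A 70.4% reduction means the adapted model's in-domain perplexity is 29.6% of the base model's on the same evaluation set; for example, a base perplexity of 12.0 would drop to roughly 3.6.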
Related Articles
The Honest Guide to AI Writing Tools in 2026 (What Actually Works)
Dev.to
AI Cybersecurity
Dev.to
Next-Generation LLM Inference Technology: From Flash-MoE to Gemini Flash-Lite, and Local GPU Utilization
Dev.to
The Wave of Open-Source AI and Investment in Security: Trends from Qwen, MS, and Google
Dev.to