We built a system where a neural compiler takes a plain-English function description and produces a "neural program": a combination of a continuous LoRA adapter and a discrete pseudo-program. At inference time, these adapt a fixed interpreter to perform the specified task. This is well suited to "fuzzy functions": functions that are easy to describe in language but painful to implement with rigid rules, such as classifying the urgency of a message, counting the verbs in a sentence, or anything I'd otherwise reach for regular expressions for (always painful for me).

The key idea: the interpreter's weights (Qwen3 0.6B or GPT-2 124M) are never modified. All task-specific behavior comes from the compiled program. The compiler itself is a 4B LM that generates the adapter weights and pseudo-program from the spec. It was trained end-to-end on a dataset of 10 million (English description, function input, function output) examples synthesized by gpt-5.2.

Inference runs entirely locally through llama-cpp-python: the base model is shared, and the "neural programs" are LoRA adapters that can be swapped at runtime. The Qwen3 0.6B interpreter is a ~594 MB base model (GGUF Q6_K), and each compiled program (GGUF Q4_0) adds ~22 MB. It runs fast on my Mac Mini. We also trained a compiler targeting a GPT-2 124M interpreter that runs in the browser via WebAssembly with wllama (~134 MB Q8_0 base + ~5 MB per Q4_0 program); interestingly, even a model as old as GPT-2 reaches decent performance. Results on FuzzyBench show that the adapted 0.6B interpreter is on par with prompting a 32B model (at the cost that each new task requires a new compilation):
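The runtime pattern described above (one shared GGUF base, one small LoRA file per compiled program) can be sketched roughly as follows. This is my own minimal illustration, not the authors' code: the file names and the `llama_kwargs` helper are assumptions, and `lora_path` is the LoRA hook that llama-cpp-python's `Llama` constructor exposes.

```python
# Sketch: one shared base interpreter, per-task "neural programs" applied
# as LoRA adapters at load time. File names are assumptions, not the
# project's actual artifact names.
from pathlib import Path

BASE_MODEL = Path("qwen3-0.6b-q6_k.gguf")            # ~594 MB, shared by all tasks
PROGRAMS = {                                          # ~22 MB each (Q4_0)
    "urgency": Path("programs/urgency-q4_0.gguf"),
    "verb_count": Path("programs/verb-count-q4_0.gguf"),
}

def llama_kwargs(task: str) -> dict:
    """Build the llama_cpp.Llama(...) arguments for one compiled program."""
    return {
        "model_path": str(BASE_MODEL),
        "lora_path": str(PROGRAMS[task]),  # swap this to switch tasks
        "n_ctx": 2048,
    }

# Actual use (requires the model files on disk):
#   from llama_cpp import Llama
#   llm = Llama(**llama_kwargs("urgency"))
#   out = llm("Classify the urgency of: 'the server is down!'", max_tokens=8)
```

Switching tasks is then just constructing a new `Llama` with a different `lora_path`; the base model file is reused unchanged.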
You can try it yourself: [link]
Compile English function descriptions into 22MB neural programs that run locally via llama.cpp
Reddit r/LocalLLaMA / 4/16/2026
Key Points
- The researchers propose a system in which an English function specification is compiled into a "neural program" (a continuous LoRA adapter plus a discrete pseudo-program) that adapts a fixed interpreter to that task.
- The interpreter (e.g. Qwen3 0.6B or GPT-2) is never updated at inference time; all task-specific behavior comes solely from the compiled neural program.
- Training is end-to-end on roughly 10 million (English description, input, output) examples synthesized with GPT-5.2; inference runs locally via llama-cpp-python, and operation amounts to swapping per-task LoRA adapters.
- Each compiled program adds only ~22 MB in Q4_0 format, so multiple tasks can share a single Qwen3 0.6B interpreter (~594 MB) and be switched between cheaply.
- On FuzzyBench, the adapted interpreter performs on par with prompting a 32B model, with the trade-off that each new task requires recompilation.
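The size figures reported in the post imply simple storage arithmetic: the base interpreter is paid for once, and each additional fuzzy function costs only its adapter. A minimal sketch (the sizes come from the post; the helper function is mine):

```python
def total_mb(n_tasks: int, base_mb: int = 594, per_program_mb: int = 22) -> int:
    """Disk footprint of one shared base interpreter plus n compiled programs.

    Defaults use the post's figures: Qwen3 0.6B Q6_K base (~594 MB),
    ~22 MB per Q4_0 neural program.
    """
    return base_mb + n_tasks * per_program_mb

# Ten fuzzy functions share one interpreter: 594 + 10 * 22 = 814 MB total,
# rather than ten full model copies.
print(total_mb(10))  # → 814
```

The same arithmetic applies to the browser variant (~134 MB GPT-2 base + ~5 MB per program), just with smaller constants.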