Local Qwen3-0.6B INT8 as embedding backbone for an AI memory system

Reddit r/LocalLLaMA / 3/20/2026


Key Points

The author builds a local embedding backbone using Qwen3-0.6B INT8 via ONNX Runtime to power a memory lifecycle system inside Claude Code, eliminating per-operation embedding API calls.
The system uses 1024-dimensional embeddings, treats cosine similarity above 0.75 as the bar for genuine semantic relatedness, supports batch processing of 20+ entries, and makes zero embedding API calls.
To address the cold-start problem, a persistent embedding server on localhost:52525 loads the model at boot, delivering warm inference in about 12 ms per batch, roughly 250x faster than cold start.
The embeddings enable a connection graph, cluster detection (with clusters merged by an LLM), and similarity routing to the right config file. Everything runs on CPU, and the project is open source at the linked GitHub repository.

Most AI coding assistants solve the memory problem by calling an embedding API on every store and retrieve. This does not scale: at 15-25 sessions per day, it means hundreds of API calls, latency on every write, and a dependency on a service that can change its pricing at any time.

I needed embeddings for a memory lifecycle system that runs inside Claude Code. The system processes knowledge through 5 phases: buffer, connect, consolidate, route, age. Embeddings drive phases 2 through 4 (connection tracking, cluster detection, similarity routing).

Requirements: 1024-dimensional vectors; a cosine similarity above 0.75 must indicate genuine semantic relatedness; batch processing for 20+ entries; zero API calls.
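To make the 0.75 requirement concrete, here is a minimal sketch of a cosine similarity check over a batch of vectors. The function names are made up for illustration and are not taken from the repository; only the 0.75 threshold comes from the post. Short 2-d vectors stand in for the real 1024-d embeddings.

```python
import math

SIMILARITY_THRESHOLD = 0.75  # threshold stated in the post


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def related_pairs(vectors, threshold=SIMILARITY_THRESHOLD):
    """Return index pairs (i, j), i < j, whose similarity exceeds the threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) > threshold:
                pairs.append((i, j))
    return pairs
```

For a batch of 20+ entries the pairwise loop is cheap; a much larger store would probably want a vectorized or indexed comparison instead.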

I tested several models and landed on Qwen3-0.6B quantized to INT8 via ONNX Runtime. Not the obvious first pick. Sentence-transformers models seemed like the default choice, but Qwen3-0.6B at 1024d gave better separation between genuinely related entries and structural noise (session logs that share format but not topic).

The cold start problem: ONNX model loading takes ~3 seconds. For a hook-based system where every tool call can trigger an embedding check, that is not usable. Solution: a persistent embedding server on localhost:52525 that loads the model once at system boot. Warm inference: ~12ms per batch, roughly 250x faster than cold start.
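A rough sketch of the load-once idea, using only the Python standard library. Only the port number comes from the post. The JSON request shape, the handler, and `load_model` are assumptions for illustration, and `load_model` returns placeholder vectors so the sketch runs without a model file. The actual server in the repository may look quite different.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_model():
    """Stand-in for the slow ONNX model load; returns an embed function."""
    def embed_batch(texts):
        # Placeholder vectors (length, space count), NOT real embeddings.
        return [[float(len(text)), float(text.count(" "))] for text in texts]
    return embed_batch


# Loaded once when the server process starts, then reused by every request.
EMBED_BATCH = load_model()


def handle_payload(raw_body):
    """Turn a JSON request body into a JSON response body (both bytes)."""
    texts = json.loads(raw_body or b"{}").get("texts", [])
    return json.dumps({"embeddings": EMBED_BATCH(texts)}).encode("utf-8")


class EmbedHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # stay quiet so hook output is not polluted


def make_server(port=52525):
    """Bind to localhost only; call .serve_forever() on the result to run."""
    return HTTPServer(("127.0.0.1", port), EmbedHandler)
```

Running `make_server().serve_forever()` from a startup hook keeps the model resident, so each request pays only the warm inference cost. The quoted figures are consistent with each other: 3 s / 12 ms = 250.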

The server starts automatically via a startup hook. If it goes down, the system falls back to direct ONNX loading. Nothing breaks, it just gets slower.
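The fallback can be sketched as a client that tries the local server first and drops to the slow in-process path on any connection or decoding error. The `/embed` path, the JSON shape, and `embed_direct` are hypothetical. In the real system the direct path would load the ONNX model, which is replaced here by placeholder vectors.

```python
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:52525/embed"  # port from the post; path is assumed


def embed_via_server(texts, timeout=0.5):
    """Fast path: ask the already-warm local server for embeddings."""
    request = urllib.request.Request(
        SERVER_URL,
        data=json.dumps({"texts": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["embeddings"]


def embed_direct(texts):
    """Slow path stand-in: would load the model in-process (placeholder vectors)."""
    return [[float(len(text))] for text in texts]


def embed(texts, via_server=embed_via_server, direct=embed_direct):
    """Prefer the server; fall back to direct loading if it is unreachable."""
    try:
        return via_server(texts)
    except (OSError, ValueError, KeyError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError and KeyError cover malformed responses.
        return direct(texts)
```

A short timeout matters here: a hung server should cost the caller a fraction of a second before the fallback kicks in.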

What the embeddings enable:

Connection graph: new entries get linked to existing entries above 0.75 cosine similarity. Isolated entries fade over time. Connected entries survive. Expiry based on isolation, not time.
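One way to picture the linking step, as a sketch with made-up names and in-memory dicts rather than the project's actual storage: each new entry is compared against all stored vectors and gets an undirected edge to every entry above the threshold. Entries with no edges are the candidates for fading.

```python
import math

THRESHOLD = 0.75  # threshold stated in the post


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def link_new_entry(graph, vectors, new_id, new_vec, threshold=THRESHOLD):
    """Add new_id to the graph and link it to every sufficiently similar entry."""
    graph.setdefault(new_id, set())
    for other_id, other_vec in vectors.items():
        if cosine(new_vec, other_vec) > threshold:
            graph[new_id].add(other_id)
            graph[other_id].add(new_id)
    vectors[new_id] = new_vec


def isolated_entries(graph):
    """Entries with no connections: expiry candidates, regardless of age."""
    return sorted(entry for entry, neighbours in graph.items() if not neighbours)
```

How quickly an isolated entry actually fades is not specified in the post, so the sketch only identifies the candidates.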

Cluster detection: groups of 3+ connected entries get merged into proven knowledge by an LLM (Gemini Flash free tier for consolidation).
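A sketch of the detection half, assuming "groups of 3+ connected entries" means connected components of the similarity graph with at least three members. That reading is an interpretation, and the real script may define clusters differently. The LLM merge step is left out.

```python
def find_clusters(graph, min_size=3):
    """Return connected components with at least min_size entries.

    graph maps an entry id to the set of ids it is linked to (undirected).
    """
    seen = set()
    clusters = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph.get(node, set()) - component)
        seen |= component
        if len(component) >= min_size:
            clusters.append(sorted(component))
    return clusters
```

Each returned cluster would then be handed to the consolidation model as one merge request.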

Similarity routing: proven knowledge gets routed to the right config file based on embedding distance to existing content.
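Routing can be sketched as a nearest-target lookup: embed the existing content of each config file once, then send a new piece of knowledge to the file whose embedding is most similar. The file names below are invented for the example, and the real system may add a minimum score or other rules.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def route(entry_vec, target_vecs):
    """Return (target_name, score) for the most similar target.

    target_vecs maps a config file name to the embedding of its current content.
    """
    best_name = max(target_vecs, key=lambda name: cosine(entry_vec, target_vecs[name]))
    return best_name, cosine(entry_vec, target_vecs[best_name])
```

For example, with targets `{"testing.md": [1.0, 0.0], "deploy.md": [0.0, 1.0]}`, an entry embedded as `[0.8, 0.2]` lands in `testing.md`.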

All CPU, no GPU needed. The 0.6B model runs on any modern machine. Single Python script, ~2,900 lines, SQLite + ONNX.
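For the SQLite side, one plausible layout is a single table with each embedding packed as a float32 BLOB. The schema and helper names here are assumptions, not the project's actual tables.

```python
import sqlite3
import struct


def pack_vector(vec):
    """Serialize floats as little-endian float32 (4 bytes per dimension)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding BLOB NOT NULL)"
    )
    return conn


def add_entry(conn, text, vec):
    cursor = conn.execute(
        "INSERT INTO entries (text, embedding) VALUES (?, ?)",
        (text, pack_vector(vec)),
    )
    conn.commit()
    return cursor.lastrowid


def load_entries(conn):
    rows = conn.execute("SELECT id, text, embedding FROM entries ORDER BY id")
    return [(row[0], row[1], unpack_vector(row[2])) for row in rows]
```

At 1024 dimensions a float32 vector takes 4 KB per entry, so a few thousand memories fit in a database of a few megabytes.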

Open source: github.com/living0tribunal-dev/claude-memory-lifecycle

Full engineering story with threshold decisions and failure modes: "After 3,874 Memories, My AI Coding Assistant Couldn't Find Anything Useful"

Anyone else using small local models for infrastructure rather than generation? Embeddings feel like the right use case for sub-1B parameters.

submitted by /u/living0tribunal
