I built "Semvec": A Constant-Cost Semantic Memory for LLMs (Looking for testers!)

Reddit r/LocalLLaMA / 5/2/2026


Key Points

  • The author introduces Semvec, a constant-cost semantic memory approach that replaces unbounded chat history with a fixed-size semantic state plus tiered (short/medium/long-term) memory.
  • Semvec aims to keep per-call token cost and latency constant, so early and late conversation turns have the same input footprint.
  • In 48-turn benchmarks, the method reportedly reduces tokens by about 76% while retaining access to decision structure, error patterns, and earlier context.
  • It is offered as a drop-in chat proxy for OpenAI-compatible LLMs (e.g., vLLM, Ollama, OpenRouter), and includes coding-agent memory via an MCP server for Claude Code and Cursor.
  • The project solicits developer testers, especially those building RAG pipelines or chatbots, or those looking to improve memory for IDE and autonomous coding sessions.

Hey everyone,

If you build LLM applications, autonomous agents, or just use Claude/Cursor for coding, you've probably hit this wall: conversation history grows without bound, token costs explode, latency skyrockets, and eventually the LLM starts forgetting early context anyway.

To fix this, I built Semvec. It replaces unbounded conversation histories with a fixed-size semantic state combined with a tiered, content-aware memory (short/medium/long-term).

The result: the cost and latency of every LLM call stay constant. Turn 10 and turn 10,000 carry the exact same input footprint. In 48-turn benchmarks, it yields roughly a 76% token reduction while retaining structured access to decisions, error patterns, and prior context.

Here is what you get:

- Constant-size compressed context: Token-reduced LLM context that stops growing.

- Tiered memory with selective forgetting: Frequently accessed older memories outlive never-touched newer ones.

- Drop-in chat proxy: Wrap any OpenAI-compatible LLM (vLLM, Ollama, OpenRouter) and get compressed context for free (see the sketch after this list).

- Coding-agent compaction (MCP): Persistent memory across coding sessions. It comes with an MCP server for Claude Code & Cursor out of the box!

- Multi-agent coordination: semvec.cortex allows several agents to share an aggregated view and exchange state vectors.
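
To make the "drop-in" part concrete, here is a minimal sketch of what calling through the proxy could look like. This is not Semvec's documented API, just an illustration of the idea, assuming the proxy exposes an OpenAI-compatible endpoint; the URL, port, and model name below are placeholders, so check the quickstart docs for the real launch command.

```python
# Minimal sketch of the drop-in proxy idea (NOT the documented Semvec API).
# Assumption: a Semvec proxy is running locally and exposes an
# OpenAI-compatible endpoint; the address below is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical Semvec proxy address
    api_key="unused-for-local-proxy",
)

# The client code is identical to any OpenAI-compatible call; the proxy
# is what replaces the growing chat history with the fixed-size state.
response = client.chat.completions.create(
    model="my-local-model",  # whatever model the upstream backend serves
    messages=[{"role": "user", "content": "Summarize our last decision."}],
)
print(response.choices[0].message.content)
```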

I am currently looking for testers and honest feedback from devs who build RAG pipelines or chatbots, or who just want to upgrade their Claude Code or Cursor IDE memory.

📦 PyPI: https://pypi.org/project/semvec/

📚 Docs & Quickstart: https://semvec-docs.pages.dev/

You can install it via pip install semvec (supports Python 3.10–3.14).

If you want to test the multi-agent or MCP stuff, use pip install "semvec[cortex,coding]".
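
For the MCP side, Claude Code reads a project-level .mcp.json and Cursor reads .cursor/mcp.json, both using the standard mcpServers config format. The entry below is only a sketch of that registration; the launch command is a placeholder, not Semvec's actual entry point, so grab the real one from the quickstart docs:

```json
{
  "mcpServers": {
    "semvec": {
      "command": "<semvec-mcp-launch-command>",
      "args": []
    }
  }
}
```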

I'd love to hear your thoughts, feedback, and edge-case bug reports!

submitted by /u/scheitelpunk1337