Hey everyone,
if you build LLM applications, autonomous agents, or just use Claude/Cursor for coding, you've probably hit this wall: conversation history grows without bound, token costs explode, latency skyrockets, and eventually the LLM starts forgetting early context anyway.
To fix this, I built Semvec. It replaces unbounded conversation histories with a fixed-size semantic state combined with a tiered, content-aware memory (short/medium/long-term).
The result: the cost and latency of every LLM call stay constant. Turn 10 and turn 10,000 carry the exact same input footprint. In 48-turn benchmarks, Semvec yields roughly a 76% token reduction while retaining structured access to decisions, error patterns, and prior context.
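To make that concrete, here is a minimal, library-agnostic sketch of the core pattern (plain Python, not Semvec's actual API): a bounded memory store where eviction is driven by how often an entry is recalled rather than by age alone, so a frequently used old memory outlives an untouched new one.

```python
# Conceptual sketch only -- NOT Semvec's real API. Illustrates selective
# forgetting: eviction favors recall frequency over recency.
import time
from dataclasses import dataclass, field


@dataclass
class Memory:
    text: str
    created: float = field(default_factory=time.time)
    hits: int = 0  # how often this memory was retrieved


class TieredMemory:
    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self.items: list[Memory] = []

    def add(self, text: str) -> None:
        self.items.append(Memory(text))
        if len(self.items) > self.capacity:
            # Evict the entry with the lowest hits-per-second-of-age score,
            # so never-touched newer memories go before well-used older ones.
            now = time.time()
            victim = min(self.items, key=lambda m: m.hits / (now - m.created + 1.0))
            self.items.remove(victim)

    def recall(self, query: str) -> list[Memory]:
        # Toy substring match; a real system would use semantic embeddings.
        found = [m for m in self.items if query.lower() in m.text.lower()]
        for m in found:
            m.hits += 1
        return found
```

Semvec presumably layers the short/medium/long-term tiers and embedding-based retrieval on top of an idea like this; the sketch only shows the eviction logic.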
Here is what you get:
- Constant-size compressed context: Token-reduced LLM context that stops growing.
- Tiered memory with selective forgetting: Frequently accessed older memories outlive never-touched newer ones.
- Drop-in chat proxy: Wrap any OpenAI-compatible LLM (vLLM, Ollama, OpenRouter) and get compressed context for free (see the usage sketch right after this list).
- Coding-agent compaction (MCP): Persistent memory across coding sessions. It comes with an MCP server for Claude Code & Cursor out of the box!
- Multi-agent coordination: semvec.cortex allows several agents to share an aggregated view and exchange state vectors.
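Because the proxy speaks the OpenAI-compatible protocol, existing client code shouldn't need to change; you just point it at the proxy instead of the upstream provider. A hedged sketch of what that could look like (the endpoint URL, port, and model name below are placeholders of mine, not taken from the docs):

```python
# Hypothetical usage sketch -- the proxy URL and model name are placeholders;
# see the Semvec docs for the actual setup. The point: an OpenAI-compatible
# proxy means your existing client code stays the same.
from openai import OpenAI

# Point the standard OpenAI client at the local Semvec proxy; the proxy
# would compress the conversation context before forwarding upstream.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-locally")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # whatever your backend (vLLM/Ollama) serves
    messages=[{"role": "user", "content": "What did we decide about caching?"}],
)
print(response.choices[0].message.content)
```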
I am currently looking for testers and honest feedback from devs who build RAG pipelines or chatbots, or who just want to upgrade the memory of Claude Code or Cursor.
📦 PyPI: https://pypi.org/project/semvec/
📚 Docs & Quickstart: https://semvec-docs.pages.dev/
You can install it via pip install semvec (supports Python 3.10–3.14).
If you want to test the multi-agent or MCP stuff, use pip install "semvec[cortex,coding]".
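For the MCP route into Claude Code, registration typically goes through an .mcp.json entry (or claude mcp add). The server command and module path below are my guesses, not taken from the docs, so treat this as a template:

```json
{
  "mcpServers": {
    "semvec": {
      "command": "python",
      "args": ["-m", "semvec.mcp"]
    }
  }
}
```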
I'd love to hear your thoughts, feedback, and edge-case bug reports. Let me know what you think!