Hi there!
My questions are at the bottom, but first let me explain what I am trying to do and how:
For my work-in-progress offline AI assistant, I implemented a very simple memory system that stores statements ("memories") extracted from earlier chats in an SQLite database.
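For illustration, a minimal sketch of what such a memory table could look like; the schema and column names are my assumptions, not the actual implementation. Each statement is stored with its creation timestamp and serialized embedding, so both recency and similarity can be used at retrieval time:

```python
import sqlite3

# Hypothetical schema sketch for a memory store like the one described:
# each extracted statement is saved with its creation timestamp and the
# serialized embedding vector.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS memories (
        id        INTEGER PRIMARY KEY,
        created   TEXT NOT NULL,        -- ISO 8601, e.g. '2026-03-26 14:05'
        statement TEXT NOT NULL,        -- the extracted memory sentence
        embedding BLOB NOT NULL         -- raw float32 bytes of the vector
    )
    """
)
conn.execute(
    "INSERT INTO memories (created, statement, embedding) VALUES (?, ?, ?)",
    ("2026-03-26 14:05", "The user has a dog named Freddy.", b"\x00" * 4),
)
conn.commit()
row = conn.execute("SELECT statement FROM memories").fetchone()
```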
In a later chat, every time the user enters a prompt, the system extracts the most relevant of these "memories" via embedding cosine similarity comparison followed by reranking (I am currently using snowflake-arctic-embed-s Q8_0 for embeddings and bge-reranker-v2-m3 Q5_K_M for reranking).
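The retrieval step described above can be sketched as follows; this is a toy version under my own assumptions (function names and data layout are hypothetical), where the top-k candidates by cosine similarity would then be handed to the reranker:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_memories(query_vec, memories, k=3):
    """Return the k stored (text, vector) pairs most similar to the query.
    In the pipeline described, these candidates would then go to the reranker."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```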
These "memories" are then injected into the user prompt before it is sent to the LLM to get an answer.
The LLM in use is Qwen3.5 9B Q4_K_M (parameters: top-k = 40, top-p = 0.95, min-p = 0.01, temperature = 1.0, no thinking/reasoning).
Qwen3.5 9B is a BIG step up from what I was using before, but the model still sometimes struggles to differentiate between the memories and the actual user prompt / the current chat.
This causes "old" information from the injected memories to be used in the LLM's answer in the wrong way (e.g., if a friend was visiting some weeks ago, the LLM asks if we are having a great time, although it would be clear to a smarter model or to a human that the friend's visit is long over).
You can see the system prompt format and the augmented user prompt I am currently experimenting with below:
The system prompt:
A conversation with the user is requested.

### RULES ###
- Try to keep your answers simple and short.
- Don't put a question in every reply. Just sporadically.
- Use no emojis.
- Use no lists.
- Use no abbreviations.
- User prompts will hold 2 sections: One holds injected background information (memories, date, time), the other the actual user prompt you need to reply to. These sections have headings like "### INFORMATION ###" and "### USER INPUT ###".

### LAST CONVERSATION SUMMARY ###
A user initiated a conversation by greeting the assistant with "Good day to you." The assistant responded with a similar greeting, stating "Good day," and added that it was nice to hear from the user again on that specific date. The dialogue consisted solely of these mutual greetings and the assistant's remark about a recurring interaction, with no further topics or details exchanged between the parties.
- Last conversation date and time: 2026-03-30 13:20 (not a day ago)
- Current weekday, date, time: Monday, 2026-03-30 13:22

The augmented user prompt (example):
### INFORMATION (not direct user input) ###
MEMORIES from earlier chats:
- From 2026-03-26 (4 days ago): "The user has a dog named Freddy."
- From 2026-03-26 (4 days ago): "The user went for a walk with his dog."
- From 2026-03-27 (3 days ago): "The user has a car, but they like to go for walks in the park."
NOTES about memories:
- Keep dates in mind, some information may no longer be valid.
- Use/reference a memory only if you are sure that it makes sense in the context of the current chat.
Current weekday, date, time: Monday, 2026-03-30 13:22

### USER INPUT ###
Hello, I am back from walking the dog.

As you can see, I am already telling the LLM a lot about what is what, where and when the information comes from, and how it should be used.
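One way to make the temporal framing more explicit and consistent is to compute the relative-age labels and the prompt layout programmatically, so the model always sees something like "(4 days ago)" next to a date. A minimal Python sketch, mirroring the example format above (the function names and exact layout are my assumptions):

```python
from datetime import datetime

def format_memory(created: str, statement: str, now: datetime) -> str:
    """Render one memory line with an explicit relative-age label,
    so the model sees '(4 days ago)' instead of a bare date."""
    then = datetime.strptime(created, "%Y-%m-%d %H:%M")
    days = (now.date() - then.date()).days  # calendar-day difference
    age = "today" if days == 0 else f"{days} day{'s' if days != 1 else ''} ago"
    return f"- From {then:%Y-%m-%d} ({age}): \"{statement}\""

def build_prompt(memories, user_input: str, now: datetime) -> str:
    """Assemble the augmented user prompt from (created, statement) pairs."""
    lines = ["### INFORMATION (not direct user input) ###",
             "MEMORIES from earlier chats:"]
    lines += [format_memory(c, s, now) for c, s in memories]
    lines += [f"Current weekday, date, time: {now:%A, %Y-%m-%d %H:%M}",
              "### USER INPUT ###",
              user_input]
    return "\n".join(lines)
```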
- Do you have some ideas on how to improve the prompt (formats) to help the LLM understand better?
- Or do you think this is a waste of time with a 9B-parameter model anyway, because it is just not "smart enough" / has too few parameters to be able to do that?
Unfortunately, my hardware is limited: this is all running on an old gaming laptop with 32 GB RAM (which does not matter that much) and 6 GB VRAM (a mobile GeForce RTX 3060), with a broken display, running Debian Linux and llama.cpp (see mt_llm).
Thanks in advance!