I Fixed 5 Chained AI Bugs in My Sales Chatbot — Each Solution Revealed the Next Problem

Dev.to / 4/25/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author reports that debugging their WhatsApp-based AI sales chatbot for Provia turned one apparent issue into a chain of five interconnected bugs, where each fix exposed the next underlying problem.
  • The first bug involved “summary pollution,” where the chatbot’s conversation summary contaminated the pgvector semantic search query, causing recommendations to ignore the user’s latest intent.
  • The debugging process highlights how GPT-4o-mini function calling, conversation context tracking, and PostgreSQL/pgvector embedding search can interact in unexpected ways, producing confident but incorrect product retrieval.
  • Overall, the post emphasizes the importance of isolating failure modes across memory/context handling and retrieval logic rather than assuming a single faulty component.

TL;DR: I spent a full day debugging my AI sales chatbot. What looked like one bug turned out to be five, stacked on top of each other. Each fix revealed the next problem underneath. Here's the full story.

You know that feeling when you fix a bug and your app gets worse?

Not in the "oops I introduced a regression" way. In the "oh no, the previous bug was masking another bug" way. And then you fix that one, and there's another one underneath. Like pulling threads on a sweater until you're holding a pile of yarn and wondering if you ever really had a sweater at all.

That's what happened to me during Session 6 of building Provia — an AI-powered e-commerce platform where store owners get a fully autonomous sales chatbot. The chatbot talks to customers over WhatsApp, recommends products from a real database, handles objections, and closes sales. Under the hood, it's GPT-4o-mini with function calling, backed by PostgreSQL with pgvector embeddings for semantic product search.

It was supposed to be a "quick debugging session." It turned into an eight-hour archaeology dig through five layers of interconnected bugs. Here's the full story.

The Setup: What Provia's AI Does

Before we dive in, here's what the system does at a high level:

  1. A customer sends a message (e.g., "show me something for a wedding")
  2. The AI searches the product database using semantic embeddings
  3. The AI generates a response with product recommendations
  4. The conversation continues, with the AI tracking context, preferences, and conversation stage

The product database uses pgvector — each product has a 1536-dimension embedding generated from its name, description, category, vibe, and other metadata using OpenAI's text-embedding-3-small model. When a customer asks for something, we embed their query and find the closest products in vector space.
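As a concrete sketch of that embedding step, here's roughly how a product row could be flattened into embedding input. The `Product` shape and field names are my illustration, not Provia's actual schema; the point is that this combined text, not the raw database row, is what text-embedding-3-small turns into the stored 1536-dimension vector.

```typescript
// Hypothetical sketch — field names are illustrative, not Provia's schema.
interface Product {
  name: string;
  description: string;
  category: string;
  vibe: string;
}

// Flatten the metadata into one text block; this string is what gets
// embedded and stored in the pgvector column.
function buildEmbeddingInput(p: Product): string {
  return [
    `Name: ${p.name}`,
    `Description: ${p.description}`,
    `Category: ${p.category}`,
    `Vibe: ${p.vibe}`,
  ].join("\n");
}

console.log(buildEmbeddingInput({
  name: "Premium Wool Blend Suit",
  description: "Slim-fit charcoal suit for formal occasions",
  category: "Suits",
  vibe: "formal",
}));
```

At query time the customer's message goes through the same embedding model, and pgvector's cosine-distance operator (`<=>`) ranks products by proximity.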

Simple enough, right? Well, the devil lives in the implementation.

Bug 1: Summary Pollution — When Memory Becomes Contamination

The Symptom

A tester was chatting with the bot about suits. Ten messages into the conversation, they pivoted: "actually, show me some hoodies."

The bot responded with... more suits. Confidently. As if the word "hoodies" hadn't been spoken.

The Investigation

I dove into the logs. The search query being sent to pgvector wasn't just the customer's message. It was the customer's message plus a conversation summary that the system had been maintaining.

The summary looked like this:

Customer is looking for a $300 formal suit for a wedding occasion. 
They prefer dark colors and slim fit. Budget is flexible for the right piece.

This summary was being concatenated with the customer's latest message before embedding. So the actual search query became:

Customer is looking for a $300 formal suit for a wedding occasion. 
They prefer dark colors and slim fit. Budget is flexible for the right piece.
show me hoodies

When you embed that block of text, what do you get? An embedding that's 80% "formal suits" and 20% "hoodies." The vector math doesn't care that the customer changed their mind. It cares about token frequency and semantic weight. And the summary — being longer and more detailed — dominated the embedding completely.
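The arithmetic behind that imbalance is easy to eyeball (token counts approximated by words here):

```typescript
// Approximate the token imbalance by counting words in each part of the
// concatenated search query. The 80/20 figure above is about embedding
// weight, but the raw length imbalance tells the same story.
const summary =
  "Customer is looking for a $300 formal suit for a wedding occasion. " +
  "They prefer dark colors and slim fit. Budget is flexible for the right piece.";
const latestMessage = "show me hoodies";

const wordCount = (s: string) => s.trim().split(/\s+/).length;

console.log(wordCount(summary));       // 26
console.log(wordCount(latestMessage)); // 3
// The summary supplies roughly 90% of the text being embedded.
```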

The Fix

I killed the conversation summary. Completely. Ripped it out.

But I didn't throw away the concept of memory. Instead, I replaced it with a structured Customer Profile — a lean set of bullet points tracking style preferences, colors, budget, likes, and dislikes:

interface CustomerProfile {
  style_preferences: string[];
  colors: string[];
  budget: string | null;
  likes: string[];
  dislikes: string[];
  occasion: string | null;
}

The critical design decision: this profile gets injected into the response prompt (so the AI can personalize its replies), but it never touches the search query. Search and memory became two completely separate paths.
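A minimal sketch of that injection path: a hypothetical `renderProfile` helper (re-declaring the interface so the snippet stands alone) that renders the profile as bullets for the response prompt. Nothing here ever touches the search query.

```typescript
// Hypothetical helper — used on the RESPONSE path only, never on search.
interface CustomerProfile {
  style_preferences: string[];
  colors: string[];
  budget: string | null;
  likes: string[];
  dislikes: string[];
  occasion: string | null;
}

function renderProfile(p: CustomerProfile): string {
  const lines = [
    p.style_preferences.length && `- Style: ${p.style_preferences.join(", ")}`,
    p.colors.length && `- Colors: ${p.colors.join(", ")}`,
    p.budget && `- Budget: ${p.budget}`,
    p.likes.length && `- Likes: ${p.likes.join(", ")}`,
    p.dislikes.length && `- Dislikes: ${p.dislikes.join(", ")}`,
    p.occasion && `- Occasion: ${p.occasion}`,
  ].filter(Boolean) as string[];
  // Empty fields are dropped so the prompt stays lean.
  return lines.length ? `Customer profile:\n${lines.join("\n")}` : "";
}
```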

I felt good. Bug squashed. Time to test.

That feeling lasted about four minutes.

Bug 2: Raw Messages Make Terrible Search Queries

The Symptom

With the summary gone, the search now used the customer's raw message as the query. The next test message was:

acctaly i dont want a hoodie i have a wedding ocation

The search returned a mix of hoodies and wedding outfits. Which sounds reasonable until you realize the customer explicitly said they don't want a hoodie.

The Investigation

This one was immediately obvious once I looked at it with fresh eyes. The customer's message contains:

  • "hoodie" — something they explicitly DON'T want
  • "wedding" — something they DO want
  • "acctaly", "dont", "ocation" — typos everywhere

Text embeddings don't understand negation. They don't know that "don't want a hoodie" means the opposite of "hoodie." To the embedding model, the word "hoodie" fires up the same semantic neighborhood regardless of whether it's preceded by "I love" or "I don't want."

And the typos? text-embedding-3-small handles them surprisingly well in isolation, but when you combine misspelled negations with misspelled targets in a single query, the embedding becomes a semantic smoothie. It picks up everything and commits to nothing.

The Fix

I introduced a dedicated Search Call — a separate, lightweight AI call whose only job is to interpret what the customer wants and produce a clean search query.

const searchInterpretation = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "system",
      content: `You are a search query interpreter. Given a customer message, 
      extract ONLY what they want to find. Ignore negations (what they don't want). 
      Output a short, clean search phrase.`
    },
    {
      role: "user",
      content: `Customer said: "${customerMessage}"`
    }
  ],
  max_tokens: 150,
});

// Fall back to the raw message if the model returns nothing
const cleanQuery =
  searchInterpretation.choices[0].message.content?.trim() ?? customerMessage;

Input: ~60 tokens. Output: ~20 tokens. Cost: negligible.

For "acctaly i dont want a hoodie i have a wedding ocation," the search call returns: "wedding occasion outfit". Clean, correct, typo-free.

Two bugs down. System's looking solid. Let me just add a little context to help the search call...

Bug 3: Bot Reply Dominance — The Loudest Voice in the Room

The Symptom

I figured the search call could benefit from a bit of context. So I fed it two messages: the bot's previous reply and the customer's latest message.

The customer said: "hoodies"

The bot's previous reply was:

Great choice! For a wedding, I'd recommend our Premium Wool Blend Suit in charcoal — 
it's $289 and perfect for formal occasions. We also have the Classic Navy Blazer Set 
at $245 which pairs beautifully with dress pants. Would you like to see more formal options?

Search results: suits and blazers. Not a hoodie in sight.

The Investigation

Count the tokens. The bot's reply: ~50 words about suits, prices, formal wear. The customer's message: 1 word — "hoodies."

When you embed that combined text, the suit-related tokens outnumber the hoodie token roughly 50 to 1. The embedding lands squarely in "formal menswear" vector space, with "hoodies" contributing approximately nothing.

This is a fundamental issue with how embeddings work. They represent the average semantic meaning of the entire input text. A single word cannot fight against a paragraph.
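A toy illustration of why (made-up 2-D vectors, not real embeddings — real models do more than average tokens, but the dilution effect is the same):

```typescript
type Vec = number[];

const dot = (a: Vec, b: Vec) => a.reduce((s, x, i) => s + x * b[i], 0);
const cosine = (a: Vec, b: Vec) =>
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));

// Pretend directions in embedding space
const suit: Vec = [1, 0];
const hoodie: Vec = [0, 1];

// 50 suit-flavored tokens plus a single "hoodies", blended into one vector
const tokens: Vec[] = [...Array(50).fill(suit), hoodie];
const mean: Vec = tokens
  .reduce((acc, v) => acc.map((x, i) => x + v[i]), [0, 0])
  .map((x) => x / tokens.length);

console.log(cosine(mean, suit));   // ≈ 0.9998 — the blend is still "suits"
console.log(cosine(mean, hoodie)); // ≈ 0.02  — "hoodies" barely registers
```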

The Fix

Zero history for the search call. Absolutely none.

// SEARCH CALL — customer's latest message ONLY
const searchMessages = [
  {
    role: "system" as const,
    content: "Extract what the customer wants to search for. Short phrase only."
  },
  {
    role: "user" as const,
    content: `Customer said: "${latestCustomerMessage}"`
  }
];

This created what I started calling the Two-Context Architecture:

            Search Context                   Response Context
Purpose     Decide WHAT to search for        Decide HOW to respond
Input       Customer's latest message only   6 messages + profile + search results
History     None                             Recent session window
Cost        ~60 tokens                       ~500 tokens

The search call is deliberately amnesiac. The response AI handles context. The search AI handles intent. Separation of concerns, but for AI calls.

Bug 4: The Pajama Problem — When "Night" Means Everything

The Symptom

The search call was working beautifully. But one product kept showing up where it didn't belong: the "Cozy Night Deluxe Loungewear Set."

It's pajamas. Comfortable, stay-at-home pajamas.

It showed up in results for:

  • "date night outfit" (because "night")
  • "evening wear" (because "night" is semantically close to "evening")
  • "casual summer outfit" (because "cozy" and "casual" are neighbors)

The Investigation

This was an embedding similarity threshold problem. I had set the threshold at 0.1 — meaning any product with a cosine similarity above 0.1 was returned as a match.

For context, with text-embedding-3-small, truly relevant products score around 0.3-0.5, somewhat relevant products score 0.15-0.3, and noise lives below 0.15.

At 0.1, I was scooping up enormous amounts of noise. The pajama set sat at around 0.15-0.22 similarity with a huge range of queries.
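To make the effect concrete, here's the same candidate list filtered at both thresholds. The scores are invented to mirror the ranges above, for a query like "date night outfit":

```typescript
// Invented similarity scores matching the ranges described above.
interface Scored {
  name: string;
  similarity: number;
}

const candidates: Scored[] = [
  { name: "Silk Evening Dress", similarity: 0.41 },
  { name: "Black Cocktail Blazer", similarity: 0.34 },
  { name: "Cozy Night Deluxe Loungewear Set", similarity: 0.18 }, // the pajamas
  { name: "Graphic Tee", similarity: 0.09 },
];

const matches = (threshold: number) =>
  candidates.filter((c) => c.similarity >= threshold).map((c) => c.name);

console.log(matches(0.1)); // pajamas make the cut
console.log(matches(0.3)); // only genuinely relevant products survive
```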

The Fix

Single threshold at 0.3. No near-match tier. Clean cuts only.

But a high threshold means sometimes you get no results. So I built a fallback chain:

async function searchProducts(query: string, storeId: string) {
  // Tier 1: Semantic search with strict threshold
  let results = await semanticSearch(query, storeId, 0.3);

  if (results.length === 0) {
    // Tier 2: ILIKE text match (catches exact keyword matches)
    results = await textSearch(query, storeId);
  }

  if (results.length === 0) {
    // Tier 3: Return available categories
    const categories = await getStoreCategories(storeId);
    return { results: [], categories, fallback: true };
  }

  return { results, categories: null, fallback: false };
}

Four bugs fixed. The search pipeline was now clean, fast, and accurate. Then I looked at the actual responses.

Bug 5: The Response That Ignores Its Own Data

The Symptom

Customer conversation, 10 messages deep, all about suits. Customer says: "actually, show me hoodies."

Search call returns hoodies (correctly!). Hoodies are injected into the response prompt as search results.

The bot responds: "I think you'll love our Classic Charcoal Suit for formal occasions..."

The search found the right products. The response ignored them completely.

The Investigation

Here's what the model was seeing:

  1. System prompt: Store persona, sales instructions, tone guidance
  2. Chat history: 10 messages about suits (~400 tokens)
  3. Search results: 3 hoodies (~150 tokens)
  4. Latest customer message: "actually, show me hoodies" (6 tokens)

The model followed the dominant topic. Ten messages of suit conversation created a strong gravitational pull. The hoodies in the search results were a small island in a sea of formal wear.

The Fix

I injected the customer's latest message directly into the system prompt, with an explicit instruction:

const systemPrompt = `You are ${persona.name}, a sales assistant for ${storeName}.

${persona.instructions}

---
The customer's latest message: "${latestCustomerMessage}"
IMPORTANT: Your reply MUST directly address this latest message. 
If the customer asked about a new topic or product, focus on THAT topic, 
not the previous conversation.
---

${searchResults ? `Available products matching their request:
${formatProducts(searchResults)}` : ''}
`;

System prompts receive disproportionate attention from language models. By putting the customer's latest message there — not just in the chat history — it becomes a directive the model actually follows.

The Final Architecture

Customer message
    |
    v
SEARCH CALL (~60 tokens)
    Input: "Customer said: '[msg]'. Call search_products."
    History: NONE
    |
    v
Search pipeline:
    Semantic search (threshold 0.3)
    -> ILIKE fallback
    -> Category fallback
    |
    v
RESPONSE CALL (~500 tokens)
    System: persona + profile + "Latest: [msg]" + search results
    History: 6 most recent session messages
    |
    v
Response + product cards

Two AI calls per message. One dumb (search), one smart (response). Each with its own carefully scoped context window.
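That flow, as a dependency-injected sketch — function names and shapes are mine, not Provia's actual code, but the context-scoping rules live right in the signatures:

```typescript
// Illustrative sketch: the AI calls and search tiers are injected as plain
// functions, so the scoping rules are explicit in one place.
interface Deps {
  interpretQuery: (latestMessage: string) => Promise<string>;  // search call — no history
  searchProducts: (query: string) => Promise<string[]>;        // tiered search pipeline
  respond: (systemPrompt: string, history: string[]) => Promise<string>; // response call
}

async function handleMessage(
  latestMessage: string,
  history: string[],  // recent session window
  profile: string,    // rendered customer profile
  deps: Deps
): Promise<string> {
  // 1. Amnesiac search call: sees the latest customer message ONLY
  const query = await deps.interpretQuery(latestMessage);

  // 2. Tiered product search (semantic -> ILIKE -> categories)
  const products = await deps.searchProducts(query);

  // 3. Response call: profile and the latest message pinned in the system prompt
  const systemPrompt = [
    `Customer profile: ${profile}`,
    `The customer's latest message: "${latestMessage}" — address it directly.`,
    products.length ? `Matching products:\n${products.join("\n")}` : "",
  ].join("\n");

  return deps.respond(systemPrompt, history.slice(-6));
}
```

Swapping the `Deps` functions for stubs also makes topic-switch behavior easy to test without hitting the OpenAI API or the database.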

The Numbers

Metric                   Before    After
Tokens per message       ~1,820    ~830
Cost per 100K messages   ~$30      ~$14
Reduction                          ~55%

Counterintuitive but true: adding a second AI call cut total token usage by 55%, because each call now carries only the context it needs. Less context, better results, lower cost.

Lessons Learned

1. AI Bugs Are Layered Like Onions

Each bug was invisible until I fixed the one above it. This is different from traditional software — AI bugs form stacks where one bad behavior masks another.

2. Embeddings Don't Understand Negation

"I don't want X" and "I want X" produce nearly identical embeddings. Don't embed raw text. Use a language model to interpret intent first.

3. Separation of Concerns Applies to AI Calls

Search needs amnesia. Response needs memory. Mixing them is how you get suits when someone asks for hoodies.

4. System Prompts Are Your Steering Wheel

When a long conversation history pulls the model in one direction, the system prompt is the only thing powerful enough to redirect it.

5. Test Topic Switches, Not Just Topic Continuation

The bugs only appeared when the customer changed their mind. Topic switches are where AI systems break. Make them a first-class test case.

Five bugs. Five fixes. Eight hours. One architecture that actually works.

And probably another five bugs hiding underneath, waiting for the right query to reveal them.

I'm building Provia — an AI-powered sales platform — from Gaza. I document every bug, every fix, and every architecture decision. Follow me @AliMAfana for the real version of building in public.
