Experiment: Does repeated usage influence ChatGPT 5.4 outputs in a RAG-like setup?

Dev.to / 5/4/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Models & Research

Key Points

  • We integrated ChatGPT 5.4 into a website chatbot and observed its RAG-like behavior across different environments: a main website, a 1,000-product e-commerce demo store, and a 570-page cooking blog.
  • The goal was to simulate realistic usage by repeatedly testing recurring patterns: product comparisons, price-based filtering, queries spanning multiple entities (products and categories), and more complex shopping-intent scenarios.
  • What stands out is that when a real user asked how the bot could help their e-commerce site, the answer closely matched the interaction patterns we had previously exercised manually.
  • Possible explanations for repeated usage influencing outputs include prompt conditioning over time, context shaping via RAG, latent activation of repeated semantic structures, and session- or interaction-level biasing; whether the effect is measurable remains an open question.
  • For developers, the takeaway is that if such an effect exists, testing and observation become part of system behavior design rather than just evaluation, and it is worth sharing whether others have noticed the same phenomenon.

We’ve been running a series of experiments using ChatGPT 5.4 integrated into a website chatbot across different environments:

🌐 a main website
🛒 a 1,000-product e-commerce demo store
🍳 a 570-page cooking blog

🎯 Goal: simulate realistic user behavior and observe how the model responds over time.

⚙️ Test setup

The chatbot is designed to (no self promo here, just context; see the sketch after this list):

📌 answer strictly based on website content (RAG-like approach)
🧭 guide users through product discovery and content navigation
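
To make the setup concrete, here is a minimal sketch of the RAG-like flow described above. Everything here is a hypothetical stand-in (`retrieve_chunks`, `llm`, the system prompt), not our production code:

```python
# Minimal sketch of the "answer strictly from website content" constraint.
# retrieve_chunks() and llm() are hypothetical stand-ins, not real APIs.

SYSTEM_PROMPT = (
    "You are a website assistant. Answer ONLY from the provided context. "
    "If the context does not contain the answer, say so."
)

def retrieve_chunks(query: str, k: int = 4) -> list[str]:
    """Placeholder: return the top-k most relevant site chunks from a vector store."""
    raise NotImplementedError("wire this to your vector store")

def llm(system: str, user: str) -> str:
    """Placeholder: one stateless chat-completion call to your model provider."""
    raise NotImplementedError("wire this to your LLM provider")

def answer(query: str) -> str:
    context = "\n\n".join(retrieve_chunks(query))
    return llm(system=SYSTEM_PROMPT,
               user=f"Context:\n{context}\n\nUser question: {query}")
```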

Over time, we intentionally tested recurring patterns:

🔎 product comparisons
💰 price-based filtering
🔀 cross-entity queries (multiple products, categories)
🧠 more complex “shopping intent” scenarios

💡 The idea was to approximate real-world usage, not synthetic benchmarks.
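
As an illustration, those recurring patterns can be expressed as a small scripted suite that gets replayed against the bot over days or weeks. The queries below are illustrative examples, not our actual test set:

```python
# Hypothetical replay suite for the recurring patterns listed above.
RECURRING_PATTERNS = {
    "comparison": "Compare your 3 qt and 5 qt cast iron pots",
    "price_filter": "Show me pots under $80",
    "cross_entity": "Which pot and lid combinations work for braising?",
    "shopping_intent": "I cook for four and want one pot that does everything",
}

def replay(answer_fn, rounds: int = 50) -> list[dict]:
    """Send each pattern repeatedly and log every answer for later analysis."""
    log = []
    for i in range(rounds):
        for name, query in RECURRING_PATTERNS.items():
            log.append({"round": i, "pattern": name, "answer": answer_fn(query)})
    return log
```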

👀 Observation

At some point, a real user (yes, a real one) asked:

“How can you help my ecommerce?”

The answer was:

“I can help your e-commerce by answering visitors [...], [...] for example asking how many people they cook for to recommend the right cast iron pot, or asking for a price range to help them find products [...]”

🔍 What’s interesting

This response closely mirrors the interaction patterns we had been testing manually.

It wasn’t a generic explanation.
It reflected:

👉 guided questioning
👉 contextual recommendations
👉 progressive narrowing of user intent

🧠 Hypothesis

From a system behavior perspective, it feels like repeated usage patterns influence outputs in a given context.

Possible explanations (see the sketch after this list):

🧩 Prompt conditioning over time (consistent system + user patterns)
📚 Context shaping via retrieved content (RAG)
🔁 Latent pattern activation due to repeated semantic structures
🧷 Session-level or interaction-level biasing
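
To ground the less exotic of these explanations: with a stateless chat API, any "memory" has to travel through the prompt itself, via re-sent conversation turns and freshly retrieved chunks. A hypothetical sketch of how that accumulation looks:

```python
# Sketch: in a stateless chat API, "memory" is only what the integration
# re-sends on every call -- prior turns plus freshly retrieved chunks.
# All names here are hypothetical.
def build_messages(system_prompt: str, history: list[dict],
                   chunks: list[str], query: str) -> list[dict]:
    context = "\n\n".join(chunks)
    return (
        [{"role": "system", "content": system_prompt}]
        + history  # session-level conditioning: earlier turns steer later ones
        + [{"role": "user", "content": f"Context:\n{context}\n\n{query}"}]
    )
```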

❓ Open question

This leads to a broader question for builders:

👉 When deploying LLMs in structured environments (chatbots, RAG systems, product assistants), does repeated real-world usage shape outputs in a measurable way?

👉 Or are we just observing better alignment due to consistent prompting + context injection?
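
One way to make this testable: run identical queries in fresh sessions versus sessions primed with the recurring patterns, then compare the answers, for instance via embedding similarity. A hedged sketch, where `embed` is a stand-in for any sentence-embedding call:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def drift_score(fresh_answers: list[str], primed_answers: list[str], embed) -> float:
    """Mean similarity between answers from fresh sessions and from sessions
    primed with the recurring patterns. Persistently low similarity would
    suggest usage history is shaping outputs; high similarity points to
    plain prompting + context injection."""
    sims = [cosine(embed(f), embed(p))
            for f, p in zip(fresh_answers, primed_answers)]
    return sum(sims) / len(sims)
```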

🚀 Why this matters

If usage patterns do influence outputs (even indirectly), then:

🧪 testing is not just evaluation
🏗️ it becomes part of system behavior design
📈 and potentially a lever for optimization

💬 Curious to hear from others

If you’re working with:

RAG pipelines
production chatbots
LLM-powered assistants

Have you noticed similar effects?

Does your system behave differently after repeated real-world usage patterns?

Let’s compare notes 👇