Gemma 4 is terrible with system prompts and tools

Reddit r/LocalLLaMA / 4/10/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The author reports that Gemma 4 (26b-a4b) performs much better on general question answering than on agentic tasks requiring strict instruction adherence.
  • They claim the model degrades more noticeably than other models as the context window fills up.
  • They state Gemma 4 often disregards system prompts, even when multiple variants are tested and the constraints are explicit.
  • They say it rarely performs tool calls, even when explicitly instructed to do so.
  • The overall conclusion is that the model appears optimized for benchmark-style QA rather than reliable tool-using workflows, based on their experimentation.

I tried Gemma 4 (26b-a4b) and I was a bit blown away at how much better it is than other models. However, I soon found three things:

  • it gets significantly worse as context fills up, more so than other models
  • it completely disregards the system prompt, no matter what I put in there
  • it (almost) never does tool calls, even when I explicitly ask it

Note: Other open models share the same flaws, but they feel much more pronounced with Gemma. It feels like it was made to be great at answering general questions (for benchmarks), but terrible at agentic flows - following instructions and calling tools.
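To put a number on "(almost) never does tool calls", I measured how often the model actually emitted a tool call across repeated runs of the same prompt. A minimal sketch of that check, assuming OpenAI-style chat completion responses (the two canned responses below are stand-ins, not Gemma's actual output):

```python
def extract_tool_calls(choice: dict) -> list:
    """Pull tool calls (if any) out of an OpenAI-style chat completion choice."""
    return choice.get("message", {}).get("tool_calls") or []

def tool_call_rate(responses: list[dict]) -> float:
    """Fraction of responses in which the model called at least one tool."""
    if not responses:
        return 0.0
    hits = sum(
        1 for r in responses
        if any(extract_tool_calls(c) for c in r.get("choices", []))
    )
    return hits / len(responses)

# Two canned responses: one that calls a tool, one that answers from memory.
responses = [
    {"choices": [{"message": {"tool_calls": [{"function": {"name": "web_search"}}]}}]},
    {"choices": [{"message": {"content": "Paris is the capital of France."}}]},
]
print(tool_call_rate(responses))  # 0.5
```

Any OpenAI-compatible local server's responses can be fed straight into this; with Gemma the rate stayed near zero no matter the prompt.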

I tried countless system prompts and messages, including snippets like the following (some used individually, some combined in the same prompt):

<task>
You must perform multiple tool calls, parallelizing as much as possible, and present their results, as they include accurate, factual, verified information.
You must follow a ZERO-ASSUMPTION protocol. DON'T USE anything that you didn't get from a TOOL or DIRECTLY FROM THE USER.
If you don't have information, use TOOLS to get it, or ASK the user. DON'T ANSWER WITHOUT IT.
Use the tools and your reasoning to think and answer the user's question or to solve the task at hand.
DO NOT use your reasoning/internal data for ANY knowledge or information - that's what tools are for.
</task>

<tools>
You have tools at your disposal - they're your greatest asset.
ALWAYS USE TOOLS to gather information. NEVER TRUST your internal/existing knowledge, as it's outdated.
RULE: ALWAYS PERFORM TOOL calls. Don't worry about doing "too many" calls.
RULE: Perform tool calls in PARALLEL. Think about what you need and what actions you want to perform, then try to group as many as possible.
</tools>

<reasoning>
**CRUCIAL:** BEFORE ENDING YOUR REASONING AND ATTEMPTING TO ANSWER, YOU MUST WRITE:
> CHECK: SYSTEM RULES
THEN, YOU MUST compare your reasoning with the above system rules. ADJUST AS NEEDED.
Most likely, you MUST:
- perform (additional) tool calls, AND
- realise assumptions, cancel them.
NEVER ANSWER WITHOUT DOING THIS - THIS IS A CRITICAL ERROR.
</reasoning>

These may not be the best prompts; they're just where a lot of frustration and trial and error got me, without results however:

https://preview.redd.it/se1hq0v358ug1.png?width=842&format=png&auto=webp&s=dc3a11a12e871b79ef8a35f7b34666d5e55616bd

In the reasoning for the example above (which used the full system prompt from earlier), there is no mention of the words "tool", "system", "check", or anything similar. That is especially odd, since the model description states:

  • Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.
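That absence is easy to verify mechanically. A small sketch of the keyword scan I ran over the reasoning traces (the reasoning string here is a stand-in, not the model's verbatim output):

```python
# Keywords the reasoning should mention if the system prompt was honored.
REQUIRED_MARKERS = ("tool", "system", "check")

def missing_markers(reasoning: str) -> list[str]:
    """Return the expected keywords that never appear in a reasoning trace."""
    text = reasoning.lower()
    return [m for m in REQUIRED_MARKERS if m not in text]

# Stand-in trace resembling what the model actually produced:
reasoning = "The user asks about X. I recall that X is... I will answer directly."
print(missing_markers(reasoning))  # ['tool', 'system', 'check']
```

A trace that followed the prompt (e.g. one containing "CHECK: SYSTEM RULES" and a tool call) would return an empty list; Gemma's traces consistently did not.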

I then asked it what its system prompt is, and it answered correctly, so it had access to it the whole time. It hallucinated when it tried to explain why it didn't follow it. I did get slightly better results by copy-pasting the system prompt into the user message.
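The copy-paste workaround is easy to automate: fold every system message into the first user message before sending the request. A minimal sketch (message shapes follow the common OpenAI-style role/content format):

```python
def fold_system_into_user(messages: list[dict]) -> list[dict]:
    """Workaround for models that ignore the system role: prepend all
    system-message text to the first user message, then drop the
    system messages entirely."""
    system_text = "\n\n".join(
        m["content"] for m in messages if m["role"] == "system"
    )
    folded, injected = [], False
    for m in messages:
        if m["role"] == "system":
            continue  # dropped; its text now rides along with the user turn
        if m["role"] == "user" and not injected and system_text:
            m = {**m, "content": f"{system_text}\n\n{m['content']}"}
            injected = True
        folded.append(m)
    return folded

msgs = [
    {"role": "system", "content": "ALWAYS USE TOOLS to gather information."},
    {"role": "user", "content": "What's the weather in Berlin?"},
]
print(fold_system_into_user(msgs))
```

In my tests this nudged the model toward following the constraints slightly more often, though still nowhere near reliably.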

Does anyone else have a different experience? Found any prompts that could help it listen or call tools?

submitted by /u/RealChaoz