Debugging AI Agents in Production: ADK+Gemini Cloud Assist | Google Cloud NEXT '26

Dev.to / 4/25/2026


Key Points

  • The article argues that production failures in AI-agent systems are increasingly caused by plausible but incorrect agent decisions rather than straightforward software bugs.
  • It explains how Google’s Agent Development Kit (ADK) shifts developers away from explicitly writing logic toward defining the agent’s goals, tools, and knowledge, letting the agent decide execution.
  • Using a Marathon Planner Agent example, the piece describes how agents combine instructions with tool access (e.g., Google Maps via MCP) and domain skills (e.g., GIS logic).
  • It highlights that multi-agent behavior complicates debugging because interactions can lead to unintended outcomes even when each step appears reasonable.
  • It presents Gemini Cloud Assist as a debugging layer to help developers diagnose and troubleshoot these agent-driven issues in production.

This is a submission for the Google Cloud NEXT Writing Challenge

Google Cloud NEXT '26 quietly introduced a problem most developers are not ready for.

Your system no longer fails because of a bug.
It fails because an agent made a reasonable decision that turned out to be wrong.

That difference sounds subtle.
It isn’t.

Trust me, this is a real pain. I'm saying this because I built for both the Gemini 3 Hackathon and the Gemini Live Agent Challenge, and I know how easy it is to fall into these traps.

This article walks through that shift using what Google actually demonstrated on stage:

  1. how the Agent Development Kit (ADK) changes development
  2. how multi-agent systems behave in production
  3. and how Gemini Cloud Assist becomes your debugging layer

Code Writing Code, and Code Acting on It

The keynote doesn't begin with infrastructure or APIs. It starts with something more unsettling.

Music is generated using AI. Visuals are rendered live.
And those visuals? Generated by code that Gemini writes in real time based on audio input.

This is the pattern the rest of the keynote follows, but more importantly, it's the pattern we now have to debug.

You can see these visuals in the first two minutes of the keynote (up to 02:00). They were created using Veo, Nano Banana, and Gemini Flash Live, with the music generated in Music AI Sandbox.

ADK: You're Not Writing Logic Anymore

At the center of everything is the Agent Development Kit (ADK).

At first glance, it looks like just another framework. But it changes something fundamental: You don't define how things happen anymore.

You define:

  1. what the agent is supposed to do
  2. what tools it has access to
  3. what knowledge it can use

And then… you let it decide.

During the keynote, Richard and Emma build a Marathon Planner Agent. Not a function. Not a service. An agent.

It is given:

  • instructions (plan a marathon route)
  • tools (Google Maps via MCP)
  • skills (GIS logic, race planning rules)

From there, it figures things out.

No explicit control flow. No step-by-step orchestration.
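To make that concrete, here is a minimal sketch of the declarative shape an ADK-style agent takes. This is not the real ADK API (in the actual SDK you instantiate an `Agent` with an instruction and a tool list); the names `AgentSpec` and `maps_route_tool` are illustrative stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative sketch only: AgentSpec mirrors the declarative shape of an
# ADK agent (goal + tools + knowledge), it is NOT the real ADK class.

@dataclass
class AgentSpec:
    name: str
    instruction: str                                      # what it should do
    tools: list[Callable] = field(default_factory=list)   # what it can call
    knowledge: list[str] = field(default_factory=list)    # what it can use

def maps_route_tool(start: str, end: str) -> dict:
    """Stand-in for a Google Maps lookup exposed via MCP."""
    return {"start": start, "end": end, "distance_km": 42.2}

marathon_planner = AgentSpec(
    name="marathon_planner",
    instruction="Plan a 42.2 km marathon route through the city.",
    tools=[maps_route_tool],
    knowledge=["GIS logic", "race planning rules"],
)
```

Notice what is absent: there is no `run()` method with control flow. You declare the pieces; the agent decides the sequence.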

*[Image: marathon simulation]*

The Subtle but Dangerous Shift

In a normal system, if something goes wrong, you know where to look. In an ADK-based system:

  • The agent may choose the wrong tool
  • or use the right tool incorrectly
  • or interpret the prompt differently
  • or combine context in unexpected ways
  • or fail in some entirely new way we haven't yet figured out

Nothing is strictly "broken". It just… behaves incorrectly.

When One Agent Isn't Enough

The demo quickly evolves beyond a single agent. Instead of forcing one agent to do everything, they split responsibilities:

  1. a Planner Agent proposes routes
  2. an Evaluator Agent scores them
  3. a Simulator Agent runs the world

This is where things start to look less like software and more like a system of collaborators. These agents don't call APIs directly. They discover each other.

Google introduces:

  • A2A (Agent-to-Agent protocol) => how agents communicate
  • Agent Registry => how agents find each other

Think of it as DNS for agents.
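A toy sketch of that idea, assuming nothing about the actual A2A wire format: a registry maps capability names to agent endpoints, the way DNS maps names to addresses. The class and endpoint names are invented.

```python
# Illustrative "DNS for agents": capability name in, endpoint out.
# This is a mental model, not the real Agent Registry API.

class AgentRegistry:
    def __init__(self):
        self._agents: dict[str, str] = {}   # capability -> endpoint

    def register(self, capability: str, endpoint: str) -> None:
        self._agents[capability] = endpoint

    def resolve(self, capability: str) -> str:
        # Like a DNS lookup: agents discover each other by capability,
        # not by hard-coded URLs.
        return self._agents[capability]

registry = AgentRegistry()
registry.register("plan-routes", "https://planner.internal/a2a")
registry.register("score-routes", "https://evaluator.internal/a2a")

print(registry.resolve("plan-routes"))
```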

*[Image: multi-agent workflow]*

The Most Underrated Feature: Agents Build Their Own UI

One of the most interesting moments in the keynote is easy to miss.

The UI isn't manually built. The agent generates it. Using something called A2UI, the agent:

  1. decides how results should be displayed
  2. constructs components
  3. renders them dynamically

This removes an entire layer of development.

Context Engineering Is Where Systems Break

As the system evolves, more data is introduced:

  • city regulations
  • traffic constraints
  • historical patterns

This is handled through:

  • sessions (state across interactions)
  • memory (long-term knowledge)
  • RAG (retrieval from databases)
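
Here is a deliberately tiny sketch of those three layers, with keyword matching standing in for real vector retrieval. None of this reflects ADK's actual session or memory APIs; every name is illustrative.

```python
# Three context layers, as plain data structures:
session = {"turns": []}                              # state across interactions
memory = ["Past races preferred flat courses."]      # long-term knowledge
documents = {
    "regulations": "Closures require a city permit.",
    "traffic": "Main St is congested 8-10am.",
}

def retrieve(query: str) -> list[str]:
    # Toy RAG: keyword match instead of embedding search.
    return [text for key, text in documents.items() if key in query.lower()]

def build_context(user_msg: str) -> str:
    session["turns"].append(user_msg)
    # Each turn, the prompt grows: memory + retrieved docs + history.
    parts = memory + retrieve(user_msg) + session["turns"]
    return "\n".join(parts)
```

Even in this toy version you can see the fragility: every layer silently adds text to the prompt, and nothing bounds how large it gets.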

The agent starts behaving more intelligently.

It also becomes far more fragile. At one point, the agent learns: "You can't have a camel on public roads"

Funny in isolation. Critical when that rule influences route planning.

Debugging Stops Being Mechanical

In a traditional system, you would:

  1. check logs
  2. inspect stack traces
  3. fix the code

Here, none of that is sufficient. You need to answer:

  • why did the agent choose this tool?
  • why did it carry this context forward?
  • why did memory grow uncontrollably?

That's not debugging code. That's debugging reasoning.

Gemini Cloud Assist: The Real Innovation

Google's answer is not better logs. It's an AI system that debugs your AI system. Gemini Cloud Assist acts as:

  • investigator
  • debugger
  • infra operator
  • code assistant

When the failure happens, it:

  • analyzes logs
  • inspects traces
  • reads your code
  • correlates infra issues
  • identifies root cause

And then it suggests a fix.

*[Image: Gemini Cloud Assist]*

What Actually Broke?

The root cause in the demo:

  • context grew too large
  • exceeded Gemini's token limit
  • event compaction wasn't frequent enough

The fix wasn't a rewrite. It was a behavioral adjustment:

  • compress context more frequently
  • reduce memory footprint per step
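
The shape of that fix can be sketched like this. The token estimate and thresholds are invented stand-ins, not Gemini's actual limits or ADK's actual compaction strategy:

```python
# Behavioral fix sketch: compact the event log before it outgrows the
# context window. MAX_TOKENS and COMPACT_EVERY are assumed numbers.

MAX_TOKENS = 1000
COMPACT_EVERY = 5          # compact more frequently -> smaller footprint

def estimate_tokens(events: list[str]) -> int:
    # Crude word-count proxy for token counting.
    return sum(len(e.split()) for e in events)

def compact(events: list[str]) -> list[str]:
    # Replace older events with a one-line summary; keep the recent tail.
    summary = f"[summary of {len(events) - 3} earlier events]"
    return [summary] + events[-3:]

def append_event(events: list[str], event: str, step: int) -> list[str]:
    events.append(event)
    if step % COMPACT_EVERY == 0 or estimate_tokens(events) > MAX_TOKENS:
        events = compact(events)
    return events
```

The point of the sketch: nothing about the agent's logic changed. Only how often its context gets compressed.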

Everything is fine.

Now, if you think I'm going to leave you hanging after all that intro…

You're wrong.

So far we've seen what it can do. Now it's time to use it.

So far, everything we discussed lives in the keynote.

Cool demos. Fancy systems. "Wow, agents!"

But none of that matters unless we can actually build something that behaves like that.

So instead of jumping straight into "multi-agent, cloud-native, distributed magic"… we start small. Controlled. Understandable.

We build a system where:

  • an agent makes a decision
  • that decision actually affects something real
  • and we can see the impact visually

Step 1: Define the World

Before bringing Gemini into the picture, I need a system that can react to decisions.

So I'll build a simple simulation:

  • a route (sequence of coordinates)
  • runners moving along that route
  • a visualization of their positions over time

At this stage, everything is deterministic.

*[Screenshot: route definition]*
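The screenshot isn't reproduced here, but the route was presumably just a short list of waypoints, something like this (all coordinates invented):

```python
# A route as a sequence of (x, y) waypoints. Values are illustrative.
route = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.0), (3.0, 0.4), (4.0, 0.0)]
```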

Then convert this into a dense path:

*[Screenshot: build dense path]*
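A plausible reconstruction of that step: linear interpolation between consecutive waypoints, so runners can advance in small, even increments. The segment resolution is an assumption.

```python
# Densify a sparse waypoint route by linearly interpolating each segment.
def build_dense_path(route, points_per_segment=50):
    path = []
    for (x0, y0), (x1, y1) in zip(route, route[1:]):
        for i in range(points_per_segment):
            t = i / points_per_segment
            path.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    path.append(route[-1])   # include the final waypoint
    return path

route = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.0)]
dense = build_dense_path(route)
```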

And simulate runners:
*[Screenshot: runner simulator]*

Each runner:

  • moves at a slightly different speed
  • has small randomness
  • doesn’t perfectly overlap with others
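
Those three properties can be sketched like this, with each runner advancing along the dense path at its own jittered speed (all constants are illustrative):

```python
import random

def simulate(num_runners: int, path_len: int, steps: int, seed: int = 0):
    rng = random.Random(seed)
    # Each runner gets a slightly different base speed.
    speeds = [1.0 + rng.uniform(-0.2, 0.2) for _ in range(num_runners)]
    positions = [0.0] * num_runners
    history = []
    for _ in range(steps):
        for r in range(num_runners):
            jitter = rng.uniform(-0.1, 0.1)     # small per-step randomness
            positions[r] = min(positions[r] + speeds[r] + jitter,
                               path_len - 1)    # clamp at the finish line
        # Record integer indices into the dense path for visualization.
        history.append([int(p) for p in positions])
    return history
```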

This gives us something that already looks like a race.

*[Image: straight-route race]*

Step 2: Bring in Gemini

Now comes the important part. We don’t ask Gemini to generate coordinates.
That’s a trap.

Instead, we constrain it. We define a few route templates:

*[Screenshot: route templates]*
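For example, the templates might look like this: three fixed shapes the agent can choose between, with the geometry staying entirely under our control (the shapes themselves are invented):

```python
import math

# Closed set of route geometries. Gemini only ever picks a key;
# it never generates coordinates.
ROUTE_TEMPLATES = {
    "straight": [(float(i), 0.0) for i in range(5)],
    "curved":   [(float(i), math.sin(i / 2)) for i in range(5)],
    "loop":     [(math.cos(a), math.sin(a))
                 for a in [i * math.pi / 2 for i in range(5)]],
}
```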

Now Gemini’s job is simple: Pick the type of route.

Step 3: The Planner Agent

*[Screenshot: planner agent]*
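Here is a sketch of what the planner boils down to. The model call is injected as a function so the constraint logic stands on its own; with the google-genai SDK, the real call would be roughly `client.models.generate_content(model=..., contents=prompt).text`.

```python
# Constrained planner: the model answers with one word from a closed set,
# and anything else falls back to a safe default.

ALLOWED = {"straight", "curved", "loop"}

def choose_route(user_prompt: str, ask_model) -> str:
    prompt = (
        "Pick the best marathon route type for this request. "
        f"Answer with exactly one word from {sorted(ALLOWED)}.\n"
        f"Request: {user_prompt}"
    )
    answer = ask_model(prompt).strip().lower()
    # Validate instead of parse: unknown answers never reach the simulator.
    return answer if answer in ALLOWED else "straight"
```

The design choice worth copying: the LLM's output space is a set membership check, not a parsing problem.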

Notice what we did here:

  • limited output space
  • avoided parsing nightmares
  • kept the system predictable

This is exactly how you should use LLMs in systems.

Step 4: Connect Decision => Behavior

Now wire everything together:
*[Screenshot: running everything]*
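A minimal end-to-end sketch of that wiring, with tiny inline stand-ins for the planner and templates (all names and shapes invented): prompt in, template choice out, geometry selected.

```python
# Decision -> behavior: the chosen template name selects the geometry
# that the simulation will run on.

TEMPLATES = {
    "straight": [(0.0, 0.0), (4.0, 0.0)],
    "curved":   [(0.0, 0.0), (2.0, 1.0), (4.0, 0.0)],
}

def choose_route(prompt: str, ask_model) -> str:
    answer = ask_model(prompt).strip().lower()
    return answer if answer in TEMPLATES else "straight"

def run(prompt: str, ask_model):
    kind = choose_route(prompt, ask_model)
    return kind, TEMPLATES[kind]     # feed this route to the simulator

kind, route = run("a flat, fast course", lambda p: "straight")
print(kind, route)
```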

What You’re Actually Seeing

The visualization represents:

  • position => where runners are
  • color => how far they’ve progressed
  • shape => the route chosen by Gemini

Change the prompt, and the route changes. Change the route, and the entire distribution changes.

Curved path selection:
*[Image: curved-path race]*

Step 5: When It Broke (and Nothing Looked Broken)

At some point, the system started behaving… oddly.

Gemini consistently chose curved routes, even when the prompt clearly favored straight ones.

Nothing failed.

No exceptions.
No crashes.
No warnings.

The simulation ran perfectly. But the output distribution was wrong.

At first, it looked like randomness. Then it looked like bias. Eventually, it became clear: the model was over-weighting certain keywords in the prompt and mapping them incorrectly to route templates.

The problem wasn’t in the simulation.
It wasn’t in the data.
It was in how the agent interpreted intent.

Debugging this felt very different from normal debugging:

  • there was no single place to look
  • no clear cause-and-effect chain
  • the faulty behavior only emerged over multiple runs

The fix wasn’t a code change.

It was:

  • tightening the prompt
  • reducing ambiguity
  • making output constraints stricter

The system didn’t become “correct”.
It became less wrong.

That’s the mindset shift with non-deterministic systems: correctness isn’t a state.
It’s a range you try to keep within acceptable bounds.

Why This Matters

At this point, Gemini is not "doing everything". It’s doing something more important:

It decides the conditions under which the system runs.

That’s the shift.

We’ve moved from static code controlling behavior to AI influencing system dynamics.

What You Just Did

You didn't debug code.

You debugged behavior.

You constrained decision space.
You shaped how the agent interprets intent.
You reduced how wrong the system can be.

That’s a fundamentally different skill. Because in these systems, correctness is not guaranteed. It is negotiated.

Note: This isn’t meant to match the keynote. It’s a minimal example showing a bigger idea: shifting from writing fixed logic to building systems that decide how to behave at runtime.

Final Takeaway

Google didn't just launch tools. It revealed a shift:

Software is no longer deterministic execution.
It is probabilistic decision-making.

And that means:

  • debugging is harder
  • observability is critical
  • architecture matters more than ever

Closing Thought

The hardest bug in the future isn't:
"Why did this fail?"
It’s:
"Why did the system think this was correct?"
Because we didn’t just make software more powerful.
We made it capable of being wrong in far more complex ways.

Waiting for the day a hotfix pops up: “Fix the AI pipeline” 😂. Thankfully, we're on Google's stack, so at least I'll have the right tools when it happens.