Gemma 4 2B handling structured JSON output + tool calling + reasoning traces correctly via Spring AI / LM Studio — including identifying a real Java bug in code review

Reddit r/LocalLLaMA / 5/24/2026


Key Points

  • A developer reported successfully running Google’s Gemma 4 2B locally via LM Studio (OpenAI-compatible endpoint) and calling it from a Spring Boot app using Spring AI’s ChatClient.
  • In structured-output tests using Spring AI’s BeanOutputConverter, the model returned schema-conformant JSON without markdown wrapping and accurately generated a Java code review (including identifying a real string comparison bug).
  • For tool calling, the model correctly selected a registered weather tool, extracted the location parameter (“Riga”), invoked the tool, and then returned the result in natural language without prompting.
  • LM Studio also exposed a reasoning_content field showing step-by-step thinking prior to the final output, suggesting the app can capture explicit reasoning traces.
  • The author asked whether others have benchmarked Gemma 4 2B versus Phi-4 or Qwen 2.5 3B for structured-output reliability, and what the smallest reliable model is for parallel tool calls and production-grade latency (p99) under load.

Wanted to share a result I didn't expect to work.

Running google/gemma-4-e2b locally through LM Studio, exposed via its OpenAI-compatible endpoint and called from a Spring Boot app using Spring AI's ChatClient abstraction. Three things I tested:

  1. STRUCTURED OUTPUT (schema-conformant JSON)

Used BeanOutputConverter to force the model to return a CodeReview object with specific fields (issues, qualityScore, suggestions, summary). Sent it a Java snippet with a == vs .equals() string comparison bug.

Result: Perfect JSON, no markdown wrapping, all fields populated correctly. Correctly identified the bug AND suggested a Streams refactor. Quality score 50/100 — interestingly identical to what Claude Sonnet 4.6 returned on the same input, while GPT-4o was less strict and gave 55.
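For readers who have not run into this bug, here is a minimal sketch of the kind of snippet described above. It is not the code used in the test: the class and method names (`StringComparisonDemo`, `countMatchesBuggy`, `countMatchesFixed`) and the sample data are made up for illustration. It shows why `==` on strings is a bug and what a Streams-based refactor of an index loop might look like.

```java
import java.util.List;

// Hypothetical example, not the snippet from the original test.
final class StringComparisonDemo {

    // Buggy: == compares object references, not the characters in the strings.
    static int countMatchesBuggy(List<String> names, String target) {
        int count = 0;
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i) == target) {
                count++;
            }
        }
        return count;
    }

    // Fixed: equals() compares contents, and the index loop becomes a stream.
    static long countMatchesFixed(List<String> names, String target) {
        return names.stream().filter(target::equals).count();
    }

    public static void main(String[] args) {
        // new String(...) creates a distinct object with the same contents,
        // which is what strings read from input or a database usually are.
        List<String> names = List.of(new String("admin"), "guest", new String("admin"));
        System.out.println(countMatchesBuggy(names, "admin")); // prints 0
        System.out.println(countMatchesFixed(names, "admin")); // prints 2
    }
}
```

The buggy version can appear to work in small tests, because string literals are interned and share one reference. That is part of what makes it a useful code-review probe.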

  2. TOOL CALLING

Registered a weather function with @Tool annotation. Asked "should I bring an umbrella in Riga?".

Result: Model correctly decided to invoke the tool, extracted "Riga" as the location parameter, received the mock weather response, and wrapped it back into natural language. No hand-holding, no "I would call the weather tool if I had access" — it actually called it.
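The round trip described above can be sketched without any framework. This is not Spring AI's implementation and does not use its `@Tool` machinery; it is a dependency-free illustration of the dispatch step a framework performs once the model has emitted a tool call. The `ToolCall` record, the `getWeather` name, and the mock response text are all assumptions made for the example.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of tool dispatch; names and response text are invented.
final class ToolDispatchSketch {

    // Assumed shape of what the model emits when it decides to call a tool.
    record ToolCall(String name, Map<String, String> arguments) {}

    // Mock weather tool, standing in for the registered function.
    static String getWeather(Map<String, String> args) {
        return "Rain expected in " + args.get("location");
    }

    // Registry mapping tool names to implementations.
    static final Map<String, Function<Map<String, String>, String>> TOOLS =
            Map.of("getWeather", ToolDispatchSketch::getWeather);

    // Looks up the requested tool and runs it with the extracted arguments.
    static String dispatch(ToolCall call) {
        Function<Map<String, String>, String> tool = TOOLS.get(call.name());
        if (tool == null) {
            return "Unknown tool: " + call.name();
        }
        return tool.apply(call.arguments());
    }

    public static void main(String[] args) {
        // The model's part: pick the tool and extract "Riga" from the question.
        ToolCall call = new ToolCall("getWeather", Map.of("location", "Riga"));
        // The app's part: execute the tool; the result goes back to the model,
        // which then phrases the final natural-language answer.
        System.out.println(dispatch(call)); // prints Rain expected in Riga
    }
}
```

What a small model has to get right is only the first half: choosing the tool name and filling in the arguments. Everything after that is ordinary application code.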

  3. REASONING TRACES

LM Studio's response included a reasoning_content field with step-by-step thinking, returned separately from the final JSON output. The model worked through the analysis explicitly before answering:

Thinking Process:

  1. Analyze the Request: The user wants a review...

  2. Analyze the Code: ...

  3. Identify Issues/Improvements:

- Issue 1 (String Comparison): == vs .equals()

- Issue 2 (Style/Readability): index-based loop vs streams

  4. Formulate Suggestions...

The full demo is in a video I made walking through the setup, including a WiFi-off test to prove the inference is genuinely local: https://youtu.be/lW0FMjDUzik

What I'm curious about:

- Has anyone benchmarked Gemma 4 2B vs Phi-4 vs Qwen 2.5 3B for structured output reliability specifically? My anecdotal experience is Gemma is more schema-faithful, but I haven't run rigorous tests.

- For tool calling with parallel function calls (multiple tools in one response), where does the smallest reliable model sit right now?

- Anyone running this size of model in production behind real workloads? I'm specifically interested in latency p99 numbers under load, not just single-request demos.
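For anyone collecting numbers to answer the last question, here is one way p99 could be computed from recorded request latencies. It is a sketch using the nearest-rank definition of a percentile; the class name `LatencyPercentile` and the sample values are synthetic, not measurements from this setup.

```java
import java.util.Arrays;

// Hypothetical helper; the latencies in main() are synthetic.
final class LatencyPercentile {

    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    static long percentile(long[] latenciesMs, int percentile) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percentile < 1 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in 1..100");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Integer ceiling of percentile * n / 100, avoiding floating-point rounding.
        int rank = (percentile * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        long[] samples = new long[100];
        for (int i = 0; i < 100; i++) {
            samples[i] = i + 1; // 1..100 ms
        }
        samples[98] = 4000; // two slow requests, e.g. queued behind other work
        samples[99] = 5000;
        System.out.println("p50 = " + percentile(samples, 50) + " ms"); // prints p50 = 50 ms
        System.out.println("p99 = " + percentile(samples, 99) + " ms"); // prints p99 = 4000 ms
    }
}
```

The median hides the two slow requests entirely while p99 surfaces them, which is why single-request demos say little about behavior under load.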

submitted by /u/Proof-Possibility-54