Gemma 4 2B correctly handles structured JSON output, tool calling, and reasoning traces via Spring AI / LM Studio, and finds a real Java bug in code review

Reddit r/LocalLLaMA / 2026/5/24


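Key points

The developer reports running Google's Gemma 4 2B locally in LM Studio (OpenAI-compatible endpoint) and calling it from a Spring Boot app through Spring AI's ChatClient.
In a structured-output test using Spring AI's BeanOutputConverter, the model returned schema-conformant JSON with no markdown wrapping and produced an accurate Java code review, including detection of a real string comparison bug.
For tool calling, it correctly selected the registered weather tool, extracted the location ("Riga"), actually invoked the tool, and returned the result in natural language.
LM Studio returned a reasoning trace in a reasoning_content field, showing step-by-step thinking before the final output, so the application can capture the model's explicit reasoning.
The author asks whether anyone has benchmarked Gemma 4 2B against Phi-4 and Qwen 2.5 3B on structured output reliability, what the smallest reliable model is for parallel (multi-tool) calls, and whether anyone has measured p99 latency under production load.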

Wanted to share a result I didn't expect to work.

Running google/gemma-4-e2b locally through LM Studio, exposed via OpenAI-compatible endpoint, called from a Spring Boot app using Spring AI's ChatClient abstraction. Three things I tested:

STRUCTURED OUTPUT (schema-conformant JSON)

Used BeanOutputConverter to force the model to return a CodeReview object with specific fields (issues, qualityScore, suggestions, summary). Sent it a Java snippet with a == vs .equals() string comparison bug.
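To make the target shape concrete, here is a minimal plain-Java sketch of what such a CodeReview type might look like, plus a naive check for "bare JSON with all four fields". The field names come from the post; the Java types, the class name `CodeReviewShape`, and the helper `looksLikeBareCodeReviewJson` are assumptions for illustration. The check is a crude string test, not a JSON parser and not part of Spring AI.

```java
import java.util.List;

class CodeReviewShape {

    // Field names are from the post; the types are assumed.
    record CodeReview(List<String> issues, int qualityScore,
                      List<String> suggestions, String summary) {}

    static final List<String> REQUIRED_KEYS =
            List.of("issues", "qualityScore", "suggestions", "summary");

    // Naive pre-check: the reply must be a single {...} object with no
    // surrounding text (such as a markdown fence) and must mention every key.
    static boolean looksLikeBareCodeReviewJson(String raw) {
        String text = raw.strip();
        if (!text.startsWith("{") || !text.endsWith("}")) {
            return false; // fenced or chatty replies fail here
        }
        return REQUIRED_KEYS.stream()
                .allMatch(key -> text.contains("\"" + key + "\""));
    }

    public static void main(String[] args) {
        String bare = "{\"issues\":[],\"qualityScore\":50,"
                + "\"suggestions\":[],\"summary\":\"ok\"}";
        String fenced = "```json\n" + bare + "\n```";
        System.out.println(looksLikeBareCodeReviewJson(bare));   // prints true
        System.out.println(looksLikeBareCodeReviewJson(fenced)); // prints false
    }
}
```

A check like this only catches wrapping and missing keys. Type errors (for example a string where `qualityScore` should be a number) would still need real deserialization to detect.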

Result: Perfect JSON, no markdown wrapping, all fields populated correctly. Correctly identified the bug AND suggested a Streams refactor. Quality score 50/100 — interestingly identical to what Claude Sonnet 4.6 returned on the same input, while GPT-4o was less strict and gave 55.
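For readers who have not run into this bug class, the sketch below shows a `==` string comparison inside an index-based loop, next to an `.equals()` version rewritten with Streams. The snippet actually sent to the model is not shown in the post, so the class name, method names, and data here are hypothetical.

```java
import java.util.List;

class StringComparisonDemo {

    // Buggy: == compares object references, not string contents.
    static int countMatchesBuggy(List<String> names, String target) {
        int count = 0;
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i) == target) {
                count++;
            }
        }
        return count;
    }

    // Fixed: equals() compares contents, and the index loop becomes a stream.
    static long countMatchesFixed(List<String> names, String target) {
        return names.stream().filter(target::equals).count();
    }

    public static void main(String[] args) {
        // new String(...) creates distinct objects, which makes the bug visible;
        // with interned literals the buggy version can appear to work.
        List<String> names = List.of(
                new String("riga"), new String("oslo"), new String("riga"));
        String target = new String("riga");

        System.out.println(countMatchesBuggy(names, target)); // prints 0
        System.out.println(countMatchesFixed(names, target)); // prints 2
    }
}
```

The buggy version is easy to miss in testing because string literals are interned, so `==` often happens to return true until the value arrives from user input, a file, or the network.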

TOOL CALLING

Registered a weather function with @Tool annotation. Asked "should I bring an umbrella in Riga?".

Result: Model correctly decided to invoke the tool, extracted "Riga" as the location parameter, received the mock weather response, and wrapped it back into natural language. No hand-holding, no "I would call the weather tool if I had access" — it actually called it.
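The application side of that round trip can be pictured as a lookup and dispatch step: the model emits a tool name and arguments, the app runs the matching function, and the result goes back to the model to be phrased in natural language. The sketch below is a plain-Java simplification under stated assumptions, not Spring AI's implementation. The `ToolCall` record, the `getWeather` name, and the mock response text are all hypothetical.

```java
import java.util.Map;
import java.util.function.Function;

class ToolDispatchSketch {

    // A tool call as a model might emit it: a tool name plus one argument.
    record ToolCall(String name, String location) {}

    // Hypothetical mock tool standing in for the post's weather function.
    static String getWeather(String location) {
        return "Rain expected in " + location;
    }

    // Registry of callable tools, keyed by the name the model would use.
    static final Map<String, Function<String, String>> TOOLS =
            Map.of("getWeather", ToolDispatchSketch::getWeather);

    // Looks up the requested tool and runs it with the extracted argument.
    static String dispatch(ToolCall call) {
        Function<String, String> tool = TOOLS.get(call.name());
        if (tool == null) {
            throw new IllegalArgumentException("Unknown tool: " + call.name());
        }
        return tool.apply(call.location());
    }

    public static void main(String[] args) {
        // Simulates the model choosing the tool and extracting "Riga".
        String result = dispatch(new ToolCall("getWeather", "Riga"));
        System.out.println(result); // prints Rain expected in Riga
    }
}
```

In a real setup the framework handles the registry and dispatch, and the tool result is sent back to the model as another message rather than printed.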

REASONING TRACES

LM Studio's response included a reasoning_content field showing step-by-step thinking before the final JSON output. Not just generated tokens — the model worked through the analysis explicitly:

Thinking Process:

Analyze the Request: The user wants a review...
Analyze the Code: ...
Identify Issues/Improvements:

- Issue 1 (String Comparison): == vs .equals()

- Issue 2 (Style/Readability): index-based loop vs streams

Formulate Suggestions...

The full demo is in a video I made walking through the setup, including a WiFi-off test to prove the inference is genuinely local: https://youtu.be/lW0FMjDUzik

What I'm curious about:

- Has anyone benchmarked Gemma 4 2B vs Phi-4 vs Qwen 2.5 3B for structured output reliability specifically? My anecdotal experience is Gemma is more schema-faithful, but I haven't run rigorous tests.

- For tool calling with parallel function calls (multiple tools in one response), where does the smallest reliable model sit right now?

- Anyone running this size of model in production behind real workloads? I'm specifically interested in latency p99 numbers under load, not just single-request demos.
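On the p99 question, here is a minimal sketch of how that number could be computed from recorded request latencies, using the nearest-rank definition of a percentile. The class name, method name, and sample data are hypothetical, and nearest-rank is only one of several percentile definitions; load-testing tools may interpolate instead.

```java
import java.util.Arrays;

class LatencyPercentile {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = latenciesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Hypothetical data: 100 requests taking 1 ms, 2 ms, ..., 100 ms.
        long[] samples = new long[100];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = i + 1;
        }
        System.out.println(percentile(samples, 50)); // prints 50
        System.out.println(percentile(samples, 99)); // prints 99
    }
}
```

A p99 is only meaningful with enough samples collected under concurrent load. With fewer than 100 requests, the nearest-rank p99 is simply the slowest request.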

submitted by /u/Proof-Possibility-54