https://huggingface.co/firstbober/gemma-3-270M-it-smol-thinker
Here is an example of the output:
```
==================== THINKING ==================== Here is the thinking process:
- This is a large community with a wide range of interests
- Users can ask questions, share experiences, and discuss local events
- The rules are generally open-ended and allow for creativity
- However, the rules may be unclear or incomplete <|thinking_end|>
==================== RESPONSE ====================
r/LocalLLaMA is a large, open-source question answering subreddit. Its rules are generally open-ended, allowing users to ask questions and share their experiences. However, the rules might be unclear or incomplete depending on the current state of the community.
<|response_end|>
```
It doesn't have much knowledge baked in, but with prompting it can give some interesting results.
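For anyone wiring the model into a script, here is a minimal sketch of how the tagged output above could be split into the thinking trace and the final response. The `split_output` helper is my own placeholder name, assuming the literal `<|thinking_end|>` and `<|response_end|>` markers shown in the example:

```python
def split_output(text: str):
    """Split raw model output into (thinking, response) using the
    special tags the adapter was trained to emit."""
    thinking, _, rest = text.partition("<|thinking_end|>")
    response, _, _ = rest.partition("<|response_end|>")
    return thinking.strip(), response.strip()

raw = "- some reasoning steps <|thinking_end|> the final answer <|response_end|>"
think, resp = split_output(raw)
```

If a tag is missing, `partition` just returns the whole string as the first piece, so malformed generations degrade gracefully instead of crashing.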
Lore:
I've been working on it for a few days. At first I just wanted to adapt it locally for function calling without using FunctionGemma. When that worked out (more or less), I moved on to adding some thinking. The dataset was procedurally generated, plus some examples from Qwen 3.6 35B A3B (Q4 quants) and GLM 5.1.
The biggest hurdle was figuring out how to make it keep the format. I settled on rank 24, a 768-token max length for the training data, and a customized loss function that applies a 20x penalty for not using the proper tags. Because of that, the loss stayed at around 7, but the effect is there.
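The post doesn't show the loss, so here is one guess at what a "20x penalty for not using proper tags" could look like: a per-token cross-entropy where positions whose target is a tag token get 20x weight. The token ids in `TAG_IDS` are made up for illustration; the real ones come from the tokenizer:

```python
import math

TAG_IDS = {5, 6}   # hypothetical ids for <|thinking_end|> / <|response_end|>
TAG_WEIGHT = 20.0  # the 20x factor from the post

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_token_loss(logits_seq, target_ids):
    """Cross-entropy averaged with per-token weights: tag tokens
    contribute TAG_WEIGHT times as much as ordinary tokens."""
    total, weight_sum = 0.0, 0.0
    for logits, tgt in zip(logits_seq, target_ids):
        p = softmax(logits)[tgt]
        w = TAG_WEIGHT if tgt in TAG_IDS else 1.0
        total += w * -math.log(p)
        weight_sum += w
    return total / weight_sum
```

With this shape, a sequence that gets a tag token wrong is pulled much harder toward the tag than one that flubs a normal token, which is the behavior the post describes.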
I wanted to add longer examples, but my RTX 3050 4GB Mobile is kinda not enough; with a train batch size of 1 and gradient accumulation steps of 2, this is the best I could do.
Another interesting thing: Claude/Gemini kept saying that a bigger gradient_accumulation_steps essentially means a larger effective batch size without actually increasing the per-device batch size. This accounted for like 40% of all of my headaches, with the model spitting out utter garbage and random Chinese slop characters.
Well, I think that's all, here are all the relevant training parameters:
```
SFTConfig:
per_device_train_batch_size=1, gradient_accumulation_steps=2, per_device_eval_batch_size=1, learning_rate=1e-4, lr_scheduler_type="cosine", warmup_ratio=0.10, weight_decay = 0.1, load_best_model_at_end=True,
LoraConfig:
n_rank = 24 r=n_rank, lora_alpha=n_rank, target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], lora_dropout=0.15, task_type="CAUSAL_LM",
```
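For context, a sketch of how parameters like these typically plug into trl + peft; `MODEL_ID` and `train_ds` are placeholders I've added, not anything from the post:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

n_rank = 24
peft_config = LoraConfig(
    r=n_rank,
    lora_alpha=n_rank,  # alpha kept equal to rank
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.15,
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=1,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.10,
    weight_decay=0.1,
    load_best_model_at_end=True,
)

trainer = SFTTrainer(
    model=MODEL_ID,          # placeholder: base Gemma checkpoint id
    args=args,
    peft_config=peft_config,
    train_dataset=train_ds,  # placeholder dataset
)
trainer.train()
```

Note that `load_best_model_at_end=True` also requires an eval dataset and a matching eval/save strategy, which the post's snippet doesn't show.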
Oh, also: increasing alpha to 2x the rank, as recommended in the paper, kinda broke everything; this was another thing that was pretty frustrating to figure out.
I plan to continue and train some more adapters with other ideas; maybe I'll switch to Qwen 3.5 0.8B when I buy a card with enough VRAM? I don't know. One thing I'll definitely do is a thinking adapter for FunctionGemma, as it would fix my issues with function calling to some degree.