First time fine-tuning, need a sanity check — 3B or 7B for multi-task reasoning? [D]

Reddit r/MachineLearning / 4/23/2026


Key Points

  • A new fine-tuning project is being planned to teach a single model three related “reasoning” behaviors: reading latent intent beneath questions, preserving multiple perspectives, and disentangling the main thread from noise in messy inputs.
  • The author is deciding between a ~3B parameter LoRA setup and a ~7B model given hardware limits (Mac M4 with 24GB) and a dataset of about 40–60k generated training examples.
  • They are concerned about whether a smaller model can support multiple procedural reasoning modes without confusing them, and whether “related but not identical” tasks are harder to train than fully separate ones.
  • They want advice grounded in experience and pointers to relevant papers or studies on multi-task fine-tuning for reasoning-like data at this dataset/model scale.
  • The post frames the question as a “sanity check” before committing, rather than seeking generic advice to simply try both approaches.

Ok so this is my first post here, been lurking for a while. I’m about to start my first fine-tuning project and I don’t want to commit to the wrong direction so figured I’d ask.

Background on me: I’m not from an ML background, self-taught, been working with LLMs through APIs for about a year. Hit the wall where prompt engineering isn’t enough anymore for what I’m trying to do, so now I need to actually fine-tune something.

Here’s the task. I want the model to learn three related things:

First, reading what’s actually going on underneath someone’s question. Like, when someone asks “should I quit my job” the real question is rarely about the job, it’s about identity or fear or something else. Training the model to see that underneath layer.

Second, holding multiple perspectives at once without collapsing to one too early. A lot of questions have legitimate different angles and I want the model to not just pick one reflexively.

Third, when the input is messy or has multiple tangled problems, figuring out which thread is actually the load-bearing one vs what’s noise.

These three things feel related to me but they’re procedurally different. Same underlying skill (reading what’s really there) applied three ways.

So the actual question: is 3B enough for this or do I need 7B? I was thinking Phi-4-mini for 3B, or Qwen 2.5 7B otherwise. I can generate maybe 40-60k training examples (using a bigger model as teacher, sourcing from philosophy, psych case studies, and strategy lit).
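For what it's worth, here's roughly how I'm planning to structure the teacher-generated data so the three modes stay distinguishable: an explicit task tag on each example. This is just a sketch of my own plan, not an established recipe; the tag strings and the `format_example` helper are things I made up, and the chat-message layout assumes a standard messages-style SFT format.

```python
import json

# Hypothetical task tags, one per reasoning mode, prepended to each
# prompt so the model can condition on which behavior is being asked for.
TASK_TAGS = {
    "latent_intent": "[INTENT]",        # read what's underneath the question
    "multi_perspective": "[PERSPECTIVES]",  # hold several angles at once
    "disentangle": "[THREADS]",         # find the load-bearing thread
}

def format_example(task, prompt, response):
    """Build one chat-style training record with an explicit task tag."""
    tag = TASK_TAGS[task]
    return {
        "messages": [
            {"role": "user", "content": f"{tag} {prompt}"},
            {"role": "assistant", "content": response},
        ]
    }

def write_jsonl(examples, path):
    """Dump records to JSONL, one training example per line."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

Open question for me is whether explicit tags like this help a 3B model keep the modes apart, or whether they just stop it from ever learning to pick the mode on its own.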

Hardware is an M4 Mac with 24GB unified memory. 3B fits comfortably with LoRA, 7B is tight but doable. Happy to rent a GPU if needed.
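My back-of-envelope memory math, in case someone wants to sanity-check it too. Assumptions are mine and rough: base weights in bf16 (2 bytes/param), ~30% overhead for activations and adapter optimizer state, and LoRA adapters on four weight matrices per layer (the usual q/k/v/o attention projections).

```python
def lora_trainable_params(hidden, n_layers, rank, n_target_matrices=4):
    """Rough count of LoRA-trainable params: each targeted matrix gets
    two low-rank adapters, A (hidden x r) and B (r x hidden)."""
    return n_layers * n_target_matrices * 2 * hidden * rank

def rough_vram_gb(n_base_params, bytes_per_param=2, overhead=1.3):
    """Crude footprint estimate: frozen base weights in bf16 plus ~30%
    overhead for activations and adapter optimizer state."""
    return n_base_params * bytes_per_param * overhead / 1e9

# ~3B model: well under 24GB; ~7B model: close to the ceiling.
print(rough_vram_gb(3e9))  # ~7.8 GB
print(rough_vram_gb(7e9))  # ~18.2 GB
```

By this math the adapters themselves are a rounding error (a rank-16 LoRA over 28 layers at hidden size 3584 is only ~26M params); it's the frozen base weights plus activations that decide 3B vs 7B on 24GB.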

What I’m actually worried about:

• Can 3B hold three related reasoning modes without confusing them on stuff that’s outside the training distribution

• Does the “related but not identical” thing make this harder to train than if they were totally separate tasks

• What do I not know that’s gonna bite me

Not really looking for “just try both” type answers. More interested if anyone has actually done multi-task training on reasoning-ish data at this scale and can tell me where it went sideways.

Any pointers appreciated, even just papers to read if the question is too vague.

submitted by /u/retarded_770