RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

arXiv cs.AI / 5/4/2026


Key Points

  • The study proposes RadLite, showing that 3–4B parameter small language models can deliver strong multi-task radiology performance by using LoRA fine-tuning rather than relying on resource-heavy LLM deployment.
  • Researchers fine-tuned Qwen2.5-3B-Instruct and Qwen3-4B on 162K samples covering nine radiology tasks (including RADS classification, impression generation, NLI/NER, staging, abnormality detection, and radiology Q&A) compiled from 12 public datasets.
  • LoRA fine-tuning substantially outperforms zero-shot baselines, with reported gains such as RADS accuracy +53%, NLI +60%, and N-staging +89%.
  • The two models provide complementary capabilities (Qwen2.5 better at structured generation, Qwen3 stronger at extractive tasks), and a task-specific oracle ensemble of both yields the best overall results.
  • For real-world deployment, the models can be quantized to GGUF (~1.8–2.4GB), enabling CPU-only inference at roughly 4–8 tokens/second on consumer hardware. The authors also find that few-shot prompting with the fine-tuned models can reduce performance, suggesting LoRA adaptation works better than in-context learning for this domain.
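The LoRA adaptation the bullets describe boils down to learning a low-rank additive update to each frozen weight matrix. The sketch below shows the core arithmetic (W + (alpha/r)·B·A, with B zero-initialized so training starts exactly at the base model); the matrix sizes are toy values, not the actual Qwen layer shapes, and the variable names follow the standard LoRA formulation rather than the paper's code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged LoRA weight.

    W: (d_out, d_in) frozen base weight
    A: (r, d_in) trainable down-projection (random init)
    B: (d_out, r) trainable up-projection (zero init, so the model
       starts training exactly at the base weights)
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy example: d_out=2, d_in=3, rank r=1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]          # (1, 3)
B_zero = [[0.0], [0.0]]        # zero init: merged weight equals W
B_trained = [[2.0], [4.0]]     # illustrative values after training

assert lora_effective_weight(W, A, B_zero, alpha=2, r=1) == W
merged = lora_effective_weight(W, A, B_trained, alpha=2, r=1)
print(merged[0])  # -> [3.0, 2.0, 2.0]
```

Because only A and B are trained (r·(d_in + d_out) parameters per layer instead of d_in·d_out), fine-tuning a 3–4B model becomes tractable, and the update can be merged back into W for deployment at zero inference cost.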

Abstract

Large language models (LLMs) show promise in radiology but their deployment is limited by computational requirements that preclude use in resource-constrained clinical environments. We investigate whether small language models (SLMs) of 3-4 billion parameters can achieve strong multi-task radiology performance through LoRA fine-tuning, enabling deployment on consumer-grade CPUs. We train Qwen2.5-3B-Instruct and Qwen3-4B on 162K samples spanning 9 radiology tasks - RADS classification across 10 systems, impression generation, temporal comparison, radiology NLI, NER, abnormality detection, N/M staging, and radiology Q&A - compiled from 12 public datasets. Both models are evaluated on up to 500 held-out test samples per task with standardized metrics. Our key findings are: (1) LoRA fine-tuning dramatically improves performance over zero-shot baselines (RADS accuracy +53%, NLI +60%, N-staging +89%); (2) the two models exhibit complementary strengths - Qwen2.5 excels at structured generation tasks while Qwen3 dominates extractive tasks; (3) a task-routed oracle ensemble combining both models achieves the best performance across all tasks; (4) few-shot prompting with fine-tuned models hurts performance, demonstrating that LoRA adaptation is more effective than in-context learning for specialized domains; and (5) models can be quantized to GGUF format (~1.8-2.4GB) for CPU deployment at 4-8 tokens/second on consumer hardware. Our work demonstrates that small, efficiently fine-tuned models - which we collectively call RadLite - can serve as practical multi-task radiology AI assistants deployable entirely on consumer hardware without GPU requirements.
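The "task-routed oracle ensemble" amounts to a per-task dispatch table: each task is sent to whichever fine-tuned model scored higher on held-out data for that task. A minimal sketch, with illustrative scores and task names (not the paper's actual numbers):

```python
# Per-task validation scores for the two fine-tuned models.
# All values below are made-up placeholders for illustration.
ILLUSTRATIVE_VAL_SCORES = {
    "rads_classification":   {"qwen2.5-3b": 0.81, "qwen3-4b": 0.77},
    "impression_generation": {"qwen2.5-3b": 0.64, "qwen3-4b": 0.58},
    "radiology_ner":         {"qwen2.5-3b": 0.70, "qwen3-4b": 0.83},
    "radiology_nli":         {"qwen2.5-3b": 0.72, "qwen3-4b": 0.79},
}

def build_router(val_scores):
    """Map each task to the model with the best validation score."""
    return {task: max(scores, key=scores.get)
            for task, scores in val_scores.items()}

def route(task, router, default="qwen3-4b"):
    """Pick a model for a task; fall back to a default for unseen tasks."""
    return router.get(task, default)

router = build_router(ILLUSTRATIVE_VAL_SCORES)
print(route("radiology_ner", router))          # extractive -> qwen3-4b
print(route("impression_generation", router))  # generation -> qwen2.5-3b
```

Note this is an "oracle" in the sense that routing is decided from known per-task scores rather than learned at inference time; in practice both quantized models would need to be resident, roughly doubling the memory footprint.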
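The reported GGUF sizes (~1.8-2.4GB) are consistent with a mid-range 4-bit quantization scheme. A back-of-envelope check, where the bits-per-weight figure is a typical value for common llama.cpp quantization formats (an assumption, not a number from the paper):

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Approximate quantized model size in GB: params * bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9

# ~4.8 effective bits per weight is in the typical range for 4-bit
# llama.cpp quantization (illustrative assumption).
size_3b = gguf_size_gb(3.0e9, 4.8)
size_4b = gguf_size_gb(4.0e9, 4.8)
print(f"3B model: ~{size_3b:.1f} GB, 4B model: ~{size_4b:.1f} GB")
# -> 3B model: ~1.8 GB, 4B model: ~2.4 GB
```

Both estimates land inside the paper's reported ~1.8-2.4GB range, which is what makes CPU-only deployment plausible: the quantized weights fit comfortably in the RAM of an ordinary laptop.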