Instruction-Tuned, but Not More Verifiable Instruction-Following: A Cross-Task Diagnosis for LoRA Adapters

arXiv cs.LG · March 25, 2026

Key Points

  • The paper tests whether “nominal” training labels for LoRA adapters (e.g., instruction-tuned) reliably predict realized cross-task capability gains when the same adapter is evaluated across tasks.
  • Using IFEval as a strict, automatically verifiable target for instruction following, the authors find that nominal labels often fail to forecast improvements on this target, with clear configuration sensitivity that includes near-zero and negative cases.
  • In a controlled instruction-versus-numeric example, an instruction-tuned adapter dramatically improves off-target numeric benchmark performance while verifiable instruction following on IFEval slightly declines, illustrating a “capability drift” mismatch.
  • The mismatch is visible directly in the raw cross-task performance matrix; the authors use a drift score only as a compact summary, not as a new formal metric (see the sketch after this list).
  • Results on broader instruction-following benchmarks are mixed and benchmark-dependent, leading to a practical recommendation to run routine cross-task evaluation before deployment and not treat nominal labels as dependable capability proxies.
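
To make the cross-task view concrete, below is a minimal Python sketch that turns base-versus-adapter scores into per-task gains and collapses them into a single drift summary. The score values are the ILA and numeric figures reported in the abstract; the drift definition (off-target gain minus on-target gain) and all variable names are illustrative assumptions for this summary, not the paper's formalism.

```python
# Hypothetical cross-task scores for one instruction-tuned LoRA
# adapter; the values are the abstract's reported ILA and numeric
# accuracies, all other names are illustrative.
tasks = ["ifeval_strict", "numeric_nm"]
base_scores    = {"ifeval_strict": 0.313, "numeric_nm": 0.133}
adapter_scores = {"ifeval_strict": 0.271, "numeric_nm": 0.632}

# Per-task gain: realized change after adaptation, in the same
# units as the underlying metric (accuracy).
gains = {t: adapter_scores[t] - base_scores[t] for t in tasks}

# One possible compact drift summary (our assumption, not the
# paper's definition): largest off-target gain minus the gain on
# the adapter's nominal target.
nominal_target = "ifeval_strict"
off_target_gain = max(g for t, g in gains.items() if t != nominal_target)
drift = off_target_gain - gains[nominal_target]

print({t: round(g, 3) for t, g in gains.items()})
# {'ifeval_strict': -0.042, 'numeric_nm': 0.499}
print(round(drift, 3))  # 0.541: large positive drift
```

Under this reading, a large positive drift means the adapter moved most on a task other than the one its label advertises, which is exactly the pattern the raw matrix already exposes.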

Abstract

Adapters are often selected and deployed based on nominal labels (e.g., instruction-tuned), which implicitly suggest what capability improves after adaptation. We test whether nominal training objectives reliably align with realized cross-task capability gains by evaluating the same LoRA adapter across tasks. Our strongest evidence concerns strict, automatically verifiable instruction following as measured by IFEval: across multiple seeds, base models, and LoRA settings, nominal labels repeatedly, though not universally, fail to predict improvements on this verifiable target, with clear configuration sensitivity including a near-zero or negative case. In the strongest illustrative case, a controlled instruction-versus-numeric setting, an instruction-tuned adapter substantially improves off-target NM-based numeric benchmark performance from 0.133 to 0.632 while not improving verifiable instruction following on IFEval (ILA: 0.313 to 0.271; PLA: 0.250 to 0.143; values rounded to three decimals). We use capability drift as a descriptive label for this nominal-versus-realized mismatch. The mismatch is visible in the raw cross-task performance matrix; we use a drift score only as a compact summary in the same units as the underlying metrics, not as a new formal metric contribution. Evidence from broader instruction-following benchmarks is benchmark-dependent and mixed, reflecting heterogeneity in how instruction following is operationalized; we therefore do not treat cross-benchmark agreement as a premise. The practical takeaway is to perform routine cross-task evaluation before deployment and to avoid treating nominal labels as reliable capability proxies.
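
The abstract's ILA/PLA pair plausibly corresponds to IFEval's two standard scores: instruction-level accuracy (the fraction of individual verifiable instructions satisfied) and prompt-level accuracy (the fraction of prompts where every attached instruction is satisfied). The sketch below shows what such automatically verifiable checks look like; the check functions are simplified stand-ins written for this summary, not IFEval's actual verifier code.

```python
import json
import re

# Simplified, IFEval-style verifiable checks (illustrative
# stand-ins, not IFEval's actual verifiers).
def all_lowercase(response: str) -> bool:
    """Instruction: 'your entire response must be in lowercase.'"""
    return response == response.lower()

def valid_json(response: str) -> bool:
    """Instruction: 'wrap your entire output in valid JSON.'"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def min_words(response: str, n: int = 50) -> bool:
    """Instruction: 'answer with at least 50 words.'"""
    return len(re.findall(r"\w+", response)) >= n

def score_prompt(response: str, checks) -> tuple[bool, float]:
    """Prompt-level pass requires every check to hold; the
    instruction-level score averages over individual checks."""
    results = [check(response) for check in checks]
    return all(results), sum(results) / len(results)

passed_all, frac = score_prompt('{"answer": "forty-two"}',
                                [all_lowercase, valid_json])
print(passed_all, frac)  # True 1.0
```

Strictness is the key design property here: because each instruction is programmatically checkable, scores do not depend on a judge model, which is what makes IFEval a "verifiable" target in the paper's sense.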