Parametric Knowledge and Retrieval Behavior in RAG Fine-Tuning for Electronic Design Automation

arXiv cs.CL / 3/25/2026


Key Points

  • The paper studies how retrieval-augmented generation (RAG) fine-tuning behaves when applied to long-form text generation tasks in electronic design automation (EDA), using a 7B model under multiple context augmentation and retrieval conditions.
  • It proposes TriFEX, a human-validated triple-based evaluation pipeline that traces each generated claim back to its origin (query, context, or references) and introduces Parametric Knowledge Precision (PKP) to separate internally learned knowledge from prompt-leaked content.
  • The authors find that common automatic metrics like ROUGE and BERTScore do not reliably detect factual differences that TriFEX/PKP reveal in RAG outputs.
  • They also show that an existing “knowledge internalization” metric is largely retrieval-sensitive: about 75% of cross-condition variance comes from changes in how often internal knowledge is expressed rather than changes in correctness measured by PKP.
  • Experimental results indicate that fine-tuned 7B variants outperform a 72B baseline on most metrics and generalize across conditions, supporting the feasibility of smaller, cost-efficient, on-prem deployment for specialized RAG workloads.
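The PKP idea described above can be sketched as a simple filter-then-score computation. The following is a minimal illustration, not the paper's implementation; the claim representation (an `origin` label and a `correct` flag per extracted triple/claim) is an assumption for demonstration purposes.

```python
# Hedged sketch of Parametric Knowledge Precision (PKP): first filter out
# claims traceable to the prompt (query, context, or references), then score
# the correctness of the remaining "parametric" (internally sourced) claims.

def pkp(claims):
    """claims: list of dicts with hypothetical keys
    'origin' in {'query', 'context', 'references', 'parametric'}
    and a boolean 'correct'. Returns precision over parametric claims."""
    parametric = [c for c in claims if c["origin"] == "parametric"]
    if not parametric:
        return None  # no internally sourced claims to score
    return sum(c["correct"] for c in parametric) / len(parametric)

claims = [
    {"origin": "context", "correct": True},      # prompt-leaked: excluded
    {"origin": "parametric", "correct": True},   # counted, correct
    {"origin": "parametric", "correct": False},  # counted, incorrect
]
print(pkp(claims))  # 0.5
```

The key design point is the exclusion step: a claim that merely copies retrieved context says nothing about what the model internalized, so it must not inflate the precision estimate.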

Abstract

Retrieval-Augmented Generation (RAG) fine-tuning has shown substantial improvements over vanilla RAG, yet most studies target document question answering and often rely on standard NLP metrics that can obscure factual differences. We evaluate RAG fine-tuning for long-form text generation in electronic design automation, adapting a 7B model under five context augmentation strategies with varying retrieval conditions. We introduce TriFEX, a human-validated, triple-based evaluation pipeline that attributes generated claims to their origin (user query, context, and references) and propose Parametric Knowledge Precision (PKP), which isolates internalized knowledge by filtering out claims leaked in the prompt. We show that ROUGE and BERTScore fail to detect factual differences that our triple-based evaluation reveals. Additionally, we demonstrate that an existing metric for knowledge internalization is retrieval-sensitive, with about 75% of its cross-condition variance driven by changes in the rate at which internal knowledge is expressed (PR), rather than by changes in its actual correctness (PKP). The fine-tuned 7B variants outperform a 72B baseline on most metrics, further showing generalization across conditions and on a related benchmark. These results underscore the limitations of available metrics in RAG evaluation and show that smaller models can be adapted reasonably well to specialized tasks for cost-efficient, on-premises deployment.
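One way to see why a knowledge-internalization metric can be retrieval-sensitive, as the abstract argues, is through an illustrative factorization (an assumption for exposition, not the paper's exact formula): if such a score counts expressed-and-correct internal claims over all opportunities, it factors into an expression rate (PR) times a correctness term (PKP), so shifting retrieval conditions can move the score through PR alone while PKP stays fixed.

```python
# Illustrative decomposition (hypothetical, not the paper's definition):
# internalization score = PR * PKP, where PR is the fraction of opportunities
# on which internal knowledge is expressed and PKP is its correctness.

def internalization_score(pr, pkp):
    # Expression rate times correctness of what is expressed.
    return pr * pkp

PKP_FIXED = 0.8  # correctness held constant across conditions
for pr in (0.2, 0.5, 0.9):  # retrieval condition shifts only the expression rate
    print(f"PR={pr} -> score={internalization_score(pr, PKP_FIXED):.2f}")
```

Under this toy model the score varies threefold across conditions even though the model's factual accuracy never changes, which is exactly the confound PKP is designed to remove.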