From Code to Prediction: Fine-Tuning LLMs for Neural Network Performance Classification in NNGPT

arXiv cs.CV / 5/6/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The paper proposes a new LLM fine-tuning task within the NNGPT framework: predicting which of two image classification datasets a neural network architecture will perform better on, rather than evaluating generative artifacts after training.
  • It leverages the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics, and tests three prompt strategies of increasing difficulty: a normalized-accuracy baseline, a metadata-only prompt, and a code-only prompt (sketched after this list).
  • Fine-tuning DeepSeek-Coder-7B-Instruct with LoRA shows that the code-only prompt performs best, reaching a peak accuracy of 80% over 15 epochs, while the metadata prompt peaks at 70%.
  • Per-dataset results indicate metadata helps most when dataset properties are distinctive, while code-only prompts remain more balanced; an additional comparison with DeepSeek-Coder-1.3B suggests that this form of architectural reasoning depends on model capacity.
  • Overall, the study finds that fine-tuned LLMs can infer cross-dataset neural-network suitability from architecture source code, implying the code carries more discriminative information than dataset metadata alone.
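
The pairwise setup is straightforward to picture in code. Below is a minimal, hypothetical sketch of the three prompt configurations and the label derivation; the paper's exact templates are not reproduced in this summary, so the function names, wording, and fields are illustrative assumptions only.

```python
# Hypothetical sketch of the three prompt variants; the paper's exact
# templates are not shown here, so wording and field names are assumptions.

def build_prompt(arch_code, dataset_a, dataset_b, variant="code_only",
                 accuracies=None, metadata=None):
    """Assemble a binary prompt: on which dataset does this network do better?"""
    question = (
        f"Given the PyTorch model below, on which dataset does it achieve "
        f"higher accuracy: {dataset_a} or {dataset_b}? "
        f"Answer with exactly one dataset name."
    )
    if variant == "baseline":
        # Trivial variant: normalized accuracies appear in the prompt itself,
        # so the answer can be read off directly (hence the 100% result).
        context = f"Normalized accuracies: {accuracies}"
    elif variant == "metadata":
        # Accuracies replaced by dataset properties
        # (e.g. resolution, number of classes, sample count).
        context = f"Dataset properties: {metadata}"
    else:
        # Hardest variant: only the architecture source code and dataset names.
        context = ""
    return "\n".join([question, context, "Model source code:", arch_code])


def label(acc_a, acc_b, dataset_a, dataset_b):
    """Ground truth comes straight from LEMUR's recorded accuracies."""
    return dataset_a if acc_a > acc_b else dataset_b
```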

Abstract

Automated Machine Learning (AutoML) frameworks increasingly leverage Large Language Models (LLMs) for tasks such as hyperparameter optimization and neural architecture code generation. However, current LLM-based approaches focus on generative outputs and evaluate them by training the produced artifacts. Whether LLMs can learn to reason about neural network performance across datasets remains underexplored. We present a classification task integrated into the NNGPT framework, in which a fine-tuned LLM predicts which of two image classification datasets a given neural network architecture achieves higher accuracy on. The task is built on the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics. Three prompt configurations of increasing difficulty are evaluated: a normalized-accuracy baseline (trivially reaching 100%), a metadata-enriched prompt replacing accuracies with dataset properties, and a code-only prompt presenting only architecture source code and dataset names. Using DeepSeek-Coder-7B-Instruct fine-tuned with LoRA, the code-only prompt reaches 80% peak accuracy over 15 epochs, while the metadata prompt peaks at 70%. Per-dataset analysis reveals complementary strengths: metadata excels for datasets with distinctive properties (CelebA-Gender at 90.9%) but degrades for overlapping characteristics, whereas the code-only prompt shows more balanced performance. A comparison with DeepSeek-Coder-1.3B confirms that model capacity affects this form of architectural reasoning. The results establish that LLMs can be fine-tuned to predict cross-dataset suitability from neural network code, suggesting that architecture source code contains richer discriminative signal than dataset metadata alone.
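
For a concrete picture of the training setup, here is a minimal sketch of LoRA fine-tuning with the Hugging Face PEFT library. The checkpoint ID points at DeepSeek's published 7B instruct model, but the LoRA hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the authors' reported configuration.

```python
# Illustrative LoRA setup using Hugging Face PEFT; hyperparameters below
# are assumptions, not the configuration reported in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "deepseek-ai/deepseek-coder-7b-instruct-v1.5"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only adapter weights are trainable
```

Supervised fine-tuning on the prompt/answer pairs (e.g. with `transformers.Trainer`) then updates only the adapter weights while the 7B base model stays frozen, which is what makes training for 15 epochs tractable.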