From Code to Prediction: Fine-Tuning LLMs for Neural Network Performance Classification in NNGPT

arXiv cs.CV / 5/6/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The paper proposes a new LLM fine-tuning task within the NNGPT framework: predicting which of two image classification datasets a neural network architecture will perform better on, rather than evaluating generative artifacts after training.
  • It leverages the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics, and tests three prompt strategies of increasing difficulty: a normalized-accuracy baseline, a metadata-only prompt, and a code-only prompt (sketched after this list).
  • Fine-tuning DeepSeek-Coder-7B-Instruct with LoRA shows that the code-only prompt performs best, reaching a peak accuracy of 80% over 15 epochs, while the metadata prompt peaks at 70%.
  • Per-dataset results indicate metadata helps most when dataset properties are distinctive, while code-only prompts remain more balanced; an additional comparison with DeepSeek-Coder-1.3B suggests that this form of architectural reasoning depends on model capacity.
  • Overall, the study finds that fine-tuned LLMs can infer cross-dataset neural-network suitability from architecture source code, implying the code carries more discriminative information than dataset metadata alone.
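
The pairwise setup is straightforward to picture in code. Below is a minimal, hypothetical sketch of the three prompt configurations and the label derivation; the paper's exact templates are not reproduced in this summary, so the function names, wording, and fields are illustrative assumptions only.

```python
# Hypothetical sketch of the three prompt variants; the paper's exact
# templates are not shown here, so wording and field names are assumptions.

def build_prompt(arch_code, dataset_a, dataset_b, variant="code_only",
                 accuracies=None, metadata=None):
    """Assemble a binary prompt: on which dataset does this network do better?"""
    question = (
        f"Given the PyTorch model below, on which dataset does it achieve "
        f"higher accuracy: {dataset_a} or {dataset_b}? "
        f"Answer with exactly one dataset name."
    )
    if variant == "baseline":
        # Trivial variant: normalized accuracies appear in the prompt itself,
        # so the answer can be read off directly (hence the 100% result).
        context = f"Normalized accuracies: {accuracies}"
    elif variant == "metadata":
        # Accuracies replaced by dataset properties
        # (e.g. resolution, number of classes, sample count).
        context = f"Dataset properties: {metadata}"
    else:
        # Hardest variant: only the architecture source code and dataset names.
        context = ""
    return "\n".join([question, context, "Model source code:", arch_code])


def label(acc_a, acc_b, dataset_a, dataset_b):
    """Ground truth comes straight from LEMUR's recorded accuracies."""
    return dataset_a if acc_a > acc_b else dataset_b
```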

Abstract

Automated Machine Learning (AutoML) frameworks increasingly leverage Large Language Models (LLMs) for tasks such as hyperparameter optimization and neural architecture code generation. However, current LLM-based approaches focus on generative outputs and evaluate them by training the produced artifacts. Whether LLMs can learn to reason about neural network performance across datasets remains underexplored. We present a classification task integrated into the NNGPT framework, in which a fine-tuned LLM predicts which of two image classification datasets a given neural network architecture achieves higher accuracy on. The task is built on the LEMUR dataset, which provides standardized PyTorch implementations with reproducible performance metrics. Three prompt configurations of increasing difficulty are evaluated: a normalized-accuracy baseline (trivially reaching 100%), a metadata-enriched prompt replacing accuracies with dataset properties, and a code-only prompt presenting only architecture source code and dataset names. Using DeepSeek-Coder-7B-Instruct fine-tuned with LoRA, the code-only prompt reaches 80% peak accuracy over 15 epochs, while the metadata prompt peaks at 70%. Per-dataset analysis reveals complementary strengths: metadata excels for datasets with distinctive properties (CelebA-Gender at 90.9%) but degrades for overlapping characteristics, whereas the code-only prompt shows more balanced performance. A comparison with DeepSeek-Coder-1.3B confirms that model capacity affects this form of architectural reasoning. The results establish that LLMs can be fine-tuned to predict cross-dataset suitability from neural network code, suggesting that architecture source code contains richer discriminative signal than dataset metadata alone.
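
For a concrete picture of the training setup, here is a minimal sketch of LoRA fine-tuning with the Hugging Face PEFT library. The checkpoint ID points at DeepSeek's published 7B instruct model, but the LoRA hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the authors' reported configuration.

```python
# Illustrative LoRA setup using Hugging Face PEFT; hyperparameters below
# are assumptions, not the configuration reported in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "deepseek-ai/deepseek-coder-7b-instruct-v1.5"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only adapter weights are trainable
```

Supervised fine-tuning on the prompt/answer pairs (e.g. with `transformers.Trainer`) then updates only the adapter weights while the 7B base model stays frozen, which is what makes training for 15 epochs tractable.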