AI Navigate

Resource-Efficient Iterative LLM-Based NAS with Feedback Memory

arXiv cs.LG / 3/13/2026

📰 NewsIdeas & Deep AnalysisTools & Practical UsageModels & Research

Key Points

  • The paper presents a closed-loop NAS pipeline that uses large language models (LLMs) to iteratively generate, evaluate, and refine convolutional neural network architectures for image classification on a single consumer-grade GPU without LLM fine-tuning.
  • It introduces a sliding-window feedback memory of K=5 that records structured diagnostic triples (problem, suggested modification, outcome) to preserve and learn from failure trajectories.
  • It uses a dual-LLM setup—a Code Generator for executable PyTorch architectures and a Prompt Improver for diagnostic reasoning—to reduce per-call cognitive load and VRAM usage.
  • It reports substantial accuracy improvements across CIFAR-10/100 and ImageNette with an 18-hour GPU budget on an RTX 4090, demonstrating a low-budget, reproducible NAS workflow, including specific gains like CIFAR-10: DeepSeek-Coder-6.7B from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%.

Abstract

Neural Architecture Search (NAS) automates network design, but conventional methods demand substantial computational resources. We propose a closed-loop pipeline leveraging large language models (LLMs) to iteratively generate, evaluate, and refine convolutional neural network architectures for image classification on a single consumer-grade GPU without LLM fine-tuning. Central to our approach is a historical feedback memory inspired by Markov chains: a sliding window of K{=}5 recent improvement attempts keeps context size constant while providing sufficient signal for iterative learning. Unlike prior LLM optimizers that discard failure trajectories, each history entry is a structured diagnostic triple -- recording the identified problem, suggested modification, and resulting outcome -- treating code execution failures as first-class learning signals. A dual-LLM specialization reduces per-call cognitive load: a Code Generator produces executable PyTorch architectures while a Prompt Improver handles diagnostic reasoning. Since both the LLM and architecture training share limited VRAM, the search implicitly favors compact, hardware-efficient models suited to edge deployment. We evaluate three frozen instruction-tuned LLMs ({\leq}7B parameters) across up to 2000 iterations in an unconstrained open code space, using one-epoch proxy accuracy on CIFAR-10, CIFAR-100, and ImageNette as a fast ranking signal. On CIFAR-10, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%. A full 2000-iteration search completes in {\approx}18 GPU hours on a single RTX~4090, establishing a low-budget, reproducible, and hardware-aware paradigm for LLM-driven NAS without cloud infrastructure.