MetaKube: An Experience-Aware LLM Framework for Kubernetes Failure Diagnosis

arXiv cs.LG · March 26, 2026


Key Points

  • MetaKube is introduced as an experience-aware LLM framework for Kubernetes failure diagnosis that learns from historical resolutions rather than relying only on static knowledge bases.
  • The system combines an Episodic Pattern Memory Network (EPMN) for confidence-calibrated retrieval, a meta-cognitive controller that switches between intuitive and analytical reasoning, and KubeLLM (a locally deployable 8B model) post-trained on a 7,000-sample Kubernetes fault resolution dataset.
  • In evaluations on 1,873 real-world scenarios, MetaKube improved Qwen3-8B scores from 50.9 to 90.5 and claims to approach GPT-4.1-like performance while preserving data privacy via local deployment.
  • Experiments indicate the episodic experiential learning component contributes a 15.3% improvement, with continuous-learning tests showing progressively better results as it accumulates operational knowledge.
  • The authors provide source code and resources publicly on GitHub for reuse and further experimentation.

Abstract

Existing LLM-based Kubernetes diagnostic systems cannot learn from operational experience: they operate on static knowledge bases and do not improve from past resolutions. We present MetaKube, an experience-aware LLM framework built on three synergistic innovations: (1) an Episodic Pattern Memory Network (EPMN) that abstracts diagnostic patterns from historical resolutions and provides confidence-calibrated retrieval for both rapid pattern matching and guided causal exploration, (2) a meta-cognitive controller that dynamically routes between intuitive and analytical pathways based on problem familiarity, optimizing the trade-off between speed and depth, and (3) KubeLLM, a locally deployable 8B model enhanced through domain-specific post-training on our 7,000-sample Kubernetes Fault Resolution Dataset. Evaluation on 1,873 real-world scenarios demonstrates that MetaKube lifts Qwen3-8B from 50.9 to 90.5 points, approaching GPT-4.1 performance while ensuring complete data privacy. EPMN contributes a 15.3% improvement through experiential learning, and continuous-learning experiments show progressive gains as the system accumulates operational knowledge. The source code and related resources are available at https://github.com/MetaKube-LLM-for-Kubernetes-Diagnosis/MetaKube.
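To make the routing idea concrete, the sketch below illustrates the general shape of a confidence-calibrated episodic memory feeding a familiarity-based router. This is a minimal toy, not the paper's implementation: the `Episode` fields, the keyword-overlap similarity, and the `FAMILIARITY_THRESHOLD` value are all illustrative assumptions, standing in for whatever representation and calibration EPMN actually uses.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Episode:
    """One abstracted diagnostic pattern distilled from a past resolution."""
    symptom: str
    root_cause: str
    fix: str
    confidence: float  # calibrated retrieval confidence in [0, 1]

class EpisodicMemory:
    """Toy stand-in for EPMN: stores patterns from resolved incidents and
    returns the best match with a similarity score used as confidence."""

    def __init__(self) -> None:
        self.episodes: List[Episode] = []

    def add(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def retrieve(self, symptom: str) -> Optional[Episode]:
        # Jaccard keyword overlap as a placeholder similarity measure.
        def overlap(e: Episode) -> float:
            a = set(symptom.lower().split())
            b = set(e.symptom.lower().split())
            return len(a & b) / max(len(a | b), 1)

        if not self.episodes:
            return None
        best = max(self.episodes, key=overlap)
        score = overlap(best)
        if score == 0:
            return None
        return Episode(best.symptom, best.root_cause, best.fix, score)

# Assumed routing threshold; the paper does not publish this value.
FAMILIARITY_THRESHOLD = 0.5

def diagnose(memory: EpisodicMemory, symptom: str) -> str:
    """Meta-cognitive routing: reuse a stored fix when the problem looks
    familiar, otherwise fall back to slower analytical exploration."""
    match = memory.retrieve(symptom)
    if match is not None and match.confidence >= FAMILIARITY_THRESHOLD:
        return f"intuitive: {match.fix}"
    return "analytical: run guided causal exploration with the LLM"
```

A familiar symptom (high retrieval confidence) takes the fast intuitive path; an unseen one falls through to the analytical path, mirroring the speed-versus-depth trade-off the abstract describes.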