AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

arXiv cs.AI / 4/28/2026

Key Points

  • The paper introduces AgentPulse, a continuous evaluation framework that scores 50 AI agents in deployment using 18 real-time signals aggregated across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards.
  • Instead of relying only on static benchmark capability, AgentPulse evaluates agents along four factors: Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health (a minimal scoring sketch follows this list).
  • The study finds that the four factors capture largely complementary information: all pairwise correlations satisfy |ρ| ≤ 0.37 except the Adoption-Ecosystem pair (ρ = 0.61).
  • A circularity-controlled test shows that the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, still predicts external adoption proxies it does not aggregate: GitHub stars, Stack Overflow question volume, and (illustratively, given sparse coverage) VS Code installs.
  • The authors emphasize that AgentPulse is a methodology (not a definitive ground-truth ranking) and release the framework, collected data, scoring outputs, and evaluation harness under CC BY 4.0.
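
To make the aggregation in the second point concrete, here is a minimal sketch of how per-factor and composite scores could be computed from raw signals. The signal names, min-max normalization, and equal weighting below are illustrative assumptions; the actual 18 signals and their weighting are defined in the released framework.

```python
import pandas as pd

# Hypothetical grouping of raw signals into the four AgentPulse factors.
# Signal names and the equal weighting are assumptions for illustration only.
FACTOR_SIGNALS = {
    "benchmark_performance": ["swe_bench_score", "leaderboard_percentile"],
    "adoption_signals": ["registry_downloads_30d", "ide_installs", "github_dependents"],
    "community_sentiment": ["social_sentiment", "forum_sentiment"],
    "ecosystem_health": ["commit_frequency", "issue_close_rate", "contributor_count"],
}

def min_max_normalize(col: pd.Series) -> pd.Series:
    """Scale one signal to [0, 1] across agents; constant columns map to 0."""
    span = col.max() - col.min()
    return (col - col.min()) / span if span > 0 else col * 0.0

def agentpulse_scores(signals: pd.DataFrame) -> pd.DataFrame:
    """Compute per-factor scores and an equally weighted composite.

    `signals` is indexed by agent name, with one column per raw signal.
    """
    scores = pd.DataFrame(index=signals.index)
    for factor, cols in FACTOR_SIGNALS.items():
        present = [c for c in cols if c in signals.columns]
        # Factor score = mean of its normalized signals.
        scores[factor] = signals[present].apply(min_max_normalize).mean(axis=1)
    # Composite = equally weighted mean of the four factor scores.
    scores["composite"] = scores[list(FACTOR_SIGNALS)].mean(axis=1)
    return scores
```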

Abstract

Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload categories along four factors (Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health) aggregated from 18 real-time signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards. Three analyses ground the framework. The four factors capture largely complementary information ($n=50$; $\rho_{\max}=0.61$ for Adoption-Ecosystem, all others $|\rho| \leq 0.37$). A circularity-controlled test ($n=35$) shows the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, predicts external adoption proxies it does not aggregate: GitHub stars ($\rho_s=0.52$, $p<0.01$) and Stack Overflow question volume ($\rho_s=0.49$, $p<0.01$), with VS Code installs ($\rho_s=0.44$, $p<0.05$) reported as illustrative given that only 11 of 35 agents have non-zero installs. On the $n=11$ subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated ($\rho_s=0.25$; 9 of 11 agents shift by at least 2 ranks), driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset. This is precisely why we rest the framework's validity claim on the broader $n=35$ test rather than the SWE-bench overlap. AgentPulse surfaces deployment signal absent from benchmarks; it is a methodology, not a ground-truth ranking. The framework, all collected signals, scoring outputs, and evaluation harness are released under CC BY 4.0.
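
As a companion to the analyses above, the following sketch shows how the circularity-controlled check and the rank-shift comparison could be run once per-agent scores are available. The use of Spearman correlation matches the $\rho_s$ values reported in the abstract; the function names and tie handling are assumptions for illustration, not the paper's released harness.

```python
import numpy as np
from scipy.stats import spearmanr

def circularity_controlled_check(sub_composite, external_proxy):
    """Spearman correlation between a sub-composite (e.g. Benchmark+Sentiment,
    which aggregates no GitHub-derived signals) and an external adoption proxy
    it does not contain (e.g. GitHub stars or Stack Overflow question volume)."""
    rho, p_value = spearmanr(sub_composite, external_proxy)
    return rho, p_value

def rank_shift_count(composite_scores, benchmark_scores, min_shift=2):
    """Count agents whose position changes by at least `min_shift` ranks
    between the composite ranking and a benchmark-only ranking."""
    # Rank 1 = highest score; ties are broken by input order for simplicity.
    composite_rank = np.argsort(np.argsort(-np.asarray(composite_scores))) + 1
    benchmark_rank = np.argsort(np.argsort(-np.asarray(benchmark_scores))) + 1
    return int(np.sum(np.abs(composite_rank - benchmark_rank) >= min_shift))
```

For example, on the SWE-bench subset the abstract reports that 9 of 11 agents shift by at least 2 ranks; that is the quantity `rank_shift_count` computes.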