When Career Data Runs Out: Structured Feature Engineering and Signal Limits for Founder Success Prediction

arXiv cs.LG / 4/2/2026


Key Points

  • The paper studies founder success prediction using limited and weak career data signals, noting that labels are rare (9%) and successful vs. failed founders can look highly similar.
  • It builds 28 structured, JSON-derived features (e.g., jobs, education, and exits) and combines a deterministic rule layer with XGBoost boosted stumps, outperforming a zero-shot LLM baseline with Val F0.5 = 0.3030.
  • A controlled experiment compares LLM-extracted features from the prose field (via Claude Haiku) at 67% and 100% dataset coverage, finding that these features capture 26.4% of model importance but add no cross-validation signal (delta = -0.05pp).
  • The authors attribute the lack of gain to structural information loss: anonymized prose is a lossy re-encoding of the same JSON fields, so it does not introduce genuinely new signal.
  • They conclude that observed performance ceilings (CV ≈ 0.25, Val ≈ 0.30) reflect the dataset’s information content rather than model inadequacy, positioning the work as a benchmark diagnostic for what future, richer datasets must include.
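The reported validation metrics are internally consistent, which is easy to check: the F-beta score with beta = 0.5 weights precision more heavily than recall, and plugging in the reported Precision = 0.3333 and Recall = 0.2222 reproduces the reported F0.5 of 0.3030. A minimal check:

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more than recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Reported validation metrics from the abstract.
p, r = 0.3333, 0.2222
print(round(f_beta(p, r), 4))  # → 0.303
```

The choice of F0.5 over F1 fits the setting: with only 9% positive labels, precision on predicted successes matters more than catching every one.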

Abstract

Predicting startup success from founder career data is hard. The signal is weak, the labels are rare (9%), and most founders who succeed look almost identical to those who fail. We engineer 28 structured features directly from raw JSON fields -- jobs, education, exits -- and combine them with a deterministic rule layer and XGBoost boosted stumps. Our model achieves Val F0.5 = 0.3030, Precision = 0.3333, Recall = 0.2222 -- a +17.7pp improvement over the zero-shot LLM baseline. We then run a controlled experiment: extract 9 features from the prose field using Claude Haiku, at 67% and 100% dataset coverage. LLM features capture 26.4% of model importance but add zero CV signal (delta = -0.05pp). The reason is structural: anonymised_prose is generated from the same JSON fields we parse directly -- it is a lossy re-encoding, not a richer source. The ceiling (CV ~= 0.25, Val ~= 0.30) reflects the information content of this dataset, not a modeling limitation. In characterizing where the signal runs out and why, this work functions as a benchmark diagnostic -- one that points directly to what a richer dataset would need to include.
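The abstract does not spell out how the deterministic rule layer and the boosted-stump score are wired together. One plausible arrangement, sketched below with entirely hypothetical rule names and thresholds (`num_prior_exits`, `num_jobs`, `num_degrees`, and the 0.5 cutoff are illustrative assumptions, not from the paper), is to let hard rules override and otherwise threshold the model score:

```python
# Hypothetical sketch: deterministic rules fire first; the boosted-stump
# probability is only consulted when no rule applies. All feature names
# and thresholds here are illustrative assumptions.

def rule_layer(features):
    """Return 1/0 when a hard rule fires, or None to defer to the model."""
    if features.get("num_prior_exits", 0) >= 1:
        return 1   # assumed strong positive rule: founder has a prior exit
    if features.get("num_jobs", 0) == 0 and features.get("num_degrees", 0) == 0:
        return 0   # assumed negative rule: no career data at all
    return None

def predict(features, model_score, threshold=0.5):
    """Combine the rule layer with a thresholded model probability."""
    decision = rule_layer(features)
    if decision is not None:
        return decision
    return int(model_score >= threshold)

print(predict({"num_prior_exits": 2}, model_score=0.1))  # → 1 (rule overrides)
print(predict({"num_jobs": 3}, model_score=0.7))         # → 1 (model decides)
```

A layered design like this keeps the handful of high-precision rules auditable while the stump ensemble handles the ambiguous majority of profiles.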