When Career Data Runs Out: Structured Feature Engineering and Signal Limits for Founder Success Prediction

arXiv cs.LG / 4/2/2026


Key Points

  • The paper studies founder success prediction using limited and weak career data signals, noting that labels are rare (9%) and successful vs. failed founders can look highly similar.
  • It builds 28 structured, JSON-derived features (e.g., jobs, education, and exits) and combines a deterministic rule layer with XGBoost boosted stumps, outperforming a zero-shot LLM baseline with Val F0.5 = 0.3030.
  • A controlled experiment compares LLM-extracted features from the prose field (via Claude Haiku) at 67% and 100% dataset coverage, finding that these features capture 26.4% of model importance but add no cross-validation signal (delta = -0.05pp).
  • The authors attribute the lack of gain to structural information loss: anonymized prose is a lossy re-encoding of the same JSON fields, so it does not introduce genuinely new signal.
  • They conclude that observed performance ceilings (CV ≈ 0.25, Val ≈ 0.30) reflect the dataset’s information content rather than model inadequacy, positioning the work as a benchmark diagnostic for what future, richer datasets must include.
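The reported validation metrics are internally consistent, which is easy to check: the F-beta score with beta = 0.5 weights precision more heavily than recall, and plugging in the reported Precision = 0.3333 and Recall = 0.2222 reproduces the reported F0.5 of 0.3030. A minimal check:

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more than recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Reported validation metrics from the abstract.
p, r = 0.3333, 0.2222
print(round(f_beta(p, r), 4))  # → 0.303
```

The choice of F0.5 over F1 fits the setting: with only 9% positive labels, precision on predicted successes matters more than catching every one.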

Abstract

Predicting startup success from founder career data is hard. The signal is weak, the labels are rare (9%), and most founders who succeed look almost identical to those who fail. We engineer 28 structured features directly from raw JSON fields -- jobs, education, exits -- and combine them with a deterministic rule layer and XGBoost boosted stumps. Our model achieves Val F0.5 = 0.3030, Precision = 0.3333, Recall = 0.2222 -- a +17.7pp improvement over the zero-shot LLM baseline. We then run a controlled experiment: extract 9 features from the prose field using Claude Haiku, at 67% and 100% dataset coverage. LLM features capture 26.4% of model importance but add zero CV signal (delta = -0.05pp). The reason is structural: anonymised_prose is generated from the same JSON fields we parse directly -- it is a lossy re-encoding, not a richer source. The ceiling (CV ~= 0.25, Val ~= 0.30) reflects the information content of this dataset, not a modeling limitation. In characterizing where the signal runs out and why, this work functions as a benchmark diagnostic -- one that points directly to what a richer dataset would need to include.
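The abstract does not spell out how the deterministic rule layer and the boosted-stump score are wired together. One plausible arrangement, sketched below with entirely hypothetical rule names and thresholds (`num_prior_exits`, `num_jobs`, `num_degrees`, and the 0.5 cutoff are illustrative assumptions, not from the paper), is to let hard rules override and otherwise threshold the model score:

```python
# Hypothetical sketch: deterministic rules fire first; the boosted-stump
# probability is only consulted when no rule applies. All feature names
# and thresholds here are illustrative assumptions.

def rule_layer(features):
    """Return 1/0 when a hard rule fires, or None to defer to the model."""
    if features.get("num_prior_exits", 0) >= 1:
        return 1   # assumed strong positive rule: founder has a prior exit
    if features.get("num_jobs", 0) == 0 and features.get("num_degrees", 0) == 0:
        return 0   # assumed negative rule: no career data at all
    return None

def predict(features, model_score, threshold=0.5):
    """Combine the rule layer with a thresholded model probability."""
    decision = rule_layer(features)
    if decision is not None:
        return decision
    return int(model_score >= threshold)

print(predict({"num_prior_exits": 2}, model_score=0.1))  # → 1 (rule overrides)
print(predict({"num_jobs": 3}, model_score=0.7))         # → 1 (model decides)
```

A layered design like this keeps the handful of high-precision rules auditable while the stump ensemble handles the ambiguous majority of profiles.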