RFX-Fuse: Breiman and Cutler's Unified ML Engine + Native Explainable Similarity

arXiv cs.LG / 3/17/2026

Key Points

  • RFX-Fuse is presented as Breiman and Cutler's unified random forest engine that supports classification, regression, unsupervised learning, proximity-based similarity, outlier detection, missing value imputation, and visualization within a single model object.
  • It offers native explainable similarity through proximity-based measures, introducing Proximity Importance to explain why samples are considered similar.
  • It introduces dataset-specific imputation validation that ranks imputation methods by how realistic the imputed data appears, without ground-truth labels.
  • The engine provides native GPU/CPU support and aims to replace multiple separate tools (e.g., XGBoost, FAISS, SHAP, Isolation Forest) with one unified framework.
  • The work is framed as reviving Breiman and Cutler's original vision of a unified ML engine, contrasting with current libraries that split functionality across many tools.

Abstract

Breiman and Cutler's original Random Forest was designed as a unified ML engine -- not merely an ensemble predictor. Their implementation included classification, regression, unsupervised learning, proximity-based similarity, outlier detection, missing value imputation, and visualization -- capabilities that modern libraries like scikit-learn never implemented. RFX-Fuse (Random Forests X [X=compression] -- Forest Unified Learning and Similarity Engine) delivers Breiman and Cutler's complete vision with native GPU/CPU support. Modern ML pipelines require five or more separate tools -- XGBoost for prediction, FAISS for similarity, SHAP for explanations, Isolation Forest for outliers, custom code for importance. RFX-Fuse provides an alternative built from one or two model objects -- a single set of trees grown once. Novel contributions: (1) Proximity Importance -- native explainable similarity: proximity measures how similar two samples are; Proximity Importance explains why. (2) Dataset-specific imputation validation for general tabular data -- ranking imputation methods by how realistic the imputed data looks, without ground-truth labels.
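
To make the idea of proximity-based similarity concrete, here is a minimal sketch of the classic random forest proximity: the fraction of trees in which two samples fall into the same leaf. It does not use RFX-Fuse or its API, which this summary does not describe. As an assumption for illustration, it builds the matrix from scikit-learn's `RandomForestClassifier.apply`, and the helper name `proximity_matrix` is hypothetical. It also counts every tree for every sample instead of restricting to out-of-bag trees, and it does not attempt Proximity Importance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def proximity_matrix(forest, X):
    """Fraction of trees in which each pair of samples shares a leaf.

    Hypothetical helper for illustration; not part of RFX-Fuse or scikit-learn.
    """
    # apply() returns, for every sample, the index of the leaf it reaches
    # in each tree: shape (n_samples, n_trees).
    leaves = forest.apply(X)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        column = leaves[:, t]
        # Pairwise comparison: True where two samples land in the same leaf.
        prox += column[:, None] == column[None, :]
    return prox / n_trees


X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
prox = proximity_matrix(forest, X)

# Indices of the samples with the highest proximity to sample 0.
most_similar = np.argsort(-prox[0])[:5]
```

The same fitted trees serve both prediction (`forest.predict`) and similarity (`prox`), which is the "grown once, used for several tasks" pattern the abstract refers to. The matrix here is dense and quadratic in the number of samples, so this form only suits small datasets.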