Spectral Coherence Index: A Model-Free Metric for Protein Structural Ensemble Quality Assessment

arXiv cs.AI / 3/30/2026


Key Points

  • The paper introduces the Spectral Coherence Index (SCI), a model-free, rotation-invariant metric intended to assess the quality of protein structural ensembles derived from NMR by distinguishing coordinated conformational motion from noise-like artifacts.
  • SCI is computed from the participation-ratio effective rank of an inter-model pairwise distance-variance matrix and is evaluated on the Main110 NMR ensemble cohort of 110 proteins with 10–30 models per entry.
  • On Main110, SCI strongly separates experimental ensembles from matched synthetic incoherent controls, achieving AUC-ROC = 0.973 and Cliff's δ = -0.945; the primary operating threshold τ = 0.811 yields 95.5% sensitivity and 89.1% specificity.
  • Discrimination softened modestly relative to an earlier internal 27-protein pilot, so pilot-era thresholds do not transfer perfectly to the larger, more heterogeneous cohort. Even so, PDB-level sensitivity stayed nearly unchanged (AUC = 0.972) and an independent 11-protein holdout reached AUC = 0.983, suggesting that the metric generalizes well.
  • The study finds SCI works best as part of a multimetric QC workflow for heterogeneous ensembles: σ_Rg was the stronger single-feature discriminator, and QC-augmented multifeature models with SCI generalized best (AUC = 0.989 and 0.990). Residue-level validation shows concordance with experimental RMSF and GNM-based flexibility patterns.
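
The computation named in the key points (a participation-ratio effective rank of an inter-model pairwise distance-variance matrix) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical. The sketch uses singular values rather than eigenvalues because a raw distance-variance matrix has a zero diagonal, so its eigenvalues sum to zero and a plain eigenvalue participation ratio would collapse. Dividing by the number of residues is likewise only one assumed way to obtain a bounded value; this summary does not state how the effective rank is mapped to the final SCI.

```python
import numpy as np


def distance_variance_matrix(coords: np.ndarray) -> np.ndarray:
    """Variance across models of every residue-residue distance.

    coords has shape (n_models, n_residues, 3), e.g. one atom per residue.
    Distances are internal coordinates, so the result does not depend on
    how each model is rotated or translated.
    """
    diffs = coords[:, :, None, :] - coords[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # (n_models, n_res, n_res)
    return dists.var(axis=0)                 # (n_res, n_res)


def participation_ratio(matrix: np.ndarray) -> float:
    """Effective rank (sum s)^2 / sum(s^2) over singular values s.

    Using singular values is an assumption of this sketch.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    total_sq = float(np.sum(s ** 2))
    if total_sq == 0.0:                      # perfectly rigid ensemble
        return 0.0
    return float(np.sum(s) ** 2 / total_sq)


def normalized_effective_rank(coords: np.ndarray) -> float:
    """Effective rank divided by n_residues, so the value lies in [0, 1].

    This normalization is an assumption, not the paper's SCI definition.
    """
    v = distance_variance_matrix(coords)
    return participation_ratio(v) / v.shape[0]


# Toy ensemble: 12 models of a 20-residue chain with small random displacements.
rng = np.random.default_rng(0)
base = np.cumsum(rng.normal(size=(20, 3)), axis=0)
ensemble = base + 0.3 * rng.normal(size=(12, 20, 3))
print(round(normalized_effective_rank(ensemble), 3))
```

Because the matrix is built from distances alone, rotating every model by the same or by different rigid transformations leaves it unchanged, which is the sense in which such a summary is rotation-invariant.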

Abstract

Protein structural ensembles from NMR spectroscopy capture biologically important conformational heterogeneity, but it remains difficult to determine whether observed variation reflects coordinated motion or noise-like artifacts. We evaluate the Spectral Coherence Index (SCI), a model-free, rotation-invariant summary derived from the participation-ratio effective rank of the inter-model pairwise distance-variance matrix. Under grouped primary analysis of a Main110 cohort of 110 NMR ensembles (30–403 residues; 10–30 models per entry), SCI separated experimental ensembles from matched synthetic incoherent controls with AUC-ROC = 0.973 and Cliff's δ = -0.945. Relative to an internal 27-protein pilot, discrimination softened modestly, showing that pilot-era thresholds do not transfer perfectly to a larger, more heterogeneous cohort: the primary operating point τ = 0.811 yielded 95.5% sensitivity and 89.1% specificity. PDB-level sensitivity remained nearly unchanged (AUC = 0.972), and an independent 11-protein holdout reached AUC = 0.983. Across 5-fold grouped stratified cross-validation and leave-one-function-class-out testing, SCI remained strong (AUC = 0.968 and 0.971), although σ_Rg was the stronger single-feature discriminator and a QC-augmented multifeature model generalized best (AUC = 0.989 and 0.990). Residue-level validation linked SCI-derived contributions to experimental RMSF across 110 proteins and showed broad concordance with GNM-based flexibility patterns. Rescue analyses showed that Main110 softening arose mainly from size and ensemble normalization rather than from loss of spectral signal. Together, these results establish SCI as an interpretable, bounded coherence summary that is most useful when embedded in a multimetric QC workflow for heterogeneous protein ensembles.
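
The headline statistics in the abstract (AUC-ROC, Cliff's delta, and sensitivity and specificity at a fixed threshold) are closely related, and a short sketch may make that concrete. The code below is illustrative only: the score values are invented toy numbers, the function names are hypothetical, and the choice to call an ensemble "experimental" when its score falls at or below the threshold is an assumption, since the abstract does not state the direction of the decision rule. For two groups, AUC = (1 + δ) / 2 when δ is computed with the higher-scoring group first; the reported values are consistent with this identity, since (1 + 0.945) / 2 = 0.9725.

```python
import numpy as np


def cliffs_delta(x, y) -> float:
    """P(x > y) - P(x < y), estimated over all cross-group pairs."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return float(np.sign(x - y).mean())


def auc_roc(high_group, low_group) -> float:
    """P(high > low) + 0.5 * P(tie), which equals (1 + delta) / 2."""
    return (1.0 + cliffs_delta(high_group, low_group)) / 2.0


def sensitivity_specificity(positives, negatives, tau, positive_if_below=True):
    """Fraction of positives and of negatives classified correctly at tau.

    positive_if_below is an assumed decision direction, not taken from the paper.
    """
    positives = np.asarray(positives, dtype=float)
    negatives = np.asarray(negatives, dtype=float)
    if positive_if_below:
        hit, reject = positives <= tau, negatives > tau
    else:
        hit, reject = positives >= tau, negatives < tau
    return float(hit.mean()), float(reject.mean())


# Invented toy scores, for illustration only.
experimental = [0.42, 0.55, 0.61, 0.70, 0.83]
controls = [0.79, 0.88, 0.91, 0.95]

delta = cliffs_delta(experimental, controls)
auc = auc_roc(controls, experimental)
sens, spec = sensitivity_specificity(experimental, controls, tau=0.811)
print(delta, auc, sens, spec)
```

Computing AUC through the pairwise comparison keeps the sketch dependency-free; a library routine such as scikit-learn's `roc_auc_score` would give the same value on these inputs.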