Known Intents, New Combinations: Clause-Factorized Decoding for Compositional Multi-Intent Detection

arXiv cs.CL / 4/1/2026

Key Points

  • The paper targets a tougher multi-intent detection setting: recognizing new combinations of known intents rather than only repeating familiar co-occurrence patterns from training data.
  • It introduces the CoMIX-Shift benchmark, which measures compositional generalization through held-out intent pairs, discourse-pattern shift, longer and noisier wrappers, held-out clause templates, and zero-shot intent triples.
  • It proposes ClauseCompose, a lightweight decoder trained only on singleton intents, which reaches 95.7 exact match on unseen intent pairs and 91.1 on zero-shot triples, though it drops to 62.5 on longer/noisier wrappers and 49.8 on held-out templates.
  • In head-to-head comparisons, ClauseCompose substantially outperforms the whole-utterance baselines (WholeMultiLabel and a fine-tuned tiny BERT), especially on held-out intent pairs and under template or connector shift.
  • The authors conclude that multi-intent detection research and evaluation should include more compositional tests, where simple factorized decoding can be surprisingly effective.
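
To make the "held-out intent pairs" idea concrete, here is a minimal sketch of a compositional split. It is an illustration only, not the CoMIX-Shift construction procedure: the function name `split_by_held_out_pairs`, the toy utterances, and the SNIPS-like intent labels are all assumptions made for this example. Any utterance whose label set contains a held-out pair goes to the test side, so the model sees each intent during training but never that combination.

```python
from itertools import combinations


def split_by_held_out_pairs(examples, held_out_pairs):
    """Split (utterance, intents) examples so held-out pairs appear only in test.

    Hypothetical helper for illustration; not the benchmark's actual code.
    """
    held = {frozenset(pair) for pair in held_out_pairs}
    train, test = [], []
    for text, intents in examples:
        # Every unordered pair of intents that co-occurs in this utterance.
        pairs = {frozenset(c) for c in combinations(sorted(intents), 2)}
        (test if pairs & held else train).append((text, intents))
    return train, test


# Toy data with illustrative intent names (assumed, not from the paper).
examples = [
    ("play jazz", {"PlayMusic"}),
    ("what's the weather", {"GetWeather"}),
    ("book a table", {"BookRestaurant"}),
    ("play jazz and book a table", {"PlayMusic", "BookRestaurant"}),
    ("play jazz and tell me the weather", {"PlayMusic", "GetWeather"}),
]

train, test = split_by_held_out_pairs(examples, [("PlayMusic", "GetWeather")])
train_intents = set().union(*(intents for _, intents in train))
test_intents = set().union(*(intents for _, intents in test))
```

In this toy split, `test_intents` is a subset of `train_intents`: the test side contains only known intents in a combination that training never showed, which is the setting the key points describe.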

Abstract

Multi-intent detection papers usually ask whether a model can recover multiple intents from one utterance. We ask a harder and, for deployment, more useful question: can it recover new combinations of familiar intents? Existing benchmarks only weakly test this, because train and test often share the same broad co-occurrence patterns. We introduce CoMIX-Shift, a controlled benchmark built to stress compositional generalization in multi-intent detection through held-out intent pairs, discourse-pattern shift, longer and noisier wrappers, held-out clause templates, and zero-shot triples. We also present ClauseCompose, a lightweight decoder trained only on singleton intents, and compare it to whole-utterance baselines including a fine-tuned tiny BERT model. Across three random seeds, ClauseCompose reaches 95.7 exact match on unseen intent pairs, 93.9 on discourse-shifted pairs, 62.5 on longer/noisier pairs, 49.8 on held-out templates, and 91.1 on unseen triples. WholeMultiLabel reaches 81.4, 55.7, 18.8, 15.5, and 0.0; the BERT baseline reaches 91.5, 77.6, 48.9, 11.0, and 0.0. We also add a 240-example manually authored SNIPS-style compositional set with five held-out pairs; there, ClauseCompose reaches 97.5 exact match on unseen pairs and 86.7 under connector shift, compared with 41.3 and 10.4 for WholeMultiLabel. The results suggest that multi-intent detection needs more compositional evaluation, and that simple factorization goes surprisingly far once evaluation asks for it.
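
The abstract does not spell out how ClauseCompose segments or scores clauses, so the following is only a rough sketch of the general idea of clause-factorized decoding, together with the exact-match metric the reported numbers use. Everything here is assumed for illustration: the connector list in `CONNECTORS`, the keyword table `KEYWORDS` (a stand-in for a classifier trained on singleton intents), and the intent names.

```python
import re

# Assumed connector inventory; the paper's own segmentation may differ.
CONNECTORS = re.compile(
    r"\s*(?:,|;|\band then\b|\band also\b|\band\b|\bthen\b|\balso\b)\s*",
    re.IGNORECASE,
)

# Keyword lookup standing in for a single-intent classifier (hypothetical).
KEYWORDS = {
    "PlayMusic": ("play",),
    "GetWeather": ("weather", "forecast"),
    "BookRestaurant": ("book", "reserve"),
}


def classify_clause(clause):
    """Return one intent for a clause, or None if no cue matches."""
    lowered = clause.lower()
    for intent, cues in KEYWORDS.items():
        if any(cue in lowered for cue in cues):
            return intent
    return None


def decode(utterance):
    """Split into clauses, label each independently, and union the labels."""
    clauses = [c for c in CONNECTORS.split(utterance) if c.strip()]
    return {i for i in map(classify_clause, clauses) if i is not None}


def exact_match(predictions, golds):
    """Percentage of utterances whose predicted intent set equals the gold set."""
    hits = sum(set(p) == set(g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)


triple = decode("play some jazz, then book a table and also tell me the forecast")
score = exact_match(
    [{"PlayMusic", "GetWeather"}, {"PlayMusic"}],
    [{"PlayMusic", "GetWeather"}, {"PlayMusic", "BookRestaurant"}],
)
```

Because each clause is labeled on its own, the decoder has no notion of which intents co-occurred in training, which is one plausible reading of why a factorized approach can handle unseen pairs and triples. Exact match is strict: a prediction that misses one intent of a pair counts as a full miss, as in the second example above.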