GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution

arXiv cs.CL / 4/21/2026


Key Points

  • The paper introduces GoCoMA, a multimodal framework to attribute generated code to the specific LLM source, addressing forensic needs created by LLMs producing human-like code.
  • GoCoMA combines two complementary signals: code stylometry (structural and stylistic patterns) and image representations of binary pre-executable artifacts that capture lower-level, execution-oriented byte semantics.
  • It embeds each modality in a hyperbolic Poincaré ball and uses a geodesic-cosine similarity-based cross-modal attention (GCSA) mechanism to fuse modalities effectively.
  • The fused hyperbolic representation is mapped back to Euclidean space for final LLM-source attribution, a design that yields consistent gains over unimodal and Euclidean multimodal baselines.
  • Experiments on the CoDET-M4 and LLMAuthorBench open-source benchmarks show GoCoMA achieves stronger performance under the same evaluation protocols.

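The core geometric machinery in the bullet points (projecting modality embeddings into a Poincaré ball and measuring geodesic distance between them) can be sketched with standard formulas; the function names and curvature choice below are illustrative, not taken from the paper:

```python
import numpy as np

def exp_map_origin(v, c=1.0, eps=1e-7):
    """Project a Euclidean embedding v into the Poincare ball of
    curvature -c via the exponential map at the origin."""
    sqrt_c = np.sqrt(c)
    norm = max(np.linalg.norm(v), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_distance(x, y, c=1.0, eps=1e-7):
    """Geodesic distance between two points inside the Poincare ball."""
    diff2 = np.sum((x - y) ** 2)
    denom = (1 - c * np.sum(x ** 2)) * (1 - c * np.sum(y ** 2))
    return np.arccosh(1 + 2 * c * diff2 / max(denom, eps)) / np.sqrt(c)

# Toy stand-ins for the two modality embeddings
style_emb = exp_map_origin(np.array([0.5, -0.2, 0.1]))  # code stylometry
bpea_emb = exp_map_origin(np.array([0.4, -0.1, 0.3]))   # BPEA image features
d = poincare_distance(style_emb, bpea_emb)
```

The exponential map guarantees both points land strictly inside the unit ball, so the geodesic distance is well defined; hierarchical structure is then reflected in how fast distances grow near the boundary.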
Abstract

Large Language Models (LLMs) trained on massive code corpora are now increasingly capable of generating code that is hard to distinguish from human-written code. This raises practical concerns, including security vulnerabilities and licensing ambiguity, and also motivates a forensic question: 'Who (or which LLM) wrote this piece of code?' We present GoCoMA, a multimodal framework that models an extrinsic hierarchy between (i) code stylometry, capturing higher-level structural and stylistic signatures, and (ii) image representations of binary pre-executable artifacts (BPEA), capturing lower-level, execution-oriented byte semantics shaped by compilation and toolchains. GoCoMA projects modality embeddings into a hyperbolic Poincaré ball, fuses them via a geodesic-cosine similarity-based cross-modal attention (GCSA) fusion mechanism, and back-projects the fused representation to Euclidean space for final LLM-source attribution. Experiments on two open-source benchmarks (CoDET-M4 and LLMAuthorBench) show that GoCoMA consistently outperforms unimodal and Euclidean multimodal baselines under identical evaluation protocols.
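The abstract's fusion-then-back-projection pipeline can be illustrated with a minimal sketch. The exact GCSA formulation is not given in this summary, so the sketch below substitutes negative Poincaré geodesic distance (softmax-normalized) for the geodesic-cosine similarity score, and aggregates values in the tangent space at the origin via the logarithmic map, which doubles as the back-projection to Euclidean space. All names and the temperature `tau` are assumptions for illustration:

```python
import numpy as np

def log_map_origin(y, c=1.0, eps=1e-7):
    """Map a Poincare-ball point back to the Euclidean tangent space
    at the origin (the back-projection step)."""
    sqrt_c = np.sqrt(c)
    norm = max(np.linalg.norm(y), eps)
    return np.arctanh(min(sqrt_c * norm, 1 - eps)) * y / (sqrt_c * norm)

def poincare_distance(x, y, c=1.0, eps=1e-7):
    diff2 = np.sum((x - y) ** 2)
    denom = (1 - c * np.sum(x ** 2)) * (1 - c * np.sum(y ** 2))
    return np.arccosh(1 + 2 * c * diff2 / max(denom, eps)) / np.sqrt(c)

def gcsa_fuse(queries, keys, values, tau=1.0):
    """Cross-modal attention with geodesic-distance-based scores
    (a stand-in for the paper's geodesic-cosine similarity)."""
    scores = np.array([[-poincare_distance(q, k) / tau for k in keys]
                       for q in queries])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Aggregate log-mapped values: the output is already Euclidean.
    tangent_vals = np.stack([log_map_origin(v) for v in values])
    return weights @ tangent_vals

# Toy points already inside the unit ball (norms < 1)
style = np.array([[0.3, 0.1], [0.2, -0.2]])   # stylometry tokens
bpea = np.array([[0.25, 0.15], [-0.1, 0.3]])  # BPEA tokens
fused = gcsa_fuse(style, bpea, bpea)          # Euclidean fused features
```

Doing the weighted aggregation in the tangent space is one common way to avoid Möbius (gyrovector) arithmetic while keeping distances hyperbolic; whether GoCoMA uses this or a fully Möbius formulation is not stated in the summary.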