Paper: https://arxiv.org/abs/2603.27538 | Code: https://github.com/meituan-longcat/LongCat-Next | Blog: https://longcat.chat/longcat-next/intro | Model: https://huggingface.co/meituan-longcat/LongCat-Next | MIT License: https://huggingface.co/meituan-longcat/LongCat-Next/blob/main/LICENSE
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Reddit r/LocalLLaMA / 3/31/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Ideas & Deep Analysis · Models & Research
Key Points
- The paper proposes DiNA (Discrete Native Autoregressive), a unified framework that represents multimodal inputs in a shared discrete token space, enabling consistent autoregressive modeling across text, vision, and audio (a toy sketch of this shared token space follows the list).
- It introduces dNaViT, a “discrete native any-resolution” visual tokenizer/decoder that converts continuous images at arbitrary resolutions into hierarchical discrete tokens (sketched below).
- Building on this framework, the authors develop LongCat-Next, which they claim delivers strong “see, paint, and talk” performance from a single autoregressive objective with minimal modality-specific engineering.
- The work targets the known limitations of discrete vision modeling on understanding tasks and frames LongCat-Next as a way to reconcile understanding and generation within one unified multimodal model.
- The authors open-source the LongCat-Next model and tokenizers, aiming to accelerate further research and development in native multimodality.
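For concreteness, here is a minimal, hypothetical sketch of what a "discrete native any-resolution" tokenizer could look like: patchify an image at its native resolution, then vector-quantize the patch features against a stack of codebooks to get coarse-to-fine discrete tokens. Every name and number below (ToyDNaViT, the patch size, codebook sizes, and the residual-quantization scheme) is an assumption for illustration, not dNaViT's actual design.

```python
# Hypothetical sketch of a discrete any-resolution visual tokenizer.
# ToyDNaViT and all hyperparameters are illustrative, not the paper's design.
import torch
import torch.nn.functional as F

class ToyDNaViT:
    """Maps an image at its native resolution to hierarchical discrete tokens."""

    def __init__(self, patch=16, dim=64, codebook_size=8192, levels=2, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.patch = patch
        # One VQ codebook per hierarchy level (coarse -> fine), random here for the toy.
        self.codebooks = [torch.randn(codebook_size, dim, generator=g) for _ in range(levels)]
        self.proj = torch.randn(3 * patch * patch, dim, generator=g)

    def encode(self, image: torch.Tensor) -> list[torch.Tensor]:
        """image: (3, H, W) with H and W any multiple of the patch size."""
        c, h, w = image.shape
        assert h % self.patch == 0 and w % self.patch == 0, "pad to a patch multiple first"
        # Split into non-overlapping patches, then project to the codebook dimension.
        patches = F.unfold(image.unsqueeze(0), self.patch, stride=self.patch)  # (1, 3*p*p, N)
        feats = patches.squeeze(0).t() @ self.proj                             # (N, dim)
        tokens, residual = [], feats
        for cb in self.codebooks:
            # Nearest-neighbor vector quantization of the current residual.
            ids = torch.cdist(residual, cb).argmin(dim=-1)                     # (N,)
            tokens.append(ids)
            residual = residual - cb[ids]                                      # refine at next level
        return tokens  # list of (N,) LongTensors, coarse level first

tok = ToyDNaViT()
img = torch.rand(3, 224, 352)          # non-square "any resolution" input
print([t.shape for t in tok.encode(img)])
```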

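And a similarly hypothetical sketch of the shared discrete token space: if visual token ids are simply offset past the text vocabulary, a single next-token cross-entropy loss covers both modalities. The vocabulary sizes, the offset layout, and the helper names are assumptions, not LongCat-Next's actual configuration.

```python
# Hypothetical sketch of a shared discrete token space for text and vision.
# Vocabulary sizes and the id-offset scheme are assumptions for illustration.
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32000
IMAGE_VOCAB = 8192
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_VOCAB

def to_unified_ids(text_ids: torch.Tensor, image_ids: torch.Tensor) -> torch.Tensor:
    """Concatenate text and image tokens into one sequence over the unified vocabulary."""
    return torch.cat([text_ids, image_ids + TEXT_VOCAB])  # image ids shifted past text ids

def next_token_loss(logits: torch.Tensor, sequence: torch.Tensor) -> torch.Tensor:
    """Single autoregressive objective: predict token t+1 from tokens <= t, any modality."""
    return F.cross_entropy(logits[:-1], sequence[1:])

# Toy usage with random data standing in for a real model's outputs.
text_ids = torch.randint(0, TEXT_VOCAB, (12,))
image_ids = torch.randint(0, IMAGE_VOCAB, (308,))    # e.g. ids from a visual tokenizer
seq = to_unified_ids(text_ids, image_ids)
logits = torch.randn(seq.numel(), UNIFIED_VOCAB)     # stand-in for transformer logits
print(float(next_token_loss(logits, seq)))
```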


