NEO-unify — A 2B multimodal model with no Vision Encoder, no VAE. Open source coming "hopefully not too long"

Reddit r/LocalLLaMA / 4/14/2026

💬 Opinion | Signals & Early Trends | Models & Research


Key Points

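SenseTime (a Chinese AI lab) has reportedly published details of NEO-unify, a 2B-parameter multimodal model that takes raw pixels in and produces raw pixels out, with no vision encoder and no VAE.
The model uses a single Transformer backbone (MoT: Mixture of Transformer) for both text understanding and image generation, reportedly trained with flow matching for images and autoregressively for text.
The reported numbers include image reconstruction quality (PSNR 31.56) close to an existing VAE-based approach (e.g. Flux's 32.65) after only 90K pretraining steps, and better data efficiency than an existing model (Bagel).
Image editing is shown to potentially work even with the understanding branch frozen, and the reduced encoder dependency, which lowers the barrier to running the model locally, is also called out as a point of interest.
However, the model is not released yet: the open-source release and a detailed technical report are described as hopefully coming before long ("not too long"), with updates to be posted on the HF page.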


SenseTime (the Chinese AI lab) just published details on NEO-unify, a multimodal model that throws out the vision encoder AND the VAE. Just raw pixels in, raw pixels out.

The quick rundown:

No CLIP, no SigLIP, no VAE — it processes pixel inputs natively
2B parameter model, single unified Transformer backbone (they call it MoT — Mixture of Transformer) handles both understanding and image generation
Trained with flow matching for image generation, autoregressive for text — all in one model
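To make the "flow matching for images, autoregressive for text" split concrete, here is a minimal sketch of what a combined objective could look like. The post does not say which flow matching variant NEO-unify uses, so the linear noise-to-data path below is an assumption, and every name here (flow_matching_pair, joint_loss, image_weight) is hypothetical and not taken from the team's code.

```python
import math
import random

def flow_matching_pair(noise, data, t):
    """Linear-path interpolation (an assumed variant, not confirmed for NEO-unify).

    Returns the noisy sample x_t and the velocity the model should predict.
    """
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, data)]
    v_target = [d - n for n, d in zip(noise, data)]
    return x_t, v_target

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def cross_entropy(logits, target_index):
    """Next-token cross-entropy for one position, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

def joint_loss(pred_velocity, v_target, text_logits, next_token, image_weight=1.0):
    """Autoregressive text loss plus a weighted flow matching image loss.

    image_weight is a hypothetical balancing knob.
    """
    text_term = cross_entropy(text_logits, next_token)
    image_term = mse(pred_velocity, v_target)
    return text_term + image_weight * image_term

# Toy example: four "pixels" scaled to [-1, 1] and a four-token vocabulary.
rng = random.Random(0)
pixels = [0.2, -0.5, 0.9, 0.0]
noise = [rng.gauss(0.0, 1.0) for _ in pixels]
x_t, v_target = flow_matching_pair(noise, pixels, t=0.3)

# A perfect velocity prediction leaves only the text term.
loss = joint_loss(v_target, v_target, [0.0, 0.0, 0.0, 0.0], next_token=2)
print(f"joint loss with perfect velocity prediction: {loss:.4f}")
```

In a real model both terms would come from the same Transformer forward pass over interleaved text and pixel tokens. This toy version only shows how the two losses are formed and added.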

Numbers that caught my attention:

Image reconstruction quality (PSNR 31.56) is already close to Flux's VAE (32.65) at only 90K pretraining steps
Beats Bagel on data efficiency (same benchmark, fewer tokens)
Image editing works even with the understanding branch completely frozen
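For context on the PSNR figures, here is a small sketch of how PSNR is conventionally computed and what the gap between 31.56 and 32.65 means in terms of mean squared error. It assumes both numbers are in dB and were measured with the same peak value on comparable test images, which the post does not confirm. The helper names are hypothetical.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio(psnr_low_db, psnr_high_db):
    """How many times larger the MSE is at the lower PSNR (same peak value)."""
    return 10 ** ((psnr_high_db - psnr_low_db) / 10)

# Figures quoted in the post: NEO-unify at 90K steps vs. Flux's VAE.
ratio = mse_ratio(31.56, 32.65)
print(f"MSE at 31.56 dB is {ratio:.2f}x the MSE at 32.65 dB")
```

Because PSNR is logarithmic, a gap of about 1.1 dB corresponds to roughly 1.29 times the mean squared error, so "close" here means the same ballpark, not parity.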

The bad news: it's not released yet. A comment from a team member says they're "actively preparing for open source as well as a detailed tech report."

For a 2B model with no encoder dependencies, this could be interesting to run locally — lighter dependency stack than most multimodal setups.

Keeping an eye on their HF page: https://huggingface.co/blog/sensenova/neo-unify

Got the Discord server invitation code: https://discord.gg/vh5SE45D8b

Anyone else tracking encoder-free multimodal models? Feels like this direction (Chameleon, VILA-U, now NEO-unify) is picking up steam.

submitted by /u/Few-Personality6088