Falcon Perception

Hugging Face Blog / 4/1/2026



Team Article Published April 1, 2026

Falcon Logo

TL;DR: Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches and text in one sequence using a hybrid attention mask, and produces a variable number of instances through a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3), with the main remaining gap being presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and by density (crowded, long-context scenes).

We also release Falcon OCR, a 0.3B-parameter model that reaches scores of 80.3 and 88.6 on the olmOCR benchmark and OmniDocBench, respectively, while delivering the highest throughput of any open-source OCR model.

This post is a brief, practical write-up of what we built, why we built it this way, and what we learned along the way.

Tech Report   GitHub   Playground

   PBench   OCR Model   Perception Model


The problem: why do perception systems end up as pipelines?

Many open-vocabulary perception systems are built as modular pipelines: a (often frozen) vision backbone extracts features, a separate fusion/decoder stage combines them with language, and additional components handle matching and post-processing. This family of designs works well in many settings, but it comes with trade-offs: it can be hard to scale cleanly, hard to attribute improvements to the right component, and easy to accumulate complexity as we add a new fix for each failure mode.

We asked a simpler question: can a single early-fusion Transformer backbone handle both perception and language modeling, if we choose the right attention pattern, output interface, and training signal?

In our experiments, the answer is largely yes. The rest of this post describes the main design choices and the evidence behind them.


The architecture: early fusion, hybrid attention, and an efficient dense interface

falcon_inference

A single autoregressive Transformer processes a unified sequence of image patches, text, and task tokens. The model predicts object properties in a fixed order: <coord><size><seg>. Bounding box coordinates and sizes are decoded via specialized heads and re-injected as Fourier features. High-resolution segmentation masks are generated by a dot product between the <seg> token and upsampled image features.

One Backbone, Two Behaviors

At its core, Falcon Perception is a dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. Instead of a separate vision backbone followed by a late-fusion decoder, we keep a single backbone and rely on masking and a lightweight output interface to make the dense prediction problem tractable.

Images and text have different structure: pixels are 2D and benefit from bidirectional context, while the prediction interface is naturally sequential. We address this with a hybrid attention mask:

  • Image tokens attend to all other image tokens bidirectionally, building a global visual context (like a vision encoder would).
  • Text and task tokens attend causally to everything before them — the full visual prefix plus preceding text.

This allows the same backbone to behave like a bidirectional visual encoder on image tokens, while still supporting autoregressive prediction over task tokens.
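The hybrid mask above can be sketched in a few lines. This is a minimal illustration (token ordering, sizes, and names are assumptions for the example), not the released implementation:

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean attention mask (True = query may attend to key).

    Image tokens come first in the sequence and attend to each other
    bidirectionally; text/task tokens attend causally to the full image
    prefix plus all preceding text tokens.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:n_image, :n_image] = True
    # Text/task block: causal attention over everything up to and including self.
    for q in range(n_image, n):
        mask[q, : q + 1] = True
    return mask
```

In practice this kind of block-structured mask is exactly what PyTorch's FlexAttention is designed to express efficiently, which is also what the released inference stack builds on.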

Chain-of-Perception: coarse-to-fine supervision for dense outputs

Dense perception is not a fixed-size prediction problem: an image may contain zero instances or hundreds. Autoregressive generation gives a clean variable-length interface, but fully autoregressive dense generation (e.g., polygons or high-resolution masks token-by-token) quickly becomes expensive.

We use a small structured interface, Chain-of-Perception, which decomposes each instance into three steps:

<coord> → <size> → <seg>
  1. Coordinate token: The model first predicts the center of the instance — resolving which object it's talking about.
  2. Size token: Then the spatial extent — resolving how big it is.
  3. Segmentation token: Finally, a single embedding that, when dot-producted with upsampled image features, produces a full-resolution binary mask.

This ordering is deliberate. Committing to geometry first reduces ambiguity (“which instance?”), and makes the mask prediction step closer to pixel refinement conditioned on the resolved object.
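The structured interface can be pictured as a tiny grammar over per-instance steps. The sketch below only illustrates the fixed `<coord>` → `<size>` → `<seg>` ordering (the class and function names are hypothetical, and real values would come from the specialized heads):

```python
from dataclasses import dataclass

@dataclass
class Instance:
    center: tuple        # from the <coord> token: (cx, cy), normalized
    size: tuple          # from the <size> token: (w, h), normalized
    seg_embedding: list  # from the <seg> token, dotted with image features

def decode_instances(step_stream):
    """Group a flat stream of (tag, value) steps into instances,
    enforcing the fixed <coord> -> <size> -> <seg> ordering."""
    instances, current, expected = [], {}, "coord"
    for tag, value in step_stream:
        if tag != expected:
            raise ValueError(f"expected <{expected}>, got <{tag}>")
        current[tag] = value
        if tag == "coord":
            expected = "size"
        elif tag == "size":
            expected = "seg"
        else:  # <seg> closes the instance
            instances.append(Instance(current["coord"], current["size"], current["seg"]))
            current, expected = {}, "coord"
    return instances
```

Because the stream is variable-length, an image with zero instances simply yields an empty list, and a crowded scene yields hundreds of triples, one after another.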

Specialized Heads, Minimal Overhead

The backbone is shared, while decoding uses lightweight heads tailored to the output type:

  • Coordinate & Size Heads use Fourier feature encoding: mapping continuous coordinates through a random Gaussian projection into a high-dimensional sinusoidal space. This overcomes the spectral bias of neural networks, yielding more precise localization than discrete binning alone. Decoded coordinates are re-injected into the sequence as conditioning for subsequent tokens.

  • Segmentation Head computes a dot product between the <seg> token’s hidden state and content-aware upsampled image features. Because the <seg> token is produced after geometry and has access to early-fused visual context, we can avoid the separate mask-query machinery and Hungarian matching that often appears in decoder-based instance segmentation training.
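Both heads above are simple to sketch. The Fourier mapping is the standard gamma(v) = [sin(2πBv), cos(2πBv)] with a fixed random Gaussian B, and the mask is a per-pixel dot product; the dimensions and the sigma of B here are illustrative assumptions, not the released hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random Gaussian projection, fixed at init. sigma=10 is an assumed value.
B = rng.normal(0.0, 10.0, size=(2, 128))

def fourier_features(xy: np.ndarray) -> np.ndarray:
    """Map normalized (x, y) coordinates into a sinusoidal feature space,
    countering the low-frequency spectral bias of MLPs."""
    proj = 2 * np.pi * xy @ B                                   # (..., 128)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)  # (..., 256)

def segment(seg_token: np.ndarray, up_feats: np.ndarray) -> np.ndarray:
    """Dot the <seg> token's embedding against upsampled per-pixel
    features (H, W, C) to get a full-resolution binary mask."""
    logits = np.einsum("hwc,c->hw", up_feats, seg_token)
    return logits > 0
```

Because the mask is just a dot product against precomputed per-pixel features, there is no per-instance mask decoder to run, which is also what makes the HR feature cache in the inference engine effective.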


PBench: a benchmark designed to isolate what is missing

Existing referring-expression benchmarks like RefCOCO are saturated — models routinely hit 90%+ — and they conflate distinct failure modes. Did the model fail because it can't read text? Can't understand spatial relationships? Can't handle a crowd?

We introduce PBench, a diagnostic benchmark that separates samples by the dominant capability required:

| Level | Capability | Example Prompt |
|-------|------------|----------------|
| L0 | Simple objects | "car" |
| L1 | Attributes & subtypes | "red car", "broken fence" |
| L2 | OCR-guided identification | "Diet Coke bottle", "Nike shoes" |
| L3 | Spatial understanding | "car on the left", "third window from left" |
| L4 | Relations & interactions | "person holding umbrella", "tallest building" |
| Dense | Crowdedness stress test | Hundreds of instances per image |

Each sample targets one dominant capability: OCR prompts avoid spatial qualifiers, and spatial prompts avoid in-image text disambiguators. This yields a capability profile rather than a single opaque score, and makes it easier to decide where to invest next (data, training curriculum, or post-training).


Training: distillation, large-scale data, and a three-stage recipe

Multi-Teacher Distillation

Rather than training from random weights (which in our ablations was unstable for segmentation), Falcon Perception initializes via multi-teacher distillation. Two strong vision teachers contribute complementary signals:

  • DINOv3 (ViT-H): strong local features critical for segmentation
  • SigLIP2: language-aligned features for open-vocabulary understanding

The distilled initialization achieves 74.25% zero-shot accuracy on ImageNet-1k and 85.11% linear-probe mIoU on Pascal VOC, providing a strong visual foundation before perception-specific training.

Data: 54M Images, 195M Positive Expressions, 488M Hard Negatives

We build the training set through a multi-stage pipeline:

  1. Hierarchical clustering of web-scraped images via DINOv3 embeddings to ensure uniform concept coverage.
  2. VLM-driven listing generates dense object descriptions per image, categorized by PBench complexity level (60% basic, 40% advanced).
  3. Negative mining produces semantic, visual, and fine-grained hard negatives to combat hallucination.
  4. Ensemble consensus — SAM 3, Qwen3-VL-30B, and Moondream3 must agree (IoU > 0.8) for automatic acceptance.
  5. Human verification — disagreements go to annotators, recovering hard samples that confuse automated systems.
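The ensemble-consensus step above hinges on an IoU agreement check. A minimal box-IoU sketch (the pipeline may well compute IoU on masks rather than boxes; this is the box version for simplicity):

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2).

    Used here to illustrate the automatic-acceptance rule: teacher
    predictions must agree with IoU > 0.8.
    """
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```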

We maintain a strict 1:1 ratio of positive to negative samples. This makes presence calibration a first-class target: the model should reliably say “absent,” not only draw masks when confident.

The Three Stages (~700 GT Total)

Stage 1 — In-Context Listing (450 GT): The model learns to autoregressively list scene inventories — predicting text expressions and their locations. Full causal attention between queries enables learning of object co-occurrence ("fork, then knife, then plate"). This builds broad scene understanding.

Stage 2 — Task Alignment (225 GT): The attention mask is modified so queries can no longer see each other, simulating independent queries at inference time. Loss on text tokens is masked, focusing gradient signal entirely on presence classification and localization. This stage transitions from "scene understanding" to "answer this specific question."

Stage 3 — Long-Context Finetuning (10 GT): A short phase with the mask limit raised to 600 per expression and a minimal constant learning rate. This adapts the model for extreme crowd density without forgetting earlier capabilities.

Key design choices validated through ablations:

  • Muon optimizer for the specialized heads (vs. AdamW) — yields +4.8 points on SA-Co detection
  • Raster ordering of instances (vs. random/size) — +10 points over random ordering on SA-Co
  • Gram feature regularization — prevents drift from the distillation features, improving segmentation by +1.5 points
  • Global loss normalization across ranks — corrects bias from variable-length packed sequences in FSDP
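The last ablation is worth unpacking: with packed variable-length sequences under FSDP, each rank supervises a different number of tokens, so averaging per-rank means over-weights tokens that land on short-sequence ranks. A toy sketch of the bias and the fix (ranks simulated as lists; in real training the global token count comes from an all-reduce):

```python
def per_rank_mean_then_average(per_rank_token_losses):
    """Biased estimator: local mean per rank, then equal-weight average of ranks."""
    means = [sum(t) / len(t) for t in per_rank_token_losses]
    return sum(means) / len(means)

def globally_normalized(per_rank_token_losses):
    """Unbiased estimator: total loss divided by the *global* token count."""
    total = sum(sum(t) for t in per_rank_token_losses)
    count = sum(len(t) for t in per_rank_token_losses)
    return total / count

# Rank 1 packed far fewer tokens than rank 0.
ranks = [[1.0, 1.0, 1.0, 1.0], [3.0]]
```

On this toy example the biased estimator returns 2.0 while the global mean over all five tokens is 1.4: the single token on the short rank counts four times as much as each token on the long rank.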

Results

SA-Co: Best-in-Class Mask Quality

On the SA-Co open-vocabulary segmentation benchmark, Falcon Perception (0.6B parameters) achieves 68.0 Macro-F1, compared to 62.3 for SAM 3, with large gains on attribute-heavy (+8.2), food & drink (+12.2), and sports equipment (+4.0) splits. At the same time, Falcon Perception lags SAM 3 on presence calibration (MCC: 0.64 vs 0.82), which is the clearest remaining improvement axis.

Here's an example output — the prompt "Falcon" produces precise instance masks:

falcon_demo

Falcon Perception also performs well on referring expressions, correctly segmenting the burger with a black bun in each frame of the video:

burger_output

PBench: Scaling with Prompt Complexity

This is where the early-fusion design shows the largest differences:

| Capability | SAM 3 | Falcon Perception | Gap |
|------------|-------|-------------------|-----|
| L0: Simple objects | 64.3 | 65.1 | +0.8 |
| L1: Attributes | 54.4 | 63.6 | +9.2 |
| L2: OCR-guided | 24.6 | 38.0 | +13.4 |
| L3: Spatial | 31.6 | 53.5 | +21.9 |
| L4: Relations | 33.3 | 49.1 | +15.8 |
| Dense | 58.4 | 72.6 | +14.2 |

On simple objects, the gap is modest. As prompts become more compositional—requiring OCR-guided disambiguation, spatial constraints, or relational binding—the gap widens.

In our PBench Dense split, Falcon Perception (0.6B) substantially outperforms generalist VLM baselines (e.g., 72.6 vs 8.9 for Qwen3-VL-30B in our evaluation setup), and matches or exceeds the 8B model on spatial and relational tiers.

Qualitative Results: OCR, Spatial, Relational, and Dense

As prompts grow more compositional — requiring OCR-guided disambiguation, spatial constraints, relational binding, or scaling to hundreds of instances — the early-fusion advantage becomes visually clear:

  • OCR-Guided Grounding (Level 2): When the distinguishing signal is text written on an object, Falcon Perception reads it correctly while SAM 3 cannot differentiate.
  • Spatial Understanding (Level 3): When prompts specify spatial relationships, Falcon Perception forms a coherent 2D scene map.
  • Relational Reasoning (Level 4): When the target is defined through interactions rather than appearance, Falcon Perception understands the scene graph.
  • Dense Scenes: Scaling to Hundreds of Instances: The autoregressive interface is particularly useful when scenes are extremely crowded, where fixed-query decoders can run into practical limits.
Level 2 — OCR-Guided Grounding: Falcon Perception reads text on objects to disambiguate; SAM 3 cannot.

Level 2: OCR-guided identification — Falcon Perception vs SAM 3

"168 wine bottles": Falcon Perception identifies the bottles labeled "168", while SAM 3 highlights every bottle. "Honolulu direction sign": Falcon reads the text to find the right sign.

Level 3 — Spatial Understanding: Falcon Perception resolves spatial constraints; SAM 3 returns false positives.

Level 3: Spatial understanding — Falcon Perception vs SAM 3

"Lower meat skewer on left grill," "black car to the right of red car at bottom," "Belgian flag on the left" — Falcon Perception resolves the correct instance from spatial constraints. SAM 3 predicts false positives for multiple candidates.

Level 4 — Relational Reasoning: Falcon Perception understands interactions; SAM 3 ignores relational constraints.

Level 4: Relational reasoning — Falcon Perception vs SAM 3

"Pastry next to brown round bread," "person using phone," "person holding helmet in hand" — Falcon Perception identifies the interacting instance. SAM 3 highlights all instances of the object class, ignoring the relational constraint.

Dense Scenes: Falcon Perception scales to hundreds of instances; SAM 3's decoder runs out of query tokens.

Dense split: Falcon Perception scales to hundreds of instances

"Snow goose," "pigeon," "colorful canned drinks" — Falcon Perception autoregressively segments hundreds of instances. SAM 3's fixed-size decoder runs out of query tokens beyond ~200 instances.


Falcon OCR: extending early fusion to document understanding

Modern OCR has moved well beyond extracting text from clean scans. Today's systems must handle multi-column layouts, mathematical formulas, tables, charts, and multilingual content — all in one pass. Most competitive OCR VLMs tackle this with a familiar recipe: a vision encoder feeding a separate text decoder, plus task-specific glue. These systems work, but they tend to be large (1B–3B+ parameters).

We took a different path: reuse the same early-fusion dense Transformer from Falcon Perception, but train a smaller 0.3B-parameter variant from scratch specifically for OCR. The result is Falcon OCR — a single backbone that processes image patches and text tokens in a shared parameter space with the same hybrid attention mask (bidirectional for image tokens, causal for text tokens), and switches tasks through prompts rather than additional modules.

We trained from scratch (no multi-teacher distillation) because the visual features OCR needs — fine-grained glyph recognition, stroke-level discrimination — differ substantially from the object-level features useful for segmentation. Starting fresh lets the backbone develop text-optimized representations from the ground up.

Training

We train on a curated English-language mixture spanning three core tasks: general document text parsing (digital PDFs, old scans, typewritten documents), mathematical and scientific formula recognition, and table structure recognition. The mixture also includes handwriting, real-world scene text, and synthetic samples generated from rendered LaTeX and HTML sources. The training objective is pure next-token prediction on structured text outputs.

Training proceeds in two phases: a long pre-training phase at constant learning rate where the model learns core OCR capabilities across all element types, followed by a short cosine-decay finetuning phase where the learning rate is annealed to near zero.
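The two-phase schedule is easy to express as a function of the step index. A minimal sketch, where `base_lr` and the fraction of steps spent in decay are illustrative placeholders rather than the released hyperparameters:

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          decay_frac: float = 0.1) -> float:
    """Constant LR for most of training, then a short cosine decay to ~0."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < decay_start:
        return base_lr  # long pre-training phase at constant LR
    # Short finetuning phase: cosine-anneal from base_lr down to 0.
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return 0.5 * base_lr * (1 + math.cos(math.pi * min(progress, 1.0)))
```

A constant-then-decay schedule has the practical advantage that the pre-training phase can be extended (or stopped early) without re-planning the whole curve; only the short decay tail is tied to the final step count.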

Benchmark results

We evaluate on olmOCR (binary correctness checks across diverse inputs) and OmniDocBench (continuous metrics over full-page parses). All comparison models are significantly larger and/or use proprietary infrastructure. At 80.3% on olmOCR with only 0.3B parameters, Falcon OCR is within 1.7 points of the top system and leads all models on Multi-Column (87.1%) and Tables (90.3%). On OmniDocBench it scores 88.64 overall, ahead of DeepSeek OCR v2, GPT 5.2, and Mistral OCR 3.

Serving throughput

At 0.3B parameters, Falcon OCR is roughly 3x smaller than 0.9B-class OCR VLMs, which translates directly into higher serving throughput. Measured on a single A100-80GB with vLLM at high concurrency:

| Mode | tok/s | img/s | Description |
|------|-------|-------|-------------|
| Layout + OCR | 5,825 | 2.9 | Full pipeline: layout detection → crop → per-region OCR |

The compact footprint and vLLM integration (continuous batching, PagedAttention, optimized CUDA kernels) make it practical for large-scale document digitization where millions of pages need processing.

What we see in the results

More broadly, these results suggest that the early-fusion single-stack Transformer is a viable alternative to the "vision encoder plus text decoder" recipe for OCR. One backbone, shared parameter space, one decoding interface, and better data and training signals rather than increasingly complex pipelines. We hope this encourages more work in this direction.

Qualitative examples

Falcon OCR processes images captured under challenging real-world conditions with varying lighting, diverse text semantics (mathematical formulae, structured tables, handwritten notes), and complex document layouts, to produce structured text output.


Handwriting and Real-world Images: Accurate transcription of handwritten text and in-the-wild captures under adverse conditions.

Falcon OCR: handwriting and real-world image transcription

Falcon OCR extracts text from handwritten documents and real-world photographs with variable lighting, orientation, and content complexity.

Table Extraction: Faithful reproduction of tabular structure and cell content across diverse formats.

Falcon OCR: table extraction from documents

Falcon OCR accurately reproduces cell entries and structural layout from tables of varying formats and complexity.

Mathematical Formulae: Accurate recognition of equations across varying levels of symbolic complexity.

Falcon OCR: mathematical formula recognition

Falcon OCR correctly transcribes mathematical expressions ranging from simple equations to multi-line derivations with nested operators.

Complex Document Layouts: Faithful text extraction from multi-column, mixed-content documents.

Falcon OCR: complex document layout extraction

Falcon OCR preserves reading order and structural fidelity when extracting text from documents with multi-column layouts, figures, and footnotes.


Inference: Fast, Practical, and Open

The release includes an inference stack built on PyTorch’s FlexAttention, which makes it practical to express the custom attention patterns and efficiently serve packed variable-length sequences.

Paged Inference Engine

  • Paged KV cache with virtual page tables (no wasted memory from padding)
  • Continuous batching: new sequences enter mid-generation, finished ones release pages immediately
  • CUDA graph capture for the decode loop
  • Background tokenization overlapped with GPU compute
  • HR feature cache: LRU cache with pinned-memory buffers for async GPU-CPU transfer of upsampled image features — subsequent queries on the same image skip the expensive upsampling step
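The HR feature cache from the list above is, at its core, a plain LRU keyed by image id. A stripped-down sketch (the real engine adds pinned-memory buffers and async GPU-CPU transfer, which are omitted here; names are illustrative):

```python
from collections import OrderedDict

class HRFeatureCache:
    """LRU cache for upsampled image features, keyed by image id.

    A hit lets repeated queries on the same image skip the expensive
    upsampling step entirely.
    """
    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, image_id):
        if image_id not in self._store:
            return None  # miss: caller must run the upsampling pass
        self._store.move_to_end(image_id)   # mark as most-recently used
        return self._store[image_id]

    def put(self, image_id, features):
        self._store[image_id] = features
        self._store.move_to_end(image_id)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least-recently used
```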

In our setup on an H100, typical latencies are on the order of ~100ms prefill, ~200ms upsampling (0ms if cached), and ~50ms decode for a handful of instances. (These numbers depend on resolution, sequence length, and the number of predicted instances.)

Docker and MLX Integration for Falcon-OCR

For the Falcon-OCR model, we also provide a vLLM Docker server for fast deployment and MLX integration for Apple Silicon.

Please check out the GitHub repo for details.


The Bigger Picture: A "Bitter Lesson" for Perception

Falcon Perception is intentionally minimal: one backbone, one objective family, and small heads only where outputs are continuous and dense. The working assumption is that most gains should come from data, compute, and training signals, rather than continually expanding the pipeline with specialized modules.

The architecture doesn't block any obvious scaling path: add more images and harder prompts for better grounding, mix in text-only data for better language, increase context length for denser scenes. It's still just one sequence model.

Falcon Perception is developed by the Falcon Vision Team at the Technology Innovation Institute (TII), Abu Dhabi, UAE.

Citation

If you use Falcon-Perception, please cite

@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
