Fine-tuning YOLOv11 to detect stamps and signatures on banking documents - a practical walkthrough

Dev.to / 4/30/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The article provides a practical, production-oriented walkthrough for fine-tuning YOLOv11 to detect stamps and signatures on real banking documents, where scanned, photographed, and watermarked inputs often break models trained on natural images.
  • It argues that YOLOv11 is a better fit than layout-aware models (e.g., LayoutLMv3/Donut) and classical OpenCV methods, citing YOLO’s speed, tunable precision/recall, and improved handling of small objects common in low-resolution scans.
  • The author emphasizes that the hardest work is data preparation and annotation rather than model architecture, detailing tooling options (Roboflow for labeling/export or CVAT as an open-source alternative) and disciplined class selection.
  • It recommends starting with a minimal class taxonomy (e.g., signature and stamp, optionally handwritten initials) to reduce labeling burden and debugging complexity, with the ability to expand classes later.
  • Overall, the post frames the “playbook” as bridging the gap between online YOLO tutorials and the engineering needed for regulated banking environments, including practical deployment constraints such as inference latency.

Every day, banking ops teams manually review thousands of documents - 
 loan applications, KYC forms, contracts - looking for the right stamps,
 the right signatures, in the right places. It's slow, expensive, and
 exactly the kind of work computer vision was made to automate.
The catch is that most YOLO tutorials online teach you to detect cars,
 dogs, or people in natural photos. None of that translates cleanly to
 documents. Documents are structured, scanned at varying quality, often
 photographed on phones at angles, sometimes faxed, frequently watermarked, and almost never lit consistently. The model that detects stamps on a
 clean PDF will collapse on a phone-shot photo of the same form.
"Over the past few weeks I've been deep in shipping a YOLOv11-based
 detector for stamps and signatures on documents in a regulated banking
 environment."
The work taught me where the off-the-shelf tutorials end and where the
 real engineering begins. Here's the playbook.

## Why YOLOv11 over the alternatives

There are a few reasonable starting points for document object detection:

- Layout-aware models like LayoutLMv3 or Donut - strong for structured forms, but heavier, harder to fine-tune for a narrow task, and slower at inference. Overkill if you only need to detect a small set of objects (stamps, signatures, initials).
- Classical OpenCV approaches - template matching, contour detection, Hough transforms. Fast and lightweight but brittle on real-world scans.
- YOLO family (v8, v11) - the sweet spot for object detection on documents. Fast, well-documented, easy to fine-tune, and the precision/recall tradeoff is tunable to ops-team requirements.

I went with YOLOv11. The ultralytics Python package handles most of the busywork, inference runs well under 100ms per page on a modest GPU, and the architecture handles small objects - which stamps often are at low scan resolutions - better than older versions.

## The 80%: data preparation and annotation

Anyone who's shipped CV in production will tell you the same thing: the model is the easy part. Data is where the time goes.

**Annotation tooling.** I used Roboflow - clean web UI for bounding-box labeling, automatic train/val/test splits, easy export to YOLO format. CVAT is the open-source alternative if you can't use a SaaS for compliance reasons.

**Class taxonomy.** Resist the urge to define ten classes on day one. Start with the smallest set that solves the business problem:
- `signature`
- `stamp`
- (optionally `handwritten_initials`, if your forms include them)

More classes means more labeled examples per class, more failure modes, and a harder model to debug. You can always split a class later. You can rarely merge messy ones cleanly.

**Train/val/test split discipline.** Separate documents into the three splits by source, not just randomly. If the same form template appears in both train and val, your validation metric is lying to you - the model is learning the form layout, not the object. In a regulated environment where wrong predictions cost real money, you cannot afford a lying validation set.
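Tying the taxonomy and splits together: the `dataset.yaml` that the training call further down points at would look roughly like this for the two-class setup. This is a sketch with placeholder paths - the exact layout will match whatever your labeling tool exports:

```yaml
# dataset.yaml - minimal two-class taxonomy, splits kept separate by document source
path: datasets/bank_docs   # dataset root (placeholder)
train: images/train
val: images/val
test: images/test

names:
  0: signature
  1: stamp
```

If `handwritten_initials` earns its place later, it's a third entry under `names` plus the labeled examples to back it.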
**Augmentation strategy - and why the defaults are wrong for documents.** The off-the-shelf YOLO augmentation defaults are designed for natural images. They include rotation up to 30°, mosaic, MixUp. For documents, that's actively wrong:

- Rotation should be tightly limited (±5°). Documents are upright. Heavy rotation creates training examples that don't reflect production input.
- Mosaic augmentation should be off. Pasting four documents into a 2×2 grid produces inputs that don't exist at inference time.
- What helps instead: brightness/contrast variation (different scan qualities), JPEG compression noise (low-quality scans), partial occlusion (parts of the document obscured), Gaussian blur (out-of-focus phone shots).

"The single biggest accuracy gain in my project came from augmenting for phone-photographed scans. Production data was messier than my training set assumed - closing that gap mattered more than any architecture change."

## Training configuration that actually matters

Most YOLO hyperparameters are fine at defaults. The ones that move the needle on documents:
```python
from ultralytics import YOLO

model = YOLO('yolo11m.pt')

results = model.train(
    data='dataset.yaml',
    epochs=100,
    imgsz=1024,      # higher imgsz matters for small stamps
    batch=8,
    lr0=0.001,
    patience=20,     # early stopping if mAP stalls
    augment=True,
    mosaic=0.0,      # off for documents
    degrees=5,       # limit rotation
    fliplr=0.0,      # don't horizontally flip docs
)
```

Two things worth flagging:
**`imgsz=1024` not 640.** Stamps at low resolution can be just a few pixels across - too small for the model to detect reliably. Higher input size costs more compute per image, but the precision gain on small objects is substantial.
**Disable horizontal flipping.** A flipped form is a wrong form.
 Augmentations that produce never-seen-in-production inputs hurt
 generalization on the inputs you actually care about.
## The metric you should actually optimize for
Most tutorials default to `mAP@0.5`. For document AI in a regulated
 environment, that's the wrong primary metric.
Ops teams care about **precision**. When the model says "there's a
 signature here," they need it to be right. A false positive sends a
 document downstream that shouldn't be there, costing reviewer time. A
 false negative is recoverable - the document falls back to manual
 review, which is the existing baseline.
Track both, but if you have to optimize one, optimize precision. Your
 ops manager will thank you.
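One concrete way to act on that is to pick the deployment confidence threshold from a precision target rather than from mAP. The sketch below is my own illustration, not from the post: it assumes you have already matched each predicted box on a held-out set against ground truth (e.g. IoU >= 0.5), then walks the confidence sweep to find the loosest threshold that still meets the precision target:

```python
from typing import List, Tuple

def pick_confidence_threshold(
    scored_preds: List[Tuple[float, bool]],  # (confidence, matched a ground-truth box?)
    num_ground_truth: int,
    precision_target: float = 0.95,
) -> float:
    """Return the lowest confidence threshold whose precision still meets the target."""
    best = None
    # Walk predictions from most to least confident, tracking precision/recall
    # of the prefix that sits above each candidate threshold.
    tp = fp = 0
    for conf, is_tp in sorted(scored_preds, key=lambda p: p[0], reverse=True):
        tp += int(is_tp)
        fp += int(not is_tp)
        precision = tp / (tp + fp)
        recall = tp / max(num_ground_truth, 1)
        if precision >= precision_target:
            best = (conf, precision, recall)  # lowest qualifying threshold so far
    if best is None:
        raise ValueError("no confidence threshold reaches the precision target")
    conf, precision, recall = best
    print(f"conf >= {conf:.2f}: precision={precision:.3f}, recall={recall:.3f}")
    return conf
```

That threshold becomes the `conf` you pass at inference time; recall is whatever falls out at that operating point, and the ops team gets to see it before anything ships.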
## Inference and deployment
A model that runs on a GPU is fun. A model that runs on a CPU is
 shippable. For most document-AI workloads - where you're processing on the order of dozens to hundreds of pages per minute, not millions - 
 CPU inference with an ONNX-exported model is faster to deploy, cheaper 
 to run, and far more compatible with locked-down production environments 
 where GPU drivers are a fight you don't want.
The flow is:
1. Train with `ultralytics` (PyTorch backend, GPU during training)
2. Export the trained weights to ONNX
3. Serve via `ultralytics`'s ONNX-runtime path on CPU at inference time
Step 2 is one line:


```python
from ultralytics import YOLO

model = YOLO('best.pt')
model.export(format='onnx')  # writes best.onnx alongside best.pt
```


Step 3 - the inference service:


```python
from fastapi import FastAPI, UploadFile
from ultralytics import YOLO
from PIL import Image
import io

app = FastAPI()
model = YOLO('best.onnx')  # ONNX runtime, CPU-only

@app.post('/detect')
async def detect(file: UploadFile):
    image = Image.open(io.BytesIO(await file.read()))
    results = model(image)

    detections = []
    for r in results:
        for box in r.boxes:
            detections.append({
                'class': model.names[int(box.cls)],
                'confidence': float(box.conf),
                'bbox': box.xyxy.tolist()[0],
            })

    return {'detections': detections}
```


The most important line in that snippet is `model = YOLO('best.onnx')`
 at module level - load the model **once at startup**, never per request.
 Reloading the model on every request is the most common production
 mistake I've seen on YOLO endpoints. It's the difference between 50ms
 response time and 5,000ms.
For the container: a slim Python base image (`python:3.11-slim`) is
 enough. No CUDA, no GPU drivers, no NVIDIA dependencies. The image
 ends up under 500MB, starts in seconds, and runs anywhere - including
 locked-down corporate VMs and on-prem environments where shipping a
 GPU-dependent service is months of approvals you don't have.
That's the real tradeoff: you give up a small amount of per-request
 latency in exchange for a service that deploys today, not next quarter.

## What the tutorials don't tell you
Three lessons the standard YOLO blog posts skip:
**1. The long tail of weird scans is where production breaks.** Faxed
 pages with horizontal banding, partially photocopied documents, phone
 shots with one corner cut off, watermarks bleeding through from the
 back side. Your training set won't include enough of these. Get a
 sample of real production input as fast as possible - even just 50
 images - and use them for evaluation, not training. They tell you what
 the world actually looks like.
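In practice that means giving those production images labels of their own and a separate dataset yaml, then evaluating against them directly. A minimal sketch with placeholder file names, using the same ultralytics val API:

```python
from ultralytics import YOLO

model = YOLO('best.pt')

# 'production_eval.yaml' (placeholder name) follows the same dataset.yaml format,
# with its val split pointing at the ~50 labeled production scans.
metrics = model.val(data='production_eval.yaml', imgsz=1024)
print(metrics.box.map50)  # mAP@0.5 on the production sample only
```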
**2. Log every prediction with the input image hash.** When the model
 fails in production, you want to be able to find the exact input that
 broke it, retroactively. Hash the input, log the prediction, store both.
 That's how you build round-2 training data without hunting.
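A minimal version of that, slotted into the FastAPI handler above (field names and logger setup are my own choices, not from the post):

```python
import hashlib
import json
import logging

logger = logging.getLogger("detector")

def log_prediction(image_bytes: bytes, detections: list) -> str:
    """Log a prediction keyed by the SHA-256 of the raw input bytes; return the hash."""
    image_hash = hashlib.sha256(image_bytes).hexdigest()
    logger.info(json.dumps({"image_sha256": image_hash, "detections": detections}))
    return image_hash
```

Call it with the same bytes the endpoint already reads, and store the raw file under that hash somewhere durable; when a prediction goes wrong, the hash in the log is enough to pull the exact input back out.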
**3. Don't chase mAP@0.5:0.95.** Diminishing returns. If your business
 needs 95% precision at 70% recall, optimize for that operating point - 
 not for a metric that summarizes the whole curve. Talk to your ops
 team. Get the actual numbers they care about. Train against those.
## Closing
The model is not the bottleneck for document AI. The bottleneck is
 annotation discipline, augmentation tuned to real production input,
 and deployment that doesn't blow up under load. If you're building
 computer vision for regulated industries - banking, insurance, legal,
 healthcare - the playbook above is what's worked for me. The frameworks
 change. The data discipline doesn't.