# Model Catalog
This document provides a comprehensive overview of all models and inference engines available in Docling, organized by processing stage.
## Overview
Docling's document processing pipeline consists of multiple stages, each using specialized models and inference engines. This catalog helps you understand:
- What stages are available for document processing
- Which model families power each stage
- What specific models you can use
- Which inference engines support each model
## Stages and Models Overview

The following table shows all processing stages in Docling, their model families, and the inference engines behind them. The specific models available for each family are listed in the sections below.

| Stage | Model Family | Inference Engine(s) | Purpose | Output |
|---|---|---|---|---|
| Layout (document structure detection) | Object Detection (RT-DETR based) | Transformers, ONNXRuntime (in progress) | Detects document elements (paragraphs, tables, figures, headers, etc.) | Bounding boxes with element labels (TEXT, TABLE, PICTURE, SECTION_HEADER, etc.) |
| OCR (text recognition) | Multiple OCR engines | Engine-specific | Extracts text from images and scanned documents | |
| Table Structure (table cell recognition) | TableFormer | docling-ibm-models | Recognizes table structure (rows, columns, cells) and their relationships | |
| Table Structure (table cell recognition) | Object Detection | TBD | Alternative approach to table structure recognition using object detection | |
| Picture Classifier (image type classification) | Image Classifier (Vision Transformer) | Transformers | Classifies pictures into categories (Chart, Diagram, Natural Image, etc.) | |
| VLM Convert (full page conversion) | Vision-Language Models | Transformers, MLX, API (Ollama, LM Studio, OpenAI), vLLM, AUTO_INLINE | Converts entire document pages to structured formats | DocTags (structured) or Markdown (human-readable) |
| Picture Description (image captioning) | Vision-Language Models | Transformers, MLX, API (Ollama, LM Studio), vLLM, AUTO_INLINE | Generates natural-language descriptions of images and figures | |
| Code & Formula (code/math extraction) | Vision-Language Models | Transformers, MLX, AUTO_INLINE | Extracts and recognizes code blocks and mathematical formulas | |
## Inference Engines by Model Family

### Object Detection Models (Layout)
| Model | Inference Engine | Supported Devices |
|---|---|---|
| All Layout models | docling-ibm-models | CPU, CUDA, MPS, XPU |
Note: Layout models use a specialized RT-DETR-based object detection framework from docling-ibm-models.
### TableFormer Models (Table Structure)
| Model | Inference Engine | Supported Devices |
|---|---|---|
| TableFormer (fast) | docling-ibm-models | CPU, CUDA, XPU |
| TableFormer (accurate) | docling-ibm-models | CPU, CUDA, XPU |
Note: MPS is currently disabled for TableFormer due to performance issues.
### Image Classifier (Picture Classifier)
| Model | Inference Engine | Supported Devices |
|---|---|---|
| DocumentFigureClassifier-v2.0 | Transformers (ViT) | CPU, CUDA, MPS, XPU |
### OCR Engines
| OCR Engine | Backend | Language Support | Notes |
|---|---|---|---|
| Tesseract | CLI or tesserocr | 100+ languages | Most widely used, good accuracy |
| EasyOCR | PyTorch | 80+ languages | GPU-accelerated, good for Asian languages |
| RapidOCR | ONNX/OpenVINO/Paddle | Multiple | Fast, multiple backend options |
| macOS Vision | Native macOS | 20+ languages | macOS only, excellent quality |
| SuryaOCR | PyTorch | 90+ languages | Modern, good for complex layouts |
| Auto | Automatic | Varies | Automatically selects best available engine |
### Vision-Language Models (VLM)

#### VLM Convert Stage
| Preset ID | Model | Parameters | Transformers | MLX | API (OpenAI-compatible) | vLLM | Output Format |
|---|---|---|---|---|---|---|---|
| `granite_docling` | Granite-Docling-258M | 258M | ✅ | ✅ | Ollama | ❌ | DocTags |
| `smoldocling` | SmolDocling-256M | 256M | ✅ | ✅ | ❌ | ❌ | DocTags |
| `deepseek_ocr` | DeepSeek-OCR-3B | 3B | ❌ | ❌ | Ollama, LM Studio | ❌ | Markdown |
| `granite_vision` | Granite-Vision-3.3-2B | 2B | ✅ | ❌ | Ollama, LM Studio | ✅ | Markdown |
| `pixtral` | Pixtral-12B | 12B | ✅ | ✅ | ❌ | ❌ | Markdown |
| `got_ocr` | GOT-OCR-2.0 | - | ✅ | ❌ | ❌ | ❌ | Markdown |
| `phi4` | Phi-4-Multimodal | - | ✅ | ❌ | ❌ | ✅ | Markdown |
| `qwen` | Qwen2.5-VL-3B | 3B | ✅ | ✅ | ❌ | ❌ | Markdown |
| `gemma_12b` | Gemma-3-12B | 12B | ❌ | ✅ | ❌ | ❌ | Markdown |
| `gemma_27b` | Gemma-3-27B | 27B | ❌ | ✅ | ❌ | ❌ | Markdown |
| `dolphin` | Dolphin | - | ✅ | ❌ | ❌ | ❌ | Markdown |
#### Picture Description Stage

| Preset ID | Model | Parameters | Transformers | MLX | API (OpenAI-compatible) | vLLM |
|---|---|---|---|---|---|---|
| `smolvlm` | SmolVLM-256M | 256M | ✅ | ✅ | LM Studio | ❌ |
| `granite_vision` | Granite-Vision-3.3-2B | 2B | ✅ | ❌ | Ollama, LM Studio | ✅ |
| `pixtral` | Pixtral-12B | 12B | ✅ | ✅ | ❌ | ❌ |
| `qwen` | Qwen2.5-VL-3B | 3B | ✅ | ✅ | ❌ | ❌ |
#### Code & Formula Stage

| Preset ID | Model | Parameters | Transformers | MLX |
|---|---|---|---|---|
| `codeformulav2` | CodeFormulaV2 | - | ✅ | ❌ |
| `granite_docling` | Granite-Docling-258M | 258M | ✅ | ✅ |
## Usage Examples

### Layout Detection

```python
from docling.datamodel.pipeline_options import LayoutOptions
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_HERON

# Use the Heron layout model (default)
layout_options = LayoutOptions(model_spec=DOCLING_LAYOUT_HERON)
```
### Table Structure Recognition

```python
from docling.datamodel.pipeline_options import TableStructureOptions, TableFormerMode

# Use accurate mode for best quality
table_options = TableStructureOptions(
    mode=TableFormerMode.ACCURATE,
    do_cell_matching=True,
)
```
### Picture Classification

```python
from docling.models.stages.picture_classifier.document_picture_classifier import (
    DocumentPictureClassifierOptions,
)

# Use the default picture classifier
classifier_options = DocumentPictureClassifierOptions()
```
### OCR

```python
from docling.datamodel.pipeline_options import TesseractOcrOptions

# Use Tesseract with English and German
ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
```
### VLM Convert (Full Page)

```python
from docling.datamodel.pipeline_options import VlmConvertOptions

# Use SmolDocling with an auto-selected engine
options = VlmConvertOptions.from_preset("smoldocling")

# Or force a specific engine
from docling.datamodel.vlm_engine_options import MlxVlmEngineOptions

options = VlmConvertOptions.from_preset(
    "smoldocling",
    engine_options=MlxVlmEngineOptions(),
)
```
### Picture Description

```python
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

# Use Granite Vision for detailed descriptions
options = PictureDescriptionVlmOptions.from_preset("granite_vision")
```
### Code & Formula Extraction

```python
from docling.datamodel.pipeline_options import CodeFormulaVlmOptions

# Use the specialized CodeFormulaV2 model
options = CodeFormulaVlmOptions.from_preset("codeformulav2")
```
## Additional Resources
- Vision Models Usage Guide - VLM-specific documentation
- Advanced Options - Advanced configuration
- GPU Support - GPU acceleration setup
- Supported Formats - Input format support
## Notes
- DocTags Format: Structured XML-like format optimized for document understanding
- Markdown Format: Human-readable format for general-purpose conversion
- Model Updates: New models are added regularly. Check the codebase for latest additions
- Engine Compatibility: Not all engines work on all platforms. AUTO_INLINE handles this automatically
- Performance: Actual performance varies by hardware, document complexity, and model size