Model Catalog

This catalog gives an overview of the models and inference engines available in Docling, organized by processing stage.

Overview

Docling's document processing pipeline consists of multiple stages, each using specialized models and inference engines. This catalog helps you understand:

  • What stages are available for document processing
  • Which model families power each stage
  • What specific models you can use
  • Which inference engines support each model

Stages and Models Overview

The following overview lists all processing stages in Docling, their model families, and available models.

Stage: Layout (Document structure detection)
Model Family: Object Detection (RT-DETR based)
Models:
  • docling-layout-heron
  • docling-layout-heron-101
  • docling-layout-egret-medium
  • docling-layout-egret-large
  • docling-layout-egret-xlarge
  • docling-layout-v2 (legacy)
Inference Engine: Transformers, ONNXRuntime (in progress)
Purpose: Detects document elements (paragraphs, tables, figures, headers, etc.)
Output: Bounding boxes with element labels (TEXT, TABLE, PICTURE, SECTION_HEADER, etc.)

Stage: OCR (Text recognition)
Model Family: Multiple OCR Engines
Models:
  • Auto
  • Tesseract (CLI or Python bindings)
  • EasyOCR
  • RapidOCR (ONNX, OpenVINO, PaddlePaddle)
  • macOS Vision (native macOS)
  • SuryaOCR
Inference Engines: Engine-specific
Purpose: Extracts text from images and scanned documents

Stage: Table Structure (Table cell recognition)
Model Family: TableFormer
Models:
  • TableFormer (accurate mode)
  • TableFormer (fast mode)
Inference Engine: docling-ibm-models
Purpose: Recognizes table structure (rows, columns, cells) and relationships

Stage: Table Structure (Table cell recognition)
Model Family: Object Detection
Models:
  • Work in progress
Inference Engine: TBD
Purpose: Alternative approach for table structure recognition using object detection

Stage: Picture Classifier (Image type classification)
Model Family: Image Classifier (Vision Transformer)
Models:
  • DocumentFigureClassifier-v2.0
Inference Engine: Transformers
Purpose: Classifies pictures into categories (Chart, Diagram, Natural Image, etc.)

Stage: VLM Convert (Full page conversion)
Model Family: Vision-Language Models
Models:
  • Granite-Docling-258M ⭐ (DocTags)
  • SmolDocling-256M (DocTags)
  • DeepSeek-OCR-3B (Markdown, API-only)
  • Granite-Vision-3.3-2B (Markdown)
  • Pixtral-12B (Markdown)
  • GOT-OCR-2.0 (Markdown)
  • Phi-4-Multimodal (Markdown)
  • Qwen2.5-VL-3B (Markdown)
  • Gemma-3-12B/27B (Markdown, MLX-only)
  • Dolphin (Markdown)
Inference Engines: Transformers, MLX, API (Ollama, LM Studio, OpenAI), vLLM, AUTO_INLINE
Purpose: Converts entire document pages to structured formats (DocTags or Markdown)
Output Formats: DocTags (structured), Markdown (human-readable)

Stage: Picture Description (Image captioning)
Model Family: Vision-Language Models
Models:
  • SmolVLM-256M
  • Granite-Vision-3.3-2B
  • Pixtral-12B
  • Qwen2.5-VL-3B
Inference Engines: Transformers, MLX, API (Ollama, LM Studio), vLLM, AUTO_INLINE
Purpose: Generates natural language descriptions of images and figures

Stage: Code & Formula (Code/math extraction)
Model Family: Vision-Language Models
Models:
  • CodeFormulaV2
  • Granite-Docling-258M
Inference Engines: Transformers, MLX, AUTO_INLINE
Purpose: Extracts and recognizes code blocks and mathematical formulas

Inference Engines by Model Family

Object Detection Models (Layout)

| Model | Inference Engine | Supported Devices |
|---|---|---|
| All Layout models | docling-ibm-models | CPU, CUDA, MPS, XPU |

Note: Layout models use a specialized RT-DETR-based object detection framework from docling-ibm-models.

TableFormer Models (Table Structure)

| Model | Inference Engine | Supported Devices |
|---|---|---|
| TableFormer (fast) | docling-ibm-models | CPU, CUDA, XPU |
| TableFormer (accurate) | docling-ibm-models | CPU, CUDA, XPU |

Note: MPS is currently disabled for TableFormer due to performance issues.

Image Classifier (Picture Classifier)

| Model | Inference Engine | Supported Devices |
|---|---|---|
| DocumentFigureClassifier-v2.0 | Transformers (ViT) | CPU, CUDA, MPS, XPU |
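
The device tables above can be treated as a simple lookup when deciding where a stage may run. The sketch below is not Docling code: `SUPPORTED_DEVICES` and `resolve_device` are hypothetical names, and falling back to CPU is an assumption made for illustration. Only the device lists themselves come from the tables.

```python
# Supported devices per stage, copied from the tables in this section.
# The stage keys are hypothetical labels, not Docling identifiers.
SUPPORTED_DEVICES = {
    "layout": ("cpu", "cuda", "mps", "xpu"),
    "table_structure": ("cpu", "cuda", "xpu"),  # MPS is currently disabled
    "picture_classifier": ("cpu", "cuda", "mps", "xpu"),
}


def resolve_device(stage: str, requested: str) -> str:
    """Return the requested device if the stage lists it, otherwise "cpu".

    The CPU fallback is an assumption of this sketch, not documented behaviour.
    """
    supported = SUPPORTED_DEVICES[stage]
    requested = requested.lower()
    return requested if requested in supported else "cpu"


if __name__ == "__main__":
    for stage in SUPPORTED_DEVICES:
        print(stage, "->", resolve_device(stage, "mps"))
```

On an Apple Silicon machine this kind of check makes it explicit that layout and picture classification could use MPS while table structure recognition would not.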

OCR Engines

| OCR Engine | Backend | Language Support | Notes |
|---|---|---|---|
| Tesseract | CLI or tesserocr | 100+ languages | Most widely used, good accuracy |
| EasyOCR | PyTorch | 80+ languages | GPU-accelerated, good for Asian languages |
| RapidOCR | ONNX/OpenVINO/Paddle | Multiple | Fast, multiple backend options |
| macOS Vision | Native macOS | 20+ languages | macOS only, excellent quality |
| SuryaOCR | PyTorch | 90+ languages | Modern, good for complex layouts |
| Auto | Automatic | Varies | Automatically selects best available engine |
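
To see which engines from the OCR table are usable on a given machine, one option is to probe for each engine's Python package or command-line binary. The helper below is a hypothetical sketch and does not reflect how Docling's Auto option is implemented. The module names in `DEFAULT_PROBES` (`easyocr`, `tesserocr`, `rapidocr_onnxruntime`) are assumptions about each engine's import name and may differ between versions.

```python
import importlib.util
import shutil

# Hypothetical probe list: (display name, kind of check, target).
# Import names are assumptions; adjust them to the packages you install.
DEFAULT_PROBES = (
    ("EasyOCR", "module", "easyocr"),
    ("RapidOCR", "module", "rapidocr_onnxruntime"),
    ("Tesseract (Python bindings)", "module", "tesserocr"),
    ("Tesseract (CLI)", "binary", "tesseract"),
)


def is_available(kind: str, target: str) -> bool:
    """Check for an importable top-level module or an executable on PATH."""
    if kind == "module":
        return importlib.util.find_spec(target) is not None
    if kind == "binary":
        return shutil.which(target) is not None
    raise ValueError(f"unknown probe kind: {kind!r}")


def available_engines(probes=DEFAULT_PROBES) -> list:
    """Return the display names of all probes that succeed, in order."""
    return [name for name, kind, target in probes if is_available(kind, target)]


if __name__ == "__main__":
    print("Usable OCR engines:", available_engines() or "none found")
```

A probe only shows that the package or binary is present. It says nothing about language data or model weights, which each engine manages separately.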

Vision-Language Models (VLM)

VLM Convert Stage

Preset ID Model Parameters Transformers MLX API (OpenAI-compatible) vLLM Output Format
granite_docling Granite-Docling-258M 258M Ollama DocTags
smoldocling SmolDocling-256M 256M DocTags
deepseek_ocr DeepSeek-OCR-3B 3B Ollama, LM Studio Markdown
granite_vision Granite-Vision-3.3-2B 2B Ollama, LM Studio Markdown
pixtral Pixtral-12B 12B Markdown
got_ocr GOT-OCR-2.0 - Markdown
phi4 Phi-4-Multimodal - Markdown
qwen Qwen2.5-VL-3B 3B Markdown
gemma_12b Gemma-3-12B 12B Markdown
gemma_27b Gemma-3-27B 27B Markdown
dolphin Dolphin - Markdown
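
Because the output format decides how downstream code must parse a page, it can help to keep the preset-to-format mapping from the table above in one place. The dictionary below copies the Preset ID and Output Format columns; `VLM_CONVERT_OUTPUT_FORMAT` and `presets_for` are hypothetical names used only for this sketch and are not part of Docling.

```python
# Preset ID -> output format, copied from the VLM Convert table above.
VLM_CONVERT_OUTPUT_FORMAT = {
    "granite_docling": "DocTags",
    "smoldocling": "DocTags",
    "deepseek_ocr": "Markdown",
    "granite_vision": "Markdown",
    "pixtral": "Markdown",
    "got_ocr": "Markdown",
    "phi4": "Markdown",
    "qwen": "Markdown",
    "gemma_12b": "Markdown",
    "gemma_27b": "Markdown",
    "dolphin": "Markdown",
}


def presets_for(output_format: str) -> list:
    """Return the preset IDs that emit the given format, sorted by name."""
    return sorted(
        preset
        for preset, fmt in VLM_CONVERT_OUTPUT_FORMAT.items()
        if fmt == output_format
    )


if __name__ == "__main__":
    print("DocTags presets:", presets_for("DocTags"))
    print("Markdown presets:", presets_for("Markdown"))
```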

Picture Description Stage

Preset ID Model Parameters Transformers MLX API (OpenAI-compatible) vLLM
smolvlm SmolVLM-256M 256M LM Studio
granite_vision Granite-Vision-3.3-2B 2B Ollama, LM Studio
pixtral Pixtral-12B 12B
qwen Qwen2.5-VL-3B 3B

Code & Formula Stage

Preset ID Model Parameters Transformers MLX
codeformulav2 CodeFormulaV2 -
granite_docling Granite-Docling-258M 258M

Usage Examples

Layout Detection

from docling.datamodel.pipeline_options import LayoutOptions
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_HERON

# Use Heron layout model (default)
layout_options = LayoutOptions(model_spec=DOCLING_LAYOUT_HERON)

Table Structure Recognition

from docling.datamodel.pipeline_options import TableStructureOptions, TableFormerMode

# Use accurate mode for best quality
table_options = TableStructureOptions(
    mode=TableFormerMode.ACCURATE,
    do_cell_matching=True
)

Picture Classification

from docling.models.stages.picture_classifier.document_picture_classifier import (
    DocumentPictureClassifierOptions
)

# Use default picture classifier
classifier_options = DocumentPictureClassifierOptions()

OCR

from docling.datamodel.pipeline_options import TesseractOcrOptions

# Use Tesseract with English and German
ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
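
The option objects above only take effect once they are attached to a pipeline. The sketch below shows one way this might look, assuming the `PdfPipelineOptions`, `PdfFormatOption`, and `DocumentConverter` wiring from Docling's standard PDF pipeline applies to the version documented here. `build_pdf_converter` is a hypothetical helper name, and the Docling imports sit inside the function so that defining it does not require Docling to be installed.

```python
def build_pdf_converter():
    """Build a converter with Tesseract OCR and accurate TableFormer.

    Hypothetical helper; the field names follow Docling's standard PDF
    pipeline options and may differ in other versions.
    """
    # Lazy imports: Docling is only needed when the function is called.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TableFormerMode,
        TableStructureOptions,
        TesseractOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        do_table_structure=True,
        ocr_options=TesseractOcrOptions(lang=["eng", "deu"]),
        table_structure_options=TableStructureOptions(
            mode=TableFormerMode.ACCURATE,
            do_cell_matching=True,
        ),
    )
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

With Docling and Tesseract installed, `build_pdf_converter().convert("report.pdf")` would then run the configured pipeline, where `report.pdf` is a placeholder path.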

VLM Convert (Full Page)

from docling.datamodel.pipeline_options import VlmConvertOptions

# Use SmolDocling with auto-selected engine
options = VlmConvertOptions.from_preset("smoldocling")

# Or force specific engine
from docling.datamodel.vlm_engine_options import MlxVlmEngineOptions
options = VlmConvertOptions.from_preset(
    "smoldocling",
    engine_options=MlxVlmEngineOptions()
)

Picture Description

from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

# Use Granite Vision for detailed descriptions
options = PictureDescriptionVlmOptions.from_preset("granite_vision")

Code & Formula Extraction

from docling.datamodel.pipeline_options import CodeFormulaVlmOptions

# Use specialized CodeFormulaV2 model
options = CodeFormulaVlmOptions.from_preset("codeformulav2")

Notes

  • DocTags Format: Structured XML-like format optimized for document understanding
  • Markdown Format: Human-readable format for general-purpose conversion
  • Model Updates: New models are added regularly. Check the codebase for the latest additions
  • Engine Compatibility: Not all engines work on all platforms. AUTO_INLINE selects a compatible engine automatically
  • Performance: Actual performance varies by hardware, document complexity, and model size