This is a Quarkus extension for the Docling project. Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
- 🗂️ Parsing of multiple document formats incl. PDF, DOCX, XLSX, HTML, images, and more
- 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
- 🧬 Unified, expressive DoclingDocument representation format
- ↪️ Various export formats and options, including Markdown, HTML, and lossless JSON
- 🔒 Local execution capabilities for sensitive data and air-gapped environments
- 🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
- 🔍 Extensive OCR support for scanned PDFs and images
- 🥚 Support for several Visual Language Models (SmolDocling)
- 💻 Simple and convenient CLI
Currently, this extension is a set of wrappers around the Docling Java project, which communicates with a Docling Serve instance via a REST API. This extension also provides a Dev Service and Dev UI integrations.
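To make the REST interaction concrete, here is a minimal sketch of the kind of HTTP call that ends up being sent to a Docling Serve instance. It uses only the JDK's `java.net.http` client rather than this extension's API. The class name `DoclingServeSketch`, the base URL `http://localhost:5001`, the endpoint path `/v1/convert/source`, and the JSON payload shape are all assumptions for illustration and may differ between Docling Serve versions, so check the API documentation of the instance you run.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical illustration only; not part of this extension or of Docling Java.
public class DoclingServeSketch {

    // Assumed endpoint path; verify it against your Docling Serve version.
    static final String CONVERT_PATH = "/v1/convert/source";

    // Builds an assumed JSON payload asking the server to convert a document
    // that is reachable over HTTP. The URL is not JSON-escaped here, so this
    // is only suitable for simple URLs without quotes or backslashes.
    static String buildPayload(String documentUrl) {
        return "{\"sources\":[{\"kind\":\"http\",\"url\":\"" + documentUrl + "\"}]}";
    }

    // Builds (but does not send) the POST request for the conversion call.
    static HttpRequest buildConvertRequest(String baseUrl, String documentUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + CONVERT_PATH))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildPayload(documentUrl)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed local address of a running Docling Serve instance.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:5001";
        HttpRequest request = buildConvertRequest(baseUrl, "https://example.com/sample.pdf");

        // Sending requires a reachable server; the response body is printed as-is.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

In a Quarkus application you would normally rely on the extension's wrappers and the Dev Service instead of building requests by hand; the sketch only shows what travels over the wire under the stated assumptions.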
The eventual goal is to unify the DoclingDocument format with LangChain4j's Document abstraction so that Docling can be used in a LangChain4j RAG pipeline for ingesting data.
Take a look at the documentation for more information.
You can also see an example, with a video, at: https://github.com/lordofthejars-ai/mission-impossible-rag
Thanks go to these wonderful people (emoji key):
- Eric Deandrea 💻 🚧
- Alex Soto 💻 🚧 🖋 📖 🤔
- Alina Yurenko 🐛
This project follows the all-contributors specification. Contributions of any kind welcome!