29/04/2025, 18:24 [Feature] IBM's SmolDocling integration for ocr ai as an another option for ocr in pdf2parquet · Issue #1145
· Issue #1145 · data-prep-kit/data-prep-kit
[Feature] IBM's SmolDocling integration for ocr ai as an another option for ocr in pdf2parquet #1145
Open
ShiroYasha18 opened on Mar 19
Search before asking
I searched the issues and found no similar issues.
Component
transforms/pdf2parquet
Feature
Hi, I have been working with pdf2parquet for a couple of months and have tested it on multiple
documents for my internship project at IBM. I have seen that traditional OCR fails on handwritten
documents and on documents with other issues, such as multiple colors and varying font sizes,
which are classic failure cases for traditional OCR engines. For my project I used a VLM, but IBM's
Docling team recently released SmolDocling, which is pretty impressive. I would like to contribute
to this project by integrating that feature into the pdf2parquet pipeline! This will enable support
for PDFs, and maybe images too if an image2parquet comes around in the future.
Are you willing to submit a PR?
Yes I am willing to submit a PR!
ShiroYasha18 added enhancement on Mar 19
https://github.com/data-prep-kit/data-prep-kit/issues/1145
touma-I on Mar 21 Collaborator
@ShiroYasha18 Thank you. I would like to tag @dolfim-ibm for his thoughts on this. Also, do you
happen to have a specific example of sample data that shows where the gap is in today's release of
the code and how the proposed enhancement would address it?
touma-I assigned ShiroYasha18 on Mar 21
dolfim-ibm on Mar 21 Contributor
You basically have to upgrade the Docling version and reproduce this PR, which exposes the option
in the Docling CLI:
https://github.com/docling-project/docling/pull/1199/files#diff-bab084dbf6f5d7e3159fc059293894cf3cf58bf0fa70bd154382e03d9ba0184b.
In practice:
1. Upgrade Docling
2. Define the new pipeline option
3. Modify the pdf2parquet code setting the pipeline_options similar to the PR linked above.
4. This might also require updating the test results.
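Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration using the standard library: the `--ocr_backend` flag, the `OcrBackend` enum, and the `PipelineSettings` dataclass are hypothetical stand-ins, not names from pdf2parquet or Docling. The real option and class names should be taken from the Docling PR linked above.

```python
import argparse
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OcrBackend(str, Enum):
    # Hypothetical values for a new transform option (step 2).
    STANDARD = "standard"
    SMOLDOCLING = "smoldocling"


@dataclass
class PipelineSettings:
    # Hypothetical stand-in for the pipeline options object the transform builds.
    pipeline: str
    vlm_model: Optional[str] = None


def build_parser() -> argparse.ArgumentParser:
    # Step 2: expose the new option on the command line.
    parser = argparse.ArgumentParser(description="sketch of a new OCR backend option")
    parser.add_argument(
        "--ocr_backend",
        choices=[b.value for b in OcrBackend],
        default=OcrBackend.STANDARD.value,
        help="which OCR/conversion backend to use",
    )
    return parser


def settings_for(backend: OcrBackend) -> PipelineSettings:
    # Step 3: translate the option into pipeline settings.
    if backend is OcrBackend.SMOLDOCLING:
        return PipelineSettings(pipeline="vlm", vlm_model="smoldocling")
    return PipelineSettings(pipeline="standard")
```

Parsing `["--ocr_backend", "smoldocling"]` with `build_parser()` and passing `OcrBackend(args.ocr_backend)` to `settings_for` selects the VLM settings, while omitting the flag keeps the existing default path unchanged.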
agoyal26 on Mar 25 Collaborator
@ShiroYasha18 Did the above steps help?
ShiroYasha18 on Mar 25 Author
Hello, sorry for the late reply!
@touma-I Thank you for assigning this issue to me. I will submit the PR for it. The main sample
data I have tested this on is handwritten answer sheets from educational institutions. Basic OCR
does not perform well on such data, so it was expected that the text would not be extracted.
Here is a sample image of what I am talking about: [image]
Because SmolDocling is a vision-based model, it improves OCR capabilities by also taking the
"layout" into account, which is a major problem for traditional OCR such as easyocr. It extracted a
good amount of text from this image. As far as I have tested, it works pretty well on things like
handwritten text documents, medical invoices, etc. @dolfim-ibm Thank you for the steps and the
guidance. I will look into these and let you know if there are any other problems!
touma-I assigned shahrokhDaijavad 28 days ago
touma-I added under-review 28 days ago
shahrokhDaijavad 28 days ago Collaborator
@ShiroYasha18 I just went through this issue and understood what this is about and how it can be
fixed (using the steps by @dolfim-ibm). The reason @touma-I assigned it to me is just to follow-up
with you and when you have submitted a PR, review your PR. When do you think you will be able to
submit a PR?
ShiroYasha18 27 days ago · edited by ShiroYasha18 Author
Hi @dolfim-ibm, @shahrokhDaijavad,
I wanted to provide a quick update on this issue. First, my apologies for the delayed follow-up—I’ve
been working through the implementation details and testing SmolDocling locally to ensure a smooth
integration. As I’m relatively new to the codebase, I’ve been taking time to thoroughly understand the
workflow (especially referencing PR #1199) to avoid missteps.
That said, I’m treating this as high priority and will submit a PR by next Wednesday (9/04/2025) at
the latest. If there are any specific considerations or potential roadblocks I should be aware of, please
let me know! I’ll also share incremental updates if that helps.
Thank you for your patience—I’m committed to seeing this through and will make sure it’s done right.
shahrokhDaijavad 27 days ago Collaborator
Thanks, @ShiroYasha18. Sounds good. What is PR #1199 that you are referring to? You must mean a
different PR.
ShiroYasha18 27 days ago · edited by ShiroYasha18 Author
The one @dolfim-ibm gave me to look into:
https://github.com/docling-project/docling/pull/1199/files#diff-bab084dbf6f5d7e3159fc059293894cf3cf58bf0fa70bd154382e03d9ba0184b
My apologies for the confusion; it's from the Docling repo.
shahrokhDaijavad 27 days ago Collaborator
Thanks for the clarification, @ShiroYasha18 !
ShiroYasha18 last month · edited by ShiroYasha18 Author
@shahrokhDaijavad can you please help me with this?
@dolfim-ibm quick updates:
I read the code in both repos and looked at the PR you mentioned.
As I understand it, SmolDocling is already integrated in Docling, and DPK's pdf2parquet uses
Docling directly as a dependency rather than keeping a copy of it in this repo. So, following your
steps, upgrading the Docling version unlocks support for SmolDocling. Once that is done, there is
no pipeline_options file in this repo, since Docling is referenced directly. The PR was merged, so
I am assuming the pipeline options were updated and I can import the SmolDocling pipeline options
in the pdf2parquet code. Where I am stuck is what happens after the import: how are the pipeline
options actually used? I can also see that the code in pipeline_options contains the function to
call SmolDocling, but how do I use that in the pdf2parquet code?
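One way to picture the answer to the question above: the options object is usually not called on its own. It is handed to the converter once, when the converter is constructed, and the converter applies it on every conversion. The sketch below shows only that wiring pattern with standard-library stand-ins. `StandInOptions`, `StandInConverter`, `TransformSketch`, and the `ocr_backend` config key are all hypothetical; whether pdf2parquet builds its Docling converter in this shape is an assumption to check against the transform source and the Docling PR.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StandInOptions:
    # Hypothetical stand-in for an imported pipeline options object.
    name: str


class StandInConverter:
    """Hypothetical stand-in for a converter configured once and reused."""

    def __init__(self, format_options: Dict[str, StandInOptions]):
        # Options are stored per input format at construction time.
        self.format_options = format_options

    def convert(self, filename: str) -> str:
        # Look up the options by file extension and apply them.
        ext = filename.rsplit(".", 1)[-1].lower()
        opts = self.format_options.get(ext)
        if opts is None:
            raise ValueError(f"no pipeline configured for .{ext} files")
        return f"{filename} converted with {opts.name} pipeline"


class TransformSketch:
    """Hypothetical transform: picks options from config, then builds the converter."""

    def __init__(self, config: dict):
        backend = config.get("ocr_backend", "standard")
        opts = StandInOptions(name="vlm" if backend == "smoldocling" else "standard")
        self.converter = StandInConverter(format_options={"pdf": opts})

    def transform(self, filename: str) -> str:
        # Per-file work only calls the already configured converter.
        return self.converter.convert(filename)
```

With this shape, `TransformSketch({"ocr_backend": "smoldocling"}).transform("sheet.pdf")` goes through the VLM options, and the per-file code path does not need to know which options were chosen.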
shahrokhDaijavad 14 hours ago Collaborator
Hi, @ShiroYasha18. Sorry that I haven't responded until now. Now that the pdf2parquet =>
docling2parquet transition is complete, can you please experiment with the do_ocr parameter set to
true on your example file and see what result you get?
Metadata
Assignees
ShiroYasha18
shahrokhDaijavad
Labels
enhancement under-review
Type
No type
Projects
No projects
Milestone
No milestone
Relationships
None yet
Development
No branches or pull requests