Introduction to Python Pytesseract Package
Last Updated :
20 Sep, 2024
In the age of digital transformation, extracting information from images and scanned documents has become a crucial task.
Optical Character Recognition (OCR) is a technology that enables this by converting different types of documents, such as scanned paper documents, PDF files, or images taken by a digital camera, into editable and searchable data. The Pytesseract module, a Python wrapper for Google's Tesseract-OCR Engine, is one of the most popular tools for this purpose.
What is Pytesseract?
Pytesseract is an OCR tool for Python, which enables developers to convert images containing text into string formats that can be processed further. It is essentially a Python binding for Tesseract, which is one of the most accurate open-source OCR engines available today.
Steps to Download and Configure Tesseract-OCR
1. Download and Install Tesseract-OCR
Tesseract is a free and open-source OCR (Optical Character Recognition) engine.
For Windows:
- Go to the official Tesseract GitHub release page: https://github.com/UB-Mannheim/tesseract/wiki
- Download the Windows installer (
tesseract-ocr-setup.exe
) from the releases section. - Run the installer and complete the installation process.
Download Tesseract-OCRFor macOS:
- We can install Tesseract via Homebrew:
brew install tesseract
For Linux (Ubuntu/Debian):
- Install Tesseract using the package manager:
sudo apt update
sudo apt install tesseract-ocr
2. Add Tesseract to the Environment Variables
For Windows:
- Open the Start Menu, search for Environment Variables, and select Edit the system environment variables.
- In the System Properties window, click the Environment Variables button.
- Under System variables, find and select the
Path
variable, then click Edit. - Click New and add the path to the
tesseract.exe
file (usually located in C:\Program Files\Tesseract-OCR\
). - Click OK to save changes.
For macOS and Linux:
Open a terminal and add the Tesseract path to the shell configuration file (.bashrc
, .zshrc
, or .bash_profile
):
export PATH="/usr/local/bin/tesseract:$PATH"
Save the file and run the following command to apply changes:
source ~/.bashrc # or source ~/.zshrc
3. Verify Tesseract Installation
After adding Tesseract to our environment variables, open a terminal (or Command Prompt on Windows) and type:
tesseract --version
check tesseract version4. Install Pytesseract:
To use Tesseract with Python, we also need to install the pytesseract
package, which acts as a Python wrapper for Tesseract.
Let's install pytesseract using pip:
pip install pytesseract
Verify the installation:
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
Check pytesseract versionUsage of Pytesseract With Practical Examples
In the first and second example, we have used the first image and for the third example we have used the second image.
Example 1: Basic Text Extraction from an Image
Explanation : This simple example shows how to open an image and use Pytesseract to extract the text. The image_to_string method converts the image into text.
Python
from PIL import Image
import pytesseract
# Load an image
img = Image.open('image.png')
# Extract text from image
text = pytesseract.image_to_string(img)
print(text)
Output
Reading text from image using pytessactExample 2 : Extracting Text from a Specific Area in an Image
Explanation: Sometimes, we may need to extract text from a specific area of the image. This example shows how to crop a region of interest (ROI) from the image and extract text from that region.
Python
from PIL import Image
import pytesseract
# Load an image
img = Image.open('image.png')
# Define the region of interest (left, upper, right, lower)
roi = img.crop((50, 50, 300, 300))
# Extract text from the cropped region
text = pytesseract.image_to_string(roi)
print(text)
Output:
Extracting Text from a Specific part of the imageExample 3: Text Extraction from a Noisy Image with Pre-processing
Explanation : In this example, the image is first converted to grayscale to simplify the data. A median filter is then applied to reduce noise, which is especially helpful for images with a lot of background noise. After that, the contrast of the image is enhanced to make the text more distinguishable from the background. Finally, the pre-processed image is passed to Pytesseract for text extraction.
Python
from PIL import Image, ImageFilter, ImageEnhance
import pytesseract
# Load an image with noise or low contrast
img = Image.open('noisy_image.jpg')
# Convert the image to grayscale
img = img.convert('L')
# Apply a median filter to reduce noise
img = img.filter(ImageFilter.MedianFilter())
# Enhance the image contrast
enhancer = ImageEnhance.Contrast(img)
img = enhancer.enhance(2)
# Extract text from the pre-processed image
text = pytesseract.image_to_string(img)
print(text)
Output
Reading Text from a noisy image using pytesseractAdvantages of Pytesseract Module
- Accuracy: Pytesseract is based on Tesseract-OCR, which is known for its high accuracy in text extraction, especially for printed documents.
- Language Support: It supports over 100 languages, making it versatile for various applications worldwide.
- Open Source: Both Pytesseract and Tesseract-OCR are open-source, allowing for free usage and modification according to project needs.
- Ease of Use: With simple integration into Python projects, Pytesseract provides an easy way to implement OCR functionality.
Limitations of Pytesseract Module
- Performance: Pytesseract can be slow when processing large documents or images with a lot of text, making it less suitable for real-time applications.
- Complex Layouts: It may struggle with images that have complex layouts, such as those containing tables or multiple columns of text.
- Accuracy with Handwriting: While it's quite effective with printed text, its accuracy decreases significantly with handwritten text.
- Image Quality Dependence: The quality of the extracted text heavily depends on the quality of the input image. Low-resolution or noisy images may lead to incorrect text extraction.
When to Use
- Digitizing Printed Documents: If we need to convert printed text from scanned documents into editable text.
- Extracting Text from Images: Useful in scenarios where we have images containing text that we need to process further.
- Multi-language Text Extraction: When our application requires OCR in different languages, Pytesseract’s extensive language support comes in handy.
When Not to Use
- Real-Time Processing: For applications requiring real-time text extraction, Pytesseract might not be the best choice due to its performance limitations.
- Complex Layouts or Handwritten Text: If our documents have complex layouts or handwritten text, we may need a more sophisticated OCR solution or additional pre-processing steps.
- High-Resolution Image Requirement: If the images are of low quality or resolution, the results may be inaccurate, making Pytesseract less effective.
Conclusion
Pytesseract is a powerful and accessible tool for anyone looking to incorporate OCR functionality into their Python projects. While it has its limitations, particularly with handwritten text and complex layouts, it excels in extracting text from images and printed documents with high accuracy. Whether we're working on digitizing documents or extracting text from images in multiple languages, Pytesseract provides a simple and effective solution.
Similar Reads
Introduction to Python Pydantic Library
In modern Python development, data validation and parsing are essential components of building robust and reliable applications. Whether we're developing APIs, working with configuration files, or handling data from various sources, ensuring that our data is correctly validated and parsed is crucial
6 min read
How to Install Python-pymarc package on Linux?
Pymarc is a python package for working with bibliographic data encoded in MARC21. Pymarc provides an API for reading, writing, and revising MARC records. It was mostly designed to be an emergency eject seat, forgetting your data assets out of MARC and into some kind of more rational representation.
2 min read
How to Install Python-pytaglib package on Linux?
Pytaglib is a package in Python used for audio tagging. It is a cross-platformed library that works on Windows, Linux, and macOS. This package supports mp3, flac, ogg, etc file formats. It also supports arbitrary, non-standard tag names. So, in this article, we will be installing the Pytaglib packag
1 min read
How to install pyscopg2 package in Python?
psycopg2 is the most popular PostgreSQL database adapter for the Python programming language. It is a DB API 2.0 compliant PostgreSQL driver is actively developed. It is designed for heavily multi-threaded applications that create and destroy lots of cursors and create a large number of "INSERTs" or
1 min read
How to Install Python-pyautogui package on Linux?
PyAutoGUI allows your Python programs to automate interactions with other apps by controlling the mouse and keyboard. The API was created with simplicity in mind. PyAutoGUI is a cross-platformed library that works on Windows, macOS, and Linux and runs on Python 2 and 3. We will install the PyAutoGUI
2 min read
How to Install PyBullet package in Python on Linux?
PyBullet is a Python library based on the Bullet Physics SDK for physics simulation and robotics. It works with a variety of sensors, actuators, and terrains. Basically, this is a fantastic library for anyone who is just getting started with robotics or physics modeling. So, in this tutorial, we'll
2 min read
Introduction to Python Black Module
Python, being a language known for its readability and simplicity, offers several tools to help developers adhere to these principles. One such tool is Black, an uncompromising code formatter for Python. In this article, we will delve into the Black module, exploring what it is, how it works, and wh
5 min read
How to Install "Python-PyPDF2" package on Linux?
PyPDF2 is a Python module for extracting document-specific information, merging PDF files, separating PDF pages, adding watermarks to files, encrypting and decrypting PDF files, and so on. PyPDF2 is a pure python module so it can run on any platform without any platform-related dependencies on any e
2 min read
Python Introduction
Python was created by Guido van Rossum in 1991 and further developed by the Python Software Foundation. It was designed with focus on code readability and its syntax allows us to express concepts in fewer lines of code.Key Features of PythonPythonâs simple and readable syntax makes it beginner-frien
3 min read
Introduction to PyVista in Python
Pyvista is an open-source library provided by Python programming language. It is used for 3D plotting and mesh analysis. It also provides high-level API to simplify the process of visualizing and analyzing 3D data and helps scientists and other working professionals in their field to visualize the d
4 min read