MP Final Report
Contents
1 Problem Statement
2 Motivation
3 Dataset
4 Methods
4.1 Random Forest
4.1.1 Challenges Faced
4.2 Neural Network
4.2.1 Challenges Faced
5 Experiments
5.1 Fixed Thresholding
5.2 Adaptive Thresholding
5.3 Canny Edge Detection and Morphology
5.4 Median Filtering
5.5 Random Forest Regression
5.6 Artificial Neural Network
6 Conclusion
7 Future Work
List of Figures
1 Using Random Forest a)
2 Using Random Forest b)
3 Artificial Neural Network [4]
4 Neural Network a)
5 Neural Network b)
6 Fixed Thresholding on Real World Image a)
7 Fixed Thresholding on Real World Image b)
8 Fixed Thresholding on Real World Image c)
9 Adaptive Thresholding on Real World Image a)
10 Adaptive Thresholding on Real World Image b)
11 Adaptive Thresholding on Real World Image c)
12 Canny Edge Detection and Morphology on Real World Image a)
13 Canny Edge Detection and Morphology on Real World Image b)
14 Canny Edge Detection and Morphology on Real World Image c)
15 Median Filtering on Real World Image a)
16 Median Filtering on Real World Image b)
17 Median Filtering on Real World Image c)
18 Artificial Neural Network on Real World Image a)
19 Artificial Neural Network on Real World Image b)
20 Artificial Neural Network on Real World Image c)
List of Tables
1 Table with methods and their RMSE scores
1 Problem Statement
Given a dataset of images of scanned text (synthetic images) that are “noisy” with
stains and wrinkles, we propose to clean up the noise and help with the digitization
process.
2 Motivation
Optical Character Recognition (OCR) is the process of getting typed or handwritten
documents into a digitized format. The motivation for converting to a digitized format
is to ensure security, accessibility, editability, and ease of searching and sharing. Also,
digital documents don’t get dirty and cannot be ruined by coffee stains. [2]
Unfortunately, a lot of documents eager for digitization are being held back. Coffee
stains, faded sun spots, dog-eared pages, and lots of wrinkles are keeping some
printed documents offline and in the past. We were interested in speeding up this
process and hence chose this topic.
3 Dataset
Kaggle provided a dataset which consists of two sets of images: train and test.
These images contain various styles of text, to which synthetic noise has been added
to simulate real-world, messy documents. The dirty images contain stains as well
as creased paper. The training set also includes the cleaned-up versions of the
training images (train_cleaned) [2]. By clean, we mean black letters on a white
background.
Additionally, a set of real images was procured, which contained stains and
creases. We tested on these images to check whether the algorithms developed using
simulated data can be applied to "real-world" messy documents.
Kaggle calculates the score as the root-mean-squared error (RMSE) between the
pixels of the generated output and the corresponding pixels of the actual cleaned image.
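A minimal sketch of this score for a single image, assuming both images are stored as lists of rows with intensities scaled to [0, 1]. The function name `rmse` and the toy values are illustrative and not part of Kaggle's tooling.

```python
import math

def rmse(predicted, target):
    """Root-mean-squared error between two equally sized grayscale images."""
    squared_error = 0.0
    count = 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            squared_error += (p - t) ** 2
            count += 1
    return math.sqrt(squared_error / count)

# Toy 2x2 example (made-up values): one pixel out of four is off by 0.5.
output = [[1.0, 0.0], [0.5, 1.0]]
clean = [[1.0, 0.0], [1.0, 1.0]]
score = rmse(output, clean)
```

A lower score means the generated output is closer to the cleaned image; a perfect reconstruction scores 0.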
4 Methods
In the midterm progress report [5], we tried out some image processing techniques,
some of which worked well in removing the noise while others were not so effective.
Here, we propose methods that involve Machine Learning and Neural Networks as
theorized in [1] and [3].
Algorithm:
• Pad each image with a one-pixel border, i.e., an N×N image becomes (N + 2)×(N + 2).
• Run a 3×3 sliding window over the image, so that every pixel of the
original image becomes the centre of the sliding window once.
• Use all 9 pixels within the sliding window as predictors for the pixel at the
centre of the window, i.e., the pixels in the window of the dirty image act as
features for predicting the centre pixel of the same window in the cleaned
image.
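The steps above can be sketched as follows. This is a minimal illustration in plain Python, not our actual implementation. The padding mode (replicating edge pixels) is an assumption, since the steps only fix the padded size, and the helper names `pad_image` and `window_features` are hypothetical.

```python
def pad_image(image, pad=1):
    """Pad a grayscale image (list of rows) by replicating its edge pixels."""
    rows = [[row[0]] * pad + list(row) + [row[-1]] * pad for row in image]
    top = [list(rows[0]) for _ in range(pad)]
    bottom = [list(rows[-1]) for _ in range(pad)]
    return top + rows + bottom

def window_features(image):
    """Return one 9-element feature vector per pixel of the original image."""
    padded = pad_image(image, pad=1)
    height, width = len(image), len(image[0])
    features = []
    for r in range(height):
        for c in range(width):
            # Because of the one-pixel border, the 3x3 window centred on
            # original pixel (r, c) starts at (r, c) in the padded image.
            window = [padded[r + dr][c + dc]
                      for dr in range(3) for dc in range(3)]
            features.append(window)
    return features

# Toy 3x3 "dirty" image with a dark pixel in the middle (made-up values).
dirty = [[0.9, 0.8, 0.9],
         [0.7, 0.1, 0.8],
         [0.9, 0.8, 0.9]]
X = window_features(dirty)
```

Each row of `X` would be paired with the matching pixel of the cleaned image as the regression target, for example when fitting scikit-learn's `RandomForestRegressor`.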
(a) Original Image (b) Cleaned Image
While this method succeeds in removing the stains [2], it does not work very well
with dog-ears and creases [1]; in fact, random forest makes them worse. It looks as
if random forest takes the stain and sprinkles it across the entire image, so that the
stain is no longer concentrated in one particular spot but is milder and more widespread.
This, as one can see from the cleaned image, is not conducive to reading and thus
will not help us in our goal of converting to a digitized format for future use.
act(input ∗ W + b)
where act is typically some sort of sigmoid function.
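As a small worked example of this formula, the sketch below evaluates one layer in plain Python. The function name `dense_layer`, the choice of `math.tanh` as the default activation, and the toy weights are assumptions made for illustration.

```python
import math

def dense_layer(inputs, weights, biases, act=math.tanh):
    """Compute act(input * W + b) for one layer.

    inputs  : list of n input values
    weights : n x m matrix (list of n rows, each with m columns)
    biases  : list of m bias values
    """
    outputs = []
    for j in range(len(biases)):
        total = sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        outputs.append(act(total + biases[j]))
    return outputs

# Toy layer with 2 inputs and 2 neurons (made-up numbers).
x = [0.5, -1.0]
W = [[1.0, 0.0],
     [0.5, 2.0]]
b = [0.0, 1.0]
hidden = dense_layer(x, W, b)
```

Here the first neuron receives 0.5·1.0 + (−1.0)·0.5 + 0.0 = 0 and the second receives 0.5·0.0 + (−1.0)·2.0 + 1.0 = −1, so `hidden` holds tanh(0) and tanh(−1).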
The activation function of the input layer is the tanh function, while the activation
function for the hidden layer is the clip function of Theano, which clips the value
to the given minimum and maximum, i.e.

    def clip(x, minx, maxx):
        # Limit x to the closed interval [minx, maxx].
        if x < minx:
            return minx
        elif x > maxx:
            return maxx
        return x
The input layer has 29 neurons (one per feature), the hidden layer contains 10
neurons, and the output layer has a single neuron, which gives the pixel
brightness.
Before passing the images to the neural network, we first calculate the features of
each image. For every pixel, we take the pixels in the 5×5 window centred on it as
features, which gives a feature vector of 25 values. We also apply some initial image
processing and use the outputs as additional features: a median blur with kernel
size 5, a median blur with kernel size 25, and the first and second derivatives of the
image computed with the Sobel operator. These 4 values are appended to the 25
neighbourhood values, for a total of 29 features per pixel. The feature vectors are
combined into a feature matrix for the image, which is given to the neural network.
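The feature construction can be sketched as follows. This is a minimal illustration in plain Python under several assumptions that the description leaves open: borders are handled by replicating edge pixels, the Sobel derivative is taken in the horizontal direction, and the second derivative is approximated by applying the Sobel kernel twice. All function names are hypothetical.

```python
from statistics import median

def pad_edge(image, pad):
    """Replicate border pixels so windows fit at the image edges."""
    rows = [[row[0]] * pad + list(row) + [row[-1]] * pad for row in image]
    top = [list(rows[0]) for _ in range(pad)]
    bottom = [list(rows[-1]) for _ in range(pad)]
    return top + rows + bottom

def windows(image, size):
    """Yield (r, c, flat size*size window) for every pixel of the image."""
    padded = pad_edge(image, size // 2)
    for r in range(len(image)):
        for c in range(len(image[0])):
            yield r, c, [padded[r + i][c + j]
                         for i in range(size) for j in range(size)]

def median_blur(image, size):
    """Median filter with a size x size kernel."""
    out = [[0.0] * len(image[0]) for _ in image]
    for r, c, win in windows(image, size):
        out[r][c] = median(win)
    return out

SOBEL_X = [-1, 0, 1, -2, 0, 2, -1, 0, 1]  # horizontal Sobel kernel, row-major

def sobel_x(image):
    """Horizontal first derivative using the 3x3 Sobel kernel."""
    out = [[0.0] * len(image[0]) for _ in image]
    for r, c, win in windows(image, 3):
        out[r][c] = sum(k * v for k, v in zip(SOBEL_X, win))
    return out

def feature_matrix(image):
    """One 29-element vector per pixel: 25 neighbours + 4 filter outputs."""
    blur5 = median_blur(image, 5)
    blur25 = median_blur(image, 25)
    d1 = sobel_x(image)
    d2 = sobel_x(d1)  # assumption: second derivative as Sobel applied twice
    return [win + [blur5[r][c], blur25[r][c], d1[r][c], d2[r][c]]
            for r, c, win in windows(image, 5)]

# Tiny synthetic image: white page with one dark pixel (made-up data).
image = [[1.0] * 6 for _ in range(6)]
image[2][3] = 0.0
features = feature_matrix(image)
```

The rows of `features` follow the image in row-major order, so the vector for pixel (r, c) of a 6-pixel-wide image sits at index 6·r + c.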
Central Idea:
• Output is the de-noised pixel (i.e) the intensity of the cleaned pixel.
We train the neural network using a naive gradient descent learning algorithm
with the entire data-set.
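To illustrate what full-batch gradient descent means, the sketch below trains a single linear neuron with a mean-squared-error loss, accumulating the gradient over the entire data set before every update. It is a simplified stand-in, not our network: the model, learning rate, epoch count, and toy data are all assumptions.

```python
def predict(weights, bias, x):
    """Output of a single linear neuron."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def mean_squared_error(features, targets, weights, bias):
    return sum((predict(weights, bias, x) - y) ** 2
               for x, y in zip(features, targets)) / len(targets)

def train(features, targets, learning_rate=0.1, epochs=200):
    """Full-batch gradient descent on the MSE loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        # Accumulate the gradient over every sample before updating.
        for x, y in zip(features, targets):
            error = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += 2.0 * error * xi / n_samples
            grad_b += 2.0 * error / n_samples
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias

# Toy data following y = 0.5 * x + 0.2 (made up for illustration).
features = [[0.0], [0.25], [0.5], [0.75], [1.0]]
targets = [0.5 * x[0] + 0.2 for x in features]
initial_loss = mean_squared_error(features, targets, [0.0], 0.0)
weights, bias = train(features, targets)
final_loss = mean_squared_error(features, targets, weights, bias)
```

Because every update uses all samples, each epoch costs one pass over the whole data set, which is simple but slow for large images.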
(a) Original Image (b) Cleaned Image
As you can see from Figures 4 and 5, the creases are almost invisible to the eye,
while the stains are faded to the point that only faint patches are visible.
The RMSE score on Kaggle is 0.03363.
5 Experiments
We tested these methods and the methods described in [5] on "real-world" data,
i.e., actual images of paper documents with stains. The results varied considerably,
as shown below.
5.1 Fixed Thresholding
(a) Original Image (b) Cleaned Image
Fixed thresholding does not help in cleaning stains. As seen in the figures above,
shadows affect the result when the image is binarised with a fixed threshold.
The stains are darkened completely, which makes the result worse.
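A minimal sketch of why this happens, assuming intensities in [0, 1] and a threshold of 0.5. Both the threshold and the pixel values are made up for illustration.

```python
def fixed_threshold(image, threshold=0.5):
    """Binarise a grayscale image: pixels above the threshold become white."""
    return [[1.0 if pixel > threshold else 0.0 for pixel in row]
            for row in image]

# Made-up intensities: paper 0.9, shadowed paper 0.45, stain 0.4, ink 0.1.
scan = [[0.9, 0.1, 0.9],
        [0.45, 0.1, 0.45],
        [0.4, 0.4, 0.9]]
binary = fixed_threshold(scan)
```

Any shadow or stain pixel that falls below the single global threshold is mapped to black, just like ink, so those regions become solid dark patches.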
(a) Original Image (b) Cleaned Image
5.3 Canny Edge Detection and Morphology
Figure 12: Canny Edge Detection and Morphology on Real World Image a)
(a) Original Image (b) Cleaned Image using Dilation
Figure 13: Canny Edge Detection and Morphology on Real World Image b)
(a) Original Image (b) Cleaned Image using Dilation
Figure 14: Canny Edge Detection and Morphology on Real World Image c)
Canny edge detection with morphological operations seems to remove some of the stains,
but it either thickens the text or thins it to the point of illegibility. The goal is to
remove the stains and keep the text as it is.
5.4 Median Filtering
(a) Original Image (b) Cleaned Image
As seen in the figures above, median filtering removes most of the coffee stains and
the rest of the background noise from the document, leaving a little noise here and there.
The contrast of the image is also degraded. This may be due to the subtraction of
the background from the original image.
5.6 Artificial Neural Network
(a) Original Image (b) Cleaned Image
The ANN is able to remove coffee stains easily. While some small stains remain in the
image, they do not affect the readability of the paper. In the original image, where
stains cover the text, the ANN succeeds in removing only the stains in most cases. In
others, the text is removed along with the stains, but these cases are few and far between.
6 Conclusion
Comparing the results of all the methods listed, we find that the ANN works best. It
removes the stains and creases, and the result is readable. While the other methods remove
stains, the text is quite hard to decipher as it is blurred or the ink is too thin. The
RMSE values between the output of each method on the test data and the original image
are listed in the table below:
Method                        RMSE (%)
Fixed Thresholding            35.173
Adaptive Thresholding         42.228
Canny Edge (Dilation)         51.638
Canny Edge (Erosion)          36.547
Median Blur                   55.096
Random Forest Regressor       32.492
Artificial Neural Network      3.363
7 Future Work
The images that we were able to clean are images of English text. We plan on
expanding this to cover text in other languages, figures, and combinations of text
and figures.
We have a tentative plan to create an Android application that can remove stains and
creases using the above-mentioned methods.
References
[1] Colin blog. https://colinpriest.com/2015/09/07/denoising-dirty-documents-part-6/. Accessed: 2017-05-14.
[5] Dokania S. Reddy V. Manoj R., Vats U. Denoising dirty documents. http://tinyurl.com/mbl4p66, 2017.