Document Classification Using Machine Learning
SJSU ScholarWorks
Master's Projects Master's Theses and Graduate Research
Spring 5-25-2017
Recommended Citation
Basarkar, Ankit, "DOCUMENT CLASSIFICATION USING MACHINE LEARNING" (2017). Master's Projects. 531.
http://scholarworks.sjsu.edu/etd_projects/531
Ankit Basarkar
DOCUMENT CLASSIFICATION USING MACHINE LEARNING
A Project Report
Presented to
In Partial Fulfillment
by
ANKIT BASARKAR
MAY 2017
© 2017
Ankit Basarkar
REPORT ON DOCUMENT CLASSIFICATION USING MACHINE LEARNING
ABSTRACT
To perform document classification algorithmically, documents need to be represented
in a form that a machine learning classifier can understand. The report discusses
the different types of feature vectors through which documents can be represented and
later classified. The project aims at comparing the Binary, Count and TfIdf feature
vectors and their impact on document classification. To test how well each of the three
feature vectors performs, we used the 20-newsgroup dataset and converted the
documents into all three feature vectors. For each feature vector representation,
we trained the Naïve Bayes classifier and then tested the generated classifier on test
documents. In our results, we found that TfIdf performed 4% better than the Count
vectorizer and 6% better than the Binary vectorizer if stop words are removed. If stop words
are not removed, then TfIdf performed 6% better than the Binary vectorizer and 11% better
than the Count vectorizer. Also, the Count vectorizer performs 2% better than the Binary
vectorizer if stop words are removed but lags behind by 5% if they are not.
Thus, we conclude that TfIdf should be the preferred vectorizer for document
representation and classification.
ACKNOWLEDGEMENTS
This project work could not have been possible without the help of friends,
family members and the instructors who have supported me and guided me throughout
the project work.
I would like to specifically thank my project advisor, Dr. Leonard Wesley for
guiding me through the course of project work. This project could not have been
possible without his continuous efforts and his wisdom. Along with his support, it
was his perseverance that pushed me to perform better during the project and
significantly contributed towards its completion on schedule.
I would also like to thank the members of the Committee, Dr. Robert Chun
and Mr. Robin James for their continuous guidance and support. It gives you great
encouragement when you know that there are people whom you can reach out to
whenever you get stuck at something.
Finally, I would like to thank my parents Mr. Shirish Basarkar and Mrs.
Arpita Basarkar, family members, and friends for their perennial encouragement and
emotional support.
TABLE OF CONTENTS
6 RESULT ................................................................................................................ 39
REFERENCES............................................................................................................. 47
APPENDICES .................................................................................................................. 49
List of Figures
Figure 1: An example of the business news group: (a) a training sample; (b) extracted
word list........................................................................................................................ 13
Figure 2: Document Classification using SVM ........................................................... 18
Figure 3 : Two-dimensional representation of documents in 2 classes separated by a
linear classifier. ............................................................................................................ 19
Figure 4 : Artificial Neural Net .................................................................................... 26
List of Tables
Table 1: Comparison of the four classification algorithms (NB, NN, DT and SS) 15
Table 2: Classification accuracies of combinations of multiple classifiers 15
Table 3: Precision/recall-breakeven point on the ten most frequent Reuters categories
and micro averaged performance over all Reuters categories. 17
Table 4 : Classification Accuracy of Standard Algorithms 21
Table 5 : Topic classification rates with random forests for 7 topics 24
Table 6 : Topic classification rates with other algorithms for 5 topics 24
Table 7 : Topic classification rates with other algorithms for 7 topics 24
Table 8 : Test Data Summary 27
Table 9 : Macro Averaging Results 27
Table 10 : Categories of Newsgroup Data Set 33
Table 11 : Comparison of Results 41
Table 12 : Classification report of Test Documents using TfIdf Vectorizer and Naive
Bayes Classifier 42
Table 13 : Schedule employed for Project 45
There are two main factors which contribute to making document classification
a challenging task: (a) feature extraction; (b) topic ambiguity. First, feature extraction
deals with taking out the right set of features that accurately describe the document
and help in building a good classification model. Second, many broad-topic documents
are so complicated that it becomes difficult to place them in any specific
category. Consider a document that talks about theocracy. For such a document, it
would be tough to decide whether the document should be placed under the category
of politics or religion. Also, broad-topic documents may contain terms that have
different meanings in different contexts and may appear multiple times within a
document in different contexts [4].
Some of the techniques that are employed for document classification include
Expectation maximization, the Naïve Bayes classifier, Support Vector Machines,
Decision Trees, Neural Networks, etc. A number of applications make use of these
techniques for document classification.
This section briefly discusses the work done by various authors, students and
researchers in the area under discussion, which is document classification using
machine learning algorithms. The purpose of this section is to critically summarize
the current knowledge in the field of document classification.
They employed Naïve Bayes, Decision Trees, Nearest neighbor classifier and
the Subspace method for classification. They also performed classification using the
combination of these algorithms.
To form the feature vector for each document, they made use of two approaches. The first
is the binary approach, where for each word in the vocabulary the value 1 is assigned
if the word exists in the document D and 0 if it does not. In the second approach, the
frequency of each word is used to form the feature vector.
In this paper, they used the binary representation for the Naïve Bayes and Decision
Tree methods, whereas they used the frequency representation for the Nearest neighbor
and Subspace method classifiers to calculate the weight of each term.
To test the classifiers, they used two different test data sets taken at different
time intervals. The authors first tested all four machine learning algorithms
individually on both test data sets. Then they tested the three combinations on both
test data sets.
In their experiments, all four machine learning algorithms performed well.
Naïve Bayes gave the highest accuracy for the first test data set but was outperformed
by the subspace method on the second test data set.
Conclusion: The authors concluded that all four classifiers performed
reasonably well on the Yahoo data set. Of the four algorithms, the Naive Bayes
method gave the highest accuracy. The authors also noticed that a combination of
classifiers does not guarantee much improvement over the individual classifiers.
2.2 Text Categorization with SVM: Learning with Many Relevant Features
Introduction: In this paper [2] the author Thorsten Joachims explored and
identified the benefits of Support Vector Machines (SVMs) for text categorization.
To represent the documents, the author made use of word counts. Thus each document
was represented as a vector of integers, where each integer represented the number of
times the corresponding word occurred in the document. To avoid large feature vectors,
the author only considered as features those words that occurred more than three times
in the document. The author also made sure to eliminate stop words when making
feature vectors.
Data Set and Experiments: The author used two data sets for the model. The
first data set the author used was the ModApte split of the Reuters-21578 dataset,
which was compiled by David Lewis. This dataset contained 9603 training documents and
3299 test documents. The dataset contained 135 categories, of which only 90 were used
since only those 90 had at least one training and one test sample.
The second data set employed for model creation and evaluation was the
Ohsumed corpus, compiled by William Hersh. The author used 10000
documents for training and another 10000 for testing from the corpus, which had around
50000 documents. The classification task on this data set was to assign each document
to one of 23 MeSH disease categories.
The author compared the performance of SVMs with Naïve Bayes, Rocchio,
C4.5, and KNN for text categorization. The author used polynomial and RBF kernels
for the SVM. The Precision/Recall Breakeven Point is used as a measure of performance,
and micro-averaging is done to get a single performance value across all classification
tasks. The author also ensured that the results are not biased towards any particular
method: all four methods were run after selecting the best 500, 1000, 2000, or 5000
features, or all features, based on information gain.
Conclusion: The author concludes that among the conventional methods, KNN
performed the best on the Reuters data set. On the other hand, SVM achieved the best
classification results and outperformed all the conventional methods by a good margin.
The author states that SVM can perform well in high dimensional space and thus does
not mandate feature selection which is almost always required by other methods. The
author also concludes that SVMs are quite robust and performed well in virtually all
experiments.
The author observed similar results for the Ohsumed collection data set as well.
The results demonstrated that k-NN performed the best among conventional methods
whereas SVM outperformed all the other conventional classifiers.
This paper mainly deals with why we need automatic document classification,
which is similar to the motivation posted above regarding document classification by
machine learning. The paper covers the inner workings of the Support Vector Machine,
its application to classification, and its accuracy compared to manual classification.
Thus, in a way, this paper is an extension of the motivation illustrated above regarding
why we need document classification using machine learning.
In the training phase of the SVM, the algorithm is fed with labeled documents of
both categories. Internally, the SVM converts each document into a data point in high-
dimensional space. These points represent the documents. The algorithm then tries to
find a hyperplane (separator) between these points such that it separates the data
points of the two categories with maximum margin. Later, the hyperplane (also called
"the model") is recorded and used for classification of new documents. As seen in figure
3, the hyperplane divides the data points of the red and blue classes.
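The training procedure described above can be sketched with a linear SVM from scikit-learn (a hedged illustration only; the toy 2-D points below stand in for document vectors and are not from the paper):

```python
from sklearn.svm import LinearSVC

# Toy 2-D points standing in for document vectors of two categories.
X = [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],   # category 0
     [3.0, 3.0], [3.5, 2.5], [2.5, 3.5]]   # category 1
y = [0, 0, 0, 1, 1, 1]

# Fit a maximum-margin linear separator (the "hyperplane").
model = LinearSVC()
model.fit(X, y)

# The recorded hyperplane is then used to classify new points.
print(model.predict([[0.2, 0.3], [3.2, 3.0]]))
```

On linearly separable data like this, the learned hyperplane separates the two classes with maximum margin, which is the behavior figure 3 depicts.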
Introduction: In the paper [4] written by Russell Power, Jay Chen, Trishank
Karthik and Lakshminarayanan Subramanian, they propose a combination of feature
extraction and classification algorithms for classification of documents. In this paper,
they propose a simple feature extraction algorithm for development centric topics which
when coupled with standard classifiers yields high classification accuracy.
Popularity and Rarity:
The popularity of words can be described as how popular a word is for a certain
category of documents. For a given set of documents, this metric determines a list of
words that occur most frequently in the documents and are closely related to the topic.
The rarity metric, on the other hand, captures the list of least frequent terms that
are closely related to the topic. They leveraged the Linguistic Data Consortium (LDC)
data set, which records the frequency of occurrence of n-grams on the Web, to measure
the rarity of any given term. Although the LDC data set they used is slightly old, a
separate study observed that the rarity of terms with respect to any category is
preserved, so the relation does not become obsolete.
Data Set and Experiments: The dataset they used for classification was the “4
University” set from WebKB. The data set contained pages from several universities
that were then grouped into seven categories namely: student, course, staff, faculty,
department, project and other. There can be ambiguity among some of these categories.
Conclusion: With their feature extraction approach, they were able to get above
99% precision in rejecting unrelated documents. They also got 95% recall for the
selection of related documents. The standard classifiers, when combined with their
feature extraction algorithm, gave some interesting results. The figure below shows
how well the standard classifiers did after incorporating their feature extraction.
Introduction: In the work [5] done by Karel Fuka and Rudolf Hanka, they stress
how important feature set reduction is for document classification and how doing it
in an organized manner can improve the accuracy of the designed model.
Feature Set Reduction: There are two ways in which feature reduction is
performed: feature selection and feature extraction.
Data Set and Experiments: The authors used the Reuters-21578 dataset for
training and testing the classifier. Some of the feature reduction techniques employed
by the authors were the chi-squared (χ²) statistic and PCA. The authors started with
3822 original terms. Using the χ² statistic for feature selection gave an accuracy of
approximately 81%. The authors used PCA to test feature extraction. PCA applied to the
features obtained through the χ² statistic gave an accuracy of 86%, and PCA applied
over the complete feature set gave an accuracy of 95%.
Conclusion: The authors concluded that all feature set reduction algorithms
perform better than no feature reduction. Also, an appropriate feature extraction
algorithm can perform better than a feature selection algorithm.
Introduction: In the paper [6], Myungsook Klassen and Nikhila Paturi present a
comparative study of web document classification. The authors' prime focus was on the
random forest. Apart from the random forest, the authors also used
Naïve Bayes, Multilayer perceptron, RBF networks and regression for classification.
The authors also studied the effect of the addition of topics on the accuracy of the
model. To test this, they performed experiments on data sets containing 5 topics and
7 topics.
Data Set: The authors used Dmoz Open Directory Project (ODP) data set for
their experiments. The data set contains pre-classified web documents that are part of
the open content directory. The directory uses hierarchical ordering where each
document is listed in a category based on its content. The authors used 5 and 7
categories for their experimentations.
Experiments: In the first experiment, the authors analyzed the role of the number
of trees ("numTrees") and the number of features ("numFeatures") of the random forest.
The authors used a tree depth of 0; that is, the trees were allowed to grow as deep as
possible. They used 20, 30 and 40 trees in the random forest, and 0 to 60 features, to
detect which setting gives the highest accuracy. For this experiment, the authors got
the highest accuracy of 83.33% when the number of trees was 20 and the number of
features was 50. They got this accuracy for 5 topics. For 7 topics, the best
classification accuracy dropped to 80.95% with 40 trees and 35 features.
Conclusion: The authors concluded that even though the other machine learning
algorithms performed well, the random forest outperformed them all. Also,
as the number of topics increased, the accuracy of the other algorithms declined steeply
compared to the random forest. The authors also found that some of the other algorithms
did not scale well. For instance, the multilayer perceptron ran ten times slower than
all the other algorithms.
Data Set: The authors used Reuters News data set for their comparative study.
As the name suggests, the Reuters-21578 dataset contains a collection of 21,578 news
items that are divided across 118 categories.
ANN: An Artificial Neural Network imitates the actual working of neurons in the
human brain. In an ANN, an impulse is modeled by a vector value, and the change of
impulse is modeled using a transfer function. A sigmoid, stepwise or even a linear
function can serve as the transfer function.
Data Set Preprocessing: The authors converted the SGML documents into
XML documents using the SX tool. The authors removed the documents that belonged
to no category or to multiple categories. After this, the authors removed the
categories that had fewer than 15 documents left. The elimination left 63 categories
containing 11,327 documents.
2.8 Enhancing Naive Bayes with Various Smoothing Methods for Short Text
Classification
Introduction: In the work [8] done by Quan Yuan, Gao Cong and Nadia M.
Thalmann they experimented with the application of various smoothing techniques in
implementing the Naïve Bayes classifier for short text classification.
Naïve Bayes: Nowadays, millions of new documents are generated and
published every second. Hence, it is important that the employed classification method
be able to accommodate new training data efficiently and to classify new text
efficiently. The Naive Bayes (NB) method is known to be a robust, effective and
efficient technique for text classification. More importantly, it can accommodate new
incoming training data into classification models incrementally and efficiently.
The prior probability of a category ci is estimated as

p(ci) = |ci| / |C|

where |ci| is the number of questions in ci, and |C| is the total number of questions in the
collection. For NB, the likelihood p(wk|ci) is calculated by Laplace smoothing as follows:

p(wk|ci) = (c(wk, ci) + 1) / (Σw c(w, ci) + |V|)

where c(w, ci) is the frequency of word w in category ci, and |V| is the size of the
vocabulary. For different smoothing methods, p(wk|ci) is computed differently.
The authors consider four smoothing methods used in language models for
information retrieval. Let c(w, ci) denote the frequency of word w in category ci, and
p(w|C) the maximum likelihood estimate of word w in collection C.
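The Laplace-smoothed likelihood above can be computed directly. The following sketch uses invented word counts for a single category ci; only the formula itself comes from the text:

```python
from collections import Counter

def laplace_likelihood(word, category_counts, vocab_size):
    """p(w|ci) = (c(w, ci) + 1) / (sum_w c(w, ci) + |V|)."""
    total = sum(category_counts.values())
    return (category_counts[word] + 1) / (total + vocab_size)

# Invented counts for one category and a vocabulary of size 6.
counts = Counter({"travel": 3, "flight": 2, "hotel": 1})
p_seen = laplace_likelihood("travel", counts, vocab_size=6)
p_unseen = laplace_likelihood("health", counts, vocab_size=6)  # never seen in ci

# Smoothing keeps unseen words from getting zero probability.
print(p_seen, p_unseen)   # 4/12 and 1/12
```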
Data Set and Experimentation: They used the Yahoo! Webscope dataset, which
comprises 3.9M questions belonging to 1,097 categories (e.g., travel, health) from a
Community-based Question Answering (CQA) service, as an example of short texts, to
study question topic classification.
They randomly selected 20% of the questions from each category of the whole dataset
as test data. From the remaining 80%, they generated 7 training data sets with
sizes of 1%, 5%, 10%, 20%, 40%, 60%, and 80% of the whole data, respectively. They
applied the smoothing algorithms discussed before and did a comparative study.
The purpose of the project work was to determine the best way in which the
features of documents can be represented to achieve improved document classification
accuracy.
For the objective mentioned above, we studied how the count vectorizer
works and how term frequency is evaluated for each document. We also studied how
to represent each document as a sparse matrix based purely on term frequency.
We noted a flaw with this representation: irrespective of how common or rare a word is,
we calculate just the frequency of that word in each document. Thus, we assign a value
to each feature of the document (a feature in our case being a word) purely based on
how many times it occurs. We do not take into account how distinguishing the word is.
To elaborate on this further, let us put forward a simple example.
Let us say we have a document that contains the word ‘catch’ 10 times and the
word ‘baseball’ 2 times. If we just used term frequency, we would give more weight
to the word ‘catch’ than to the word ‘baseball’ since it occurs more frequently in
the document. However, the word ‘catch’ might occur frequently across multiple
categories, whereas the word ‘baseball’ might occur in very few categories, namely
those related to sports or baseball. Thus, the word ‘baseball’ is a more distinguishing
feature of the document. With inverse document frequency, we determine the
distinguishability of each word, which we then multiply with the term frequency to get
the new weight of each word in the document.
4 EXPERIMENTAL DESIGN
Almost all documents across all categories contain words like ‘The’, ‘A’,
‘From’, ‘To’, etc. These words might hamper the classification task and might skew
the classification results. Removing these words may give more accurate
classification results.
Some words originate from another word. In such cases, it is better to
consider all forms of a root word as the root word itself. Let us say we have
the following words with frequencies in a given document; walk: 3, walked: 4,
walking: 6. Instead of considering all three forms separately, we can
consider the root word walk with a frequency of 13, since all three words signify
the same meaning in a different tense. Stemming words before feature
representation can lead to better classification accuracy.
This is one of the most important experiments of the project work and the core of
the hypothesis posited above. The count vectorizer considers the frequency of words
occurring in a document irrespective of how distinguishing those words are. Instead,
the documents could be better represented if we consider the term frequency along with
how distinguishing the term is. Such a representation should also help improve the
overall accuracy of the model.
After getting the feature representation of all documents, the classifier is trained
on the training set of documents (feature vectors of training documents).
The first step of our research/project work was determining the right data set.
We came across many data sets, such as the Reuters data set and the Yahoo data set.
We selected the 20 Newsgroup dataset collected by Ken Lang for our task, for several
reasons: a) the data set is large, so working with it is intriguing; b) the number of
categories in the data set is 20, as opposed to most data sets, which contain binary
categories; c) the documents are quite evenly divided among the 20 categories.
Organization: The data set has 20 categories, of which some are very closely
related, like ‘talk.religion.misc’ and ‘soc.religion.christian’, whereas some are
entirely distinct, like ‘rec.sport.baseball’ and ‘sci.space’. The table below lists
all the categories that comprise the data set.
Unlike other data sets, which are generally CSV files containing comma-separated
values that can be loaded easily, loading this data set was a bit convoluted.
The data set contained two primary folders named ‘train’ and ‘test’. Each folder
further contained 20 folders, one for each category of documents. Inside these folders
were the actual documents. Each category contained around 600 train documents and
around 400 test documents.
All the train documents were loaded into a single Bunch object, which contained
the actual documents in a list. The Bunch object for the train data also contained a
list, of length equal to the documents list, holding the corresponding category of
each document. The Bunch object also contained a map object that maps each category
string to an integer literal, thus representing each category with an integer. The
documents added to the Bunch object are also shuffled, to distribute the categories
evenly across the list. Similarly, a Bunch object for the test data set was also
created.
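In scikit-learn, `sklearn.datasets.load_files` performs exactly this folder-per-category loading and returns such a Bunch. A stdlib sketch of the same idea (the folder names below are placeholders for the real train/test layout):

```python
from pathlib import Path
import random
import tempfile

def load_folder(root):
    """Load a folder laid out as root/<category>/<doc files>, mimicking
    a Bunch: returns (documents, integer targets, target_names)."""
    root = Path(root)
    target_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    docs, targets = [], []
    for label, category in enumerate(target_names):
        for path in sorted((root / category).iterdir()):
            docs.append(path.read_text(errors="ignore"))
            targets.append(label)          # category stored as an integer
    pairs = list(zip(docs, targets))
    random.shuffle(pairs)                  # mix the categories evenly
    docs, targets = zip(*pairs)
    return list(docs), list(targets), target_names

# Demo on a throwaway two-category layout (stand-in for the train folder).
root = Path(tempfile.mkdtemp())
for category, text in [("rec.sport.baseball", "great catch"),
                       ("sci.space", "rocket launch")]:
    (root / category).mkdir()
    (root / category / "doc1.txt").write_text(text)

docs, targets, names = load_folder(root)
print(names, sorted(targets))
```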
After cleaning the dataset, the next step is the creation of the vocabulary.
The vocabulary is the set of all words that occur at least once in the training set
of documents.
To better understand the concept, consider an example. Let us say we have two
documents D1 and D2.
D1: “A system of government in which priests rule in the name of God is termed as
Theocracy.” – “Christianity.”
D2: “A ball game played between two teams of nine on a field with a diamond-shaped
circuit of four bases is termed as Baseball.” – “Baseball.”
Here the Vocabulary V will be a set containing all words that occur at least once.
The Vocabulary formed will be;
V: {‘A’, ‘system’, ‘of’, ‘government’, ‘in’, ‘which’, ‘priests’, ‘rule’, ‘the’, ‘name’,
‘God’, ‘is’, ‘termed’, ‘as’, ‘Theocracy’, ‘ball’, ‘game’, ‘played’, ‘between’, ‘two’,
‘teams’, ‘nine’, ‘on’, ‘field’, ‘with’, ‘diamond’, ‘shaped’, ‘circuit’, ‘four’, ‘bases’,
‘Baseball’}
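Building this vocabulary is a set union over the tokenized documents. A minimal sketch using D1 and D2 (the naive regex tokenizer is an assumption, not the project's exact preprocessing):

```python
import re

D1 = ("A system of government in which priests rule in the name of God "
      "is termed as Theocracy.")
D2 = ("A ball game played between two teams of nine on a field with a "
      "diamond-shaped circuit of four bases is termed as Baseball.")

def tokenize(text):
    # Split on non-letter characters; a real system would do more.
    return re.findall(r"[A-Za-z]+", text)

# The vocabulary is the set of all words seen at least once.
vocabulary = set()
for doc in (D1, D2):
    vocabulary.update(tokenize(doc))

print(sorted(vocabulary))
```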
The problem with such Vocabulary is that it contains many stop words. Stop
words are the common English words that do not help in classification of documents at
all. Let us analyze the Vocabulary we just created. It already contains a lot of stop words
(SW).
SW: {‘A’, ‘of’, ‘in’, ‘which’, ‘the’, ‘is’, ‘as’, ‘on’, ‘with’}
These words have no relation to either of the two categories and may
appear in all categories whose documents are under investigation. Thus, in vocabulary
creation, these words are removed. The stop-word-removed vocabulary (SWRV)
will contain all the words that occur at least once in the training set of documents
except the stop words. For our example, the stop-word-removed vocabulary will look as
shown below.

SWRV: {‘system’, ‘government’, ‘priests’, ‘rule’, ‘name’, ‘God’, ‘termed’,
‘Theocracy’, ‘ball’, ‘game’, ‘played’, ‘between’, ‘two’, ‘teams’, ‘nine’, ‘field’,
‘diamond’, ‘shaped’, ‘circuit’, ‘four’, ‘bases’, ‘Baseball’}
Although the SWRV will work better than the simpler V, it can still be improved by
using stemming. Stemming is the process of converting inflected (changed form) words
to their stem words. Consider that, for our example, we get another document D3.
For D3, we must add the following words to our SWRV: {‘Square’, ‘shape’,
‘formed’, ‘four’, ‘edges’}. Notice that the word ‘shaped’ already exists in our SWRV.
The stem/root word of ‘shaped’ is ‘shape’, which we are now trying to add. Since both
words convey the same meaning and come from the same root word, it does not
make sense to keep both in the vocabulary. This addition to the vocabulary could have
been avoided if we had performed stemming. Stemming not only helps limit the
size of the vocabulary but also keeps the size of the feature vector in check,
as we will see later in the report.
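A real system would use a proper stemmer such as NLTK's PorterStemmer; the deliberately naive suffix-stripping sketch below is only meant to show how stemming merges the walk/walked/walking counts from the earlier example:

```python
from collections import Counter

def toy_stem(word):
    """A deliberately naive stemmer: strip two common suffixes.
    The real Porter algorithm handles far more cases (e.g. 'shaped' -> 'shape')."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# The frequencies from the example: walk: 3, walked: 4, walking: 6.
words = ["walk"] * 3 + ["walked"] * 4 + ["walking"] * 6
stemmed_counts = Counter(toy_stem(w) for w in words)
print(stemmed_counts)   # all 13 occurrences collapse onto 'walk'
```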
One of the simplest is the binary feature vector. In this method, for all the
words in the vocabulary, a word that occurs in the document at least once is counted
positive (1), whereas a word that does not occur is not counted (0). Thus, each
document is represented as a vector of words, with the value of each word mapped to
either 0 (if it does not occur in that document) or 1 (if it does occur in the document).
Since a document typically contains only a small fraction of the words in the
vocabulary, most of the entries in its feature vector will have the value ‘0’, wasting
a lot of storage space. To overcome this limitation, we make use of a sparse matrix.
In the sparse matrix, we store only the words whose value is non-zero, resulting in
significant storage savings.
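A sparse binary feature vector can be held in a plain dictionary that stores only the 1-valued entries. A minimal sketch with an invented vocabulary and document:

```python
# Invented vocabulary and document for illustration.
vocabulary = {"ball", "game", "government", "priests", "theocracy", "teams"}
document = "the game was a great game for both teams"

tokens = set(document.split())

# Sparse binary representation: store only the 1-valued features.
binary_features = {word: 1 for word in vocabulary if word in tokens}
print(binary_features)   # only 'game' and 'teams' are stored
```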
Though the binary feature vector is one of the simplest, it does not perform that
well. It captures whether certain words exist in the document, but it fails to capture
the frequency of those words. For this reason, the count vectorizer (also termed the
term frequency vectorizer) is generally preferred.
In the count vectorizer, we do not just capture the existence of words in a given
document but also capture how many times they occur. Thus, for each word in the
vocabulary that occurs in a document, we record the number of times it occurs, and the
document is represented as a vector of words along with their counts. For this
approach, too, a sparse matrix is used, since most of the words in the vocabulary will
have a frequency of ‘0’.
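This is what scikit-learn's CountVectorizer produces; a stdlib sketch of the same sparse count representation (vocabulary and document invented for illustration):

```python
from collections import Counter

vocabulary = {"ball", "game", "teams", "government"}   # invented for illustration
document = "the game was a great game for both teams"

# Sparse count representation: word -> frequency, zeros never stored.
term_frequencies = Counter(w for w in document.split() if w in vocabulary)
print(term_frequencies)   # 'game' occurs twice, 'teams' once
```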
The count vectorizer captures more detail than the simpler binary vectorizer, but
it also has a limitation. Although the count vectorizer considers the frequency of
words occurring in a document, it does so irrespective of how rare or common each word
is. To overcome this limitation, the TfIdf (Term Frequency Inverse Document Frequency)
vectorizer can be used. The TfIdf vectorizer considers the inverse document frequency
(the distinguishability weight of a word) along with the frequency of each word
occurring in a document when forming the feature vector.
Let us say we have a document that contains a word ‘catch’ 10 times and the
word ‘baseball’ 2 times. Here, if we just used term frequency, we will give more weight
to the word ‘catch’ compared to the word ‘baseball’ since it occurs more frequently in
the document. However, the word ‘catch’ might frequently be occurring across multiple
categories whereas the word ‘baseball’ might be occurring in very few categories that
are related to sports or baseball. Thus, the word ‘baseball’ is a more distinguishing
feature in the document. In inverse document frequency, we determine the
distinguishability of the word which we then multiply with term frequency to get the
new weight of each word in the document. How TfIdf is calculated is shown below:
TfIdf(t, d) = TF(t, d) * IDF(t)
where one common formulation of the inverse document frequency is
IDF(t) = log(N / df(t)), with N the total number of documents and df(t) the number
of documents that contain the term t.
If a word occurs in fewer documents, its distinguishability, or in more technical
terms its IDF value, will be high, whereas a word that frequently occurs across
many documents will have a low IDF value. Thus, in TfIdf
the feature vector will not be based solely on the term frequency of each word but
will be the product of its term frequency and its IDF value.
5.6 Classification
For each kind of representation (Binary Vectorizer, Count Vectorizer, and TfIdf
Vectorizer) a Multinomial Naïve Bayes Classification model is generated. A
multinomial classifier is chosen because the feature set is multinomial in nature;
that is, each feature can take on a range of count values rather than just two. The
classifier makes use of the feature vector and, based on it, learns which class the
document belongs to. Thus, the machine learning model is trained on the feature
vectors of all the training documents along with their target classes.
After the classification model is built, the classifier is tested against the set
of test documents, which accounts for 40% of the documents. To handle test
documents that contain unknown words, Laplace smoothing is applied.
The Naïve Bayes classifier works on the assumption that all features are
independent, and thus it takes the product of the probabilities of each feature to
determine the likelihood of a given class. If any word occurs in a test document
but is not part of the vocabulary (that is, it never occurred in the training
documents), then the probability of that feature becomes 0. Since Naïve Bayes
multiplies the feature probabilities together, a single feature probability of 0
makes the entire result 0. Thus, even if the test document had significant and
discriminating features, their contribution would be lost, and the document would
be left uncategorized with a likelihood of 0.
For example, suppose we have a test document TD1 for which the classifier computes
P(You, won, billion, in, lottery | Spam) = 0.80, which is very high and means the
document is spam.
Now we get another test document, TD2, which is like the previous one except that
it additionally contains the word ‘dollars’, which is assumed not to be present in
the vocabulary. Here, our classifier will again calculate all the conditional
probabilities, which will yield 0.80 for spam, except for the word ‘dollars’, whose
conditional probability will be 0. Thus, the probability of the whole document
being ham or spam becomes 0. That is, P(You, won, billion, dollars, in, lottery |
Spam) = P(You, won, billion, dollars, in, lottery | Ham) = 0.
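The spam example above can be sketched as follows (assuming scikit-learn; the two-document training set is invented for illustration). With alpha=1.0, MultinomialNB applies exactly this Laplace smoothing:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented two-document training set for illustration.
train = ["you won billion in lottery", "meeting about the budget dollars"]
labels = ["spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(train)

# alpha=1.0 is Laplace (add-one) smoothing: every vocabulary word gets a
# pseudo-count of 1 in every class, so no conditional probability is exactly 0.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

# 'dollars' never occurred in a spam document; without smoothing its
# conditional probability under 'spam' would be 0 and would zero out the
# whole product. With smoothing, the spammy words still dominate.
test = vec.transform(["you won billion dollars in lottery"])
print(clf.predict(test)[0])  # 'spam'
```

One caveat: scikit-learn's CountVectorizer simply drops test-time words that never appeared in the training vocabulary, so in this toolkit the smoothing matters for vocabulary words that are unseen in a particular class rather than for fully out-of-vocabulary words.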
The classifier is built and tested for all the vectorizers, and the results are
discussed in the next section.
6 RESULTS
This section illustrates the results obtained with various settings from the most
basic approach to the most advanced used in the project work. Thus, the section also
corroborates the need for the experiments discussed and how they help in improving
the model.
The model for document classification is tested against a test set of documents.
The effectiveness of the model is judged by employing the metrics described below.
Accuracy = (1 – µ / N) * 100%
where µ is the number of misclassified test documents and N is the total number of
test documents.
The model is also tested for other metrics, namely Precision and Recall. Precision
(P) is defined as the number of true positives (Tp) over the number of true
positives plus the number of false positives (Fp):
P = Tp / (Tp + Fp)
Recall (R) is defined as the number of true positives over the number of true
positives plus the number of false negatives (Fn):
R = Tp / (Tp + Fn)
These quantities are also related to the F1 score, which is defined as the harmonic
mean of precision and recall:
F1 = 2 * P * R / (P + R)
All the metrics obtained range from 0 to 1, with 1 being the ideal and 0 being the
worst.
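As a sketch (assuming scikit-learn; the label arrays are made up for illustration), these metrics can be computed as:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]

# For the 'spam' class here: Tp = 2, Fp = 1, Fn = 1.
acc = accuracy_score(y_true, y_pred)                   # 3/5 correct = 0.6
p = precision_score(y_true, y_pred, pos_label="spam")  # Tp/(Tp+Fp) = 2/3
r = recall_score(y_true, y_pred, pos_label="spam")     # Tp/(Tp+Fn) = 2/3
f1 = f1_score(y_true, y_pred, pos_label="spam")        # harmonic mean = 2/3
print(acc, p, r, f1)
```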
The first two appendices illustrate the classification report of the Binary
vectorizer with and without stop words. The next two appendices describe the
classification report of the Count vectorizer with and without stop words.
Similarly, the last two appendices illustrate the classification report of the
TfIdf vectorizer with and without stop words.
7 DISCUSSION OF RESULTS
As discussed in the Approach and Method section of the report, the TfIdf vectorizer
performs better because, when weighing features, it considers the distinguishability
of each one, compared to just the frequency of terms in the count vectorizer or the
mere existence of a term in the binary vectorizer. This allows the TfIdf vectorizer
to represent the document more accurately.
One interesting observation from the results is that if stop words are not removed,
then the binary vectorizer does better than the count vectorizer: the binary
vectorizer got an accuracy of 67% compared to 62% for the count vectorizer. This is because a
document generally contains many stop words. Thus, if term frequency represents a
document, then the stop words are likely to become the most important feature of the
document. Since we know that stop words are never good discriminating criteria, it
results in misclassifications, leading to a drop in accuracy of predictions. On the other
hand, for the binary vectorizer, stop words are registered as just 1, that is, they exist in a
document. Thus, they do not overpower the significance of other discriminating words
that may be occurring in a document.
The TfIdf vectorizer performs well even when stop words are not removed. This is
because it considers the Idf (inverse document frequency) value of each term along
with the term's frequency. Thus, even if the term frequency of some stop word is
very high, the resulting TfIdf value of the term is pulled down by its low Idf
value, since TfIdf is the product of Tf (term frequency) and Idf (inverse document
frequency).
We would like to conclude that even though the binary vectorizer and the count
vectorizer work well, it is the TfIdf vectorizer that outperforms the two in terms
of both appropriate representation of documents and classification results. And
although the count vectorizer outperforms the binary vectorizer when stop words are
removed, it performs worse than the binary vectorizer when they are not.
8 CONCLUSION AND FUTURE WORK
In order to test the classifier against the null hypothesis, one could perform
hypothesis testing using a p-value. A p-value smaller than 0.05 indicates strong
evidence against the null hypothesis, in which case the alternative hypothesis can
be accepted.
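One possible sketch of such a test (assuming SciPy is available; the counts below are hypothetical, not taken from our results) is a binomial test of whether the classifier's accuracy exceeds chance:

```python
from scipy.stats import binomtest

# Hypothetical: 700 of 1000 test documents classified correctly.
# Null hypothesis: the classifier guesses among 20 balanced classes
# (as in the 20-newsgroup data), so its chance accuracy is 1/20.
result = binomtest(700, n=1000, p=1 / 20, alternative="greater")
print(result.pvalue < 0.05)  # True -> reject the null hypothesis
```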
We used just one algorithm, Naïve Bayes, for classification, as the main aim of our
project work was to analyze the different types of feature representations for
documents. As future work, we would suggest that researchers and students taking
the work forward try the different feature representation schemes mentioned with
other machine learning algorithms such as SVM, neural networks, Expectation
Maximization, decision trees, etc.
We have not included in the report another vectorizer that we tried: the Hashing
vectorizer. Although the hashing vectorizer did not perform as well as the TfIdf
vectorizer, it was significantly faster than all three vectorizer representations
mentioned in the report. It does not require the vocabulary to be present in memory
at all times, and thus it is space efficient as well. If a researcher wants to do
document classification in real time, we would suggest looking into the hashing
vectorizer.
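A minimal sketch of the hashing approach (assuming scikit-learn):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer maps each token to a column index via a hash function,
# so no vocabulary is ever built or kept in memory; unseen documents can be
# transformed on the fly, which suits real-time classification.
vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vec.transform(["document classification in real time"])
print(X.shape)  # (1, 262144), sparse, with only a handful of non-zeros
```

The trade-off is that hash collisions can merge unrelated words into one feature, and the mapping cannot be inverted to recover which word a column represents.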
9 PROJECT SCHEDULE
The whole project took around 3 to 4 months. Within this schedule, all the tasks
mentioned in the Approach and Method section of the report were carried out. A very
preliminary Naïve Bayes classifier using the Binary vectorizer was generated by the
end of the fifth week. The rest of the tasks were completed in a span of 12 weeks,
after which this report was written. A more detailed schedule is elaborated in the
table below.
APPENDICES