Minor Research Project Report
Submitted to
Principal Investigator
R. Vasanth Kumar Mehta
Asst. Professor
Department of Computer Science & Engineering
Acknowledgement
INDEX
1 Introduction
2 Chapter 1 – Algorithm 1
3 Chapter 2 – Algorithm 2
4 Annexures
CHAPTER 1
INTRODUCTION
1. Very large datasets - Challenge in Data Mining
Figure 1. Map-Reduce Programming Model.
1. The computation takes a set of input key/value pairs, and produces a set of
output key/value pairs. The user of the MapReduce library expresses the
computation as two functions: Map and Reduce.
2. Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs. The MapReduce library groups together all
intermediate values associated with the same intermediate key I and passes
them to the Reduce function.
3. The Reduce function, also written by the user, accepts an intermediate key I
and a set of values for that key. It merges together these values to form a
possibly smaller set of values. Typically, just zero or one output value is
produced per Reduce invocation.
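The model described above can be illustrated with a minimal in-memory sketch (plain Python rather than the actual MapReduce library, with word counting as a stand-in computation; in a real deployment the phases run distributed across machines):

```python
from collections import defaultdict

def map_fn(key, value):
    # Map: an input pair (filename, text) -> intermediate (word, 1) pairs.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: merge all values for one intermediate key into one output value.
    return (key, sum(values))

def map_reduce(inputs):
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ik, iv in map_fn(key, value):
            intermediate[ik].append(iv)   # group values by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

counts = map_reduce([("doc1", "big data big cluster"), ("doc2", "big data")])
# counts: {"big": 3, "data": 2, "cluster": 1}
```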
3. Project objectives
(a) To design and develop parallel algorithms for data mining of very large
datasets
(b) To study the performance of the algorithms and compare them with
the sequential implementations
Two specific areas were chosen for the design and implementation of the parallel
algorithms and analysis:
a. Feature selection using distributed ensemble classifiers for very large
datasets
b. Sentence-level Document clustering for Micro-level Contradiction Analysis
The above two problems were chosen in view of the huge volumes of data involved,
which are typically difficult to process on a single node and inherently suited for
processing on a cluster of machines.
Two independent algorithms were designed, implemented, and evaluated.
Experiments were carried out with specific datasets and the outcomes were
recorded. The findings were published in two conferences for validation and
acceptance by the wider academic and research community. This report contains a
description of the problem statements, the methodology used, the experimental
data and the outcomes of both experiments.
CHAPTER 2
ALGORITHM 1
A. Introduction
Advances in data collection and storage capabilities during recent times have led to
an information overload. The number and size of Very Large Datasets is
constantly on the rise. Such datasets, in contrast with smaller, more traditional
datasets that have been studied extensively in the past, present new challenges in
data analysis. Traditional statistical methods break down partly because of the
increase in the number of observations, but mostly because of the increase in the
number of variables associated with each observation. The dimension of the data
is the number of variables that are measured on each observation.
While the increasing dimensions and size of datasets pose a new challenge
to data miners, the solution to this challenge is available in two forms – one, in the
form of high performance computing capabilities through grids and clusters, and
second, in the form of data pre-processing techniques like feature selection that
seek to exploit the fact that not all the measured variables are of equal importance
for representing the underlying phenomena of interest.
The decision tree is a hierarchical model for supervised learning in which the
local region is identified through a sequence of recursive splits. To overcome the
drawback of the attribute-independence assumption in the Naïve Bayes method,
Hall proposed a filter method that assigns each attribute a weight based on the
degree to which it depends on the values of other attributes. The assumption made
is that the weight assigned to a predictive attribute should be inversely related to
the degree of dependency it has on other attributes. In the proposed algorithm, the
above mechanism is adapted to compute the dependencies of the features in a
distributed computing environment, namely Hadoop.
C. Description of Algorithm
Random samples are drawn from the training set, creating as many sample sets as
there are data nodes in the Hadoop cluster. Each node in the
cluster has a different sample set and will execute the following operation for
dependency computation to estimate the degree to which an attribute depends on
others:
a. build an unpruned decision tree from the node's sample set.
b. note the minimum depth d at which the attribute is tested in the tree. The
deeper an attribute occurs in a tree, the more dependent it is. The depth of the root
is assumed to be 1 and the dependency D of an attribute can be measured using the
Eq.(1): D = 1 / d (1)
where d is the minimum depth at which the attribute is tested. Nodes with more
dependency will have a lesser D value and nodes with lesser dependency will have
a larger D value. If an attribute is not tested, we can assume its dependency as
zero.
c. Make a vector of the dependencies of all the attributes. This will be the
output of each node.
The vectors of estimated dependencies of the attributes from each of the nodes are
now combined by averaging them. This process of averaging helps in offsetting
any bias caused by data sampling. This averaged estimated dependency vector is
now an indicator of the relative dependencies of the attributes. At this stage, we
can take the number of features n to be selected / discarded as an input from the
user and accordingly, select either the top n attributes or the bottom n attributes as
per the sequence in which they occur in the averaged estimated dependency vector.
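The averaging and selection steps above can be sketched as follows (a single-machine illustration; the per-node decision-tree induction that produces each dependency vector is omitted, and the sample vectors shown are hypothetical):

```python
def average_dependency(vectors):
    # Average the per-node dependency vectors (D = 1/d per attribute,
    # 0 if never tested) to offset bias caused by data sampling.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_features(avg, k, keep_top=True):
    # Rank attributes by averaged dependency and return the indices of
    # the k attributes to keep (top) or to discard (bottom).
    order = sorted(range(len(avg)), key=lambda i: avg[i], reverse=keep_top)
    return sorted(order[:k])

# Hypothetical dependency vectors from three data nodes (4 attributes).
node_vectors = [
    [1.0, 0.5, 0.0, 0.33],   # node 1: attribute 0 tested at the root
    [1.0, 0.33, 0.5, 0.0],   # node 2
    [0.5, 1.0, 0.0, 0.33],   # node 3
]
avg = average_dependency(node_vectors)
top2 = select_features(avg, 2)   # indices of the two top-ranked attributes
```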
D. Experimental Results
The algorithm was implemented and tested on the Glass Dataset and the Diabetes
Dataset. The experiments were carried out on a Hadoop Cluster with 3 data nodes.
MapReduce programming was used to implement the feature selection algorithm.
The Glass Dataset has 9 attributes and 214 instances. On applying the proposed
method using the C4.5 decision tree algorithm without pruning, dependency values
were obtained on each of the three data nodes. The features were re-ordered based
on the estimated dependency, the most insignificant features were removed, and
the impact of this feature removal was measured by carrying out classification on
the reduced dataset.
The above steps were repeated on the Diabetes Dataset.
The detailed results are provided in the enclosed research paper.
E. Conclusion
The results showed that the algorithm significantly reduces the tree size while
increasing the average precision and recall. Hence, the experimental results
demonstrate that the proposed solution can significantly reduce the computing
effort for performing data mining on very large datasets, along with an increase in
accuracy as indicated by the higher precision and recall values obtained on the
reduced dataset.
F. References
[1] Molina, L.C., Belanche, L., Nebot, A., 2002. Attribute Selection Algorithms: A
    Survey and Experimental Evaluation. Proceedings of 2nd IEEE KDD, pp. 306-313.
[2] Guyon, I., Elisseeff, A., 2003. An Introduction to Variable and Feature
    Selection. Journal of Machine Learning Research 3, pp. 1157-1182.
[3] Kohavi, R., John, G.H., 1997. Wrappers for feature subset selection. Artificial
    Intelligence 97, pp. 273-324.
[4] Yu, L., Liu, H., 2003. Feature Selection for High-Dimensional Data: A Fast
    Correlation-based Filter Solution. Proceedings of ICML 2003, pp. 856-863.
[5] Hall, M., 2007. A decision tree-based attribute weighting filter for naive Bayes.
    Knowledge-Based Systems, Volume 20, Issue 2, March 2007, pp. 120-126.
    ISSN 0950-7051.
[6] John, G.H., Langley, P., 1995. Estimating Continuous Distributions in Bayesian
    Classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence,
    San Mateo, pp. 338-345.
[7] Dean, J., Ghemawat, S., 2008. MapReduce: Simplified Data Processing on Large
    Clusters. Communications of the ACM, Volume 51, Issue 1, pp. 107-113.
    ISSN 0001-0782.
[8] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.,
    2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations,
    Volume 11, Issue 1.
[9] German, B., Spiehler, V., 1987. UCI Machine Learning Repository
    [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
    Information and Computer Science.
[10] Quinlan, J.R., 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann.
[11] Kahn, M. UCI Machine Learning Repository
    [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
    Information and Computer Science.
G. Publication
Based on the above work, a research paper titled “Feature Selection using
Distributed Ensemble Classifiers for Very Large Datasets” was written by
the principal investigator with the guidance and co-authorship of Dr. S.
Rajalakshmi, Head, CSE Department, SCSVMV University. The paper was
presented at the 4th International Conference on Electronics Computer
Technology (ICECT 2012), held at Kanyakumari from 6-8 April 2012, and is
archived in IEEE Xplore and indexed by Ei Compendex and ISI.
The full text of the paper is enclosed as an annexure for detailed reference. The
financial support provided to carry out the research is acknowledged in the
research paper.
CHAPTER 3
ALGORITHM 2
Fuzzy-based
Sentence-level Document Clustering
for
Micro-level Contradiction Analysis
A. Introduction
The general objective of document clustering is to form groups of documents with
similar themes. In discovering this similarity, the usual approach is to develop a
term-document matrix listing all the terms that are present document-wise, and
finding similarity of documents based on this matrix. While dimension reduction is
imperative for efficiently manipulating the massive quantity of data, to be useful,
this lower dimensional representation must be a good approximation of the
original document set given in its full space. This does not hold good for
documents that have contradictory statements within themselves. For instance, in
the healthcare domain, one report may contain the opinions of three doctors, and a
second report may contain the opinions of other doctors. The contradictory
opinions of the doctors in the two reports may nullify each other's effects and
result in a wrong clustering of the two documents.
C. Description of Algorithm
1. Standard pre-processing operations for data cleaning, dimension reduction
(like stemming, removal of stop words etc.)
2. For each document Di, i=1..n containing words Wi1,Wi2….Wik
a. For each sentence in Di, compare its similarity with every sentence
of Dj, j= i+1..n using a similarity measure and save the result in a
matrix.
b. Depict the sentence similarity visually using the suggested
visualization scheme.
c. Generate a list of absolutely similar sentences and put them in
separate clusters
d. Generate the list of most contradictory sentences and put them in a
separate cluster
e. Evaluate the similarity of the two documents using the fuzzy
threshold and classify the pair as similar documents, contradictory
documents or documents with no relationship, with a percentage of
similarity/contradiction.
f. Use the lists generated in steps c and d for all the subsequent
comparisons between the next set of documents, to avoid repetition
of comparisons.
g. Display the sentence similarities between documents
3. Perform document clustering using agglomerative clustering or any other
similar technique.
4. Display the document clusters, themes as well as contradictory documents.
Similarity Measurement
Every word in a sentence is matched against every other word in the other sentence
to find its best matching score. Then the sum of all the matching scores is taken,
and finally the sum is normalized by the total number of words in the two
sentences.
The content similarity can be quantified by the following formula:
CS(S1, S2) = ( Σ_{a ∈ S1} max_{b ∈ S2} w(a, b) + Σ_{b ∈ S2} max_{a ∈ S1} w(a, b) ) / (s1 + s2)
where s1 and s2 are the total counts of words in the two sentences and w(a, b)
∈ [0, 1] is a term similarity function, which is defined as follows:
w(a, b) = 1, if a is the same as b
        = sim(a, b) otherwise, where sim(a, b) is the semantic term similarity
          given by the WordNet Similarity tool.
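The content similarity measure can be sketched as follows (a minimal illustration: the WordNet-based sim(a, b) is replaced here by a stand-in that scores exact matches only, and a real semantic similarity function is assumed for actual use):

```python
def term_sim(a, b):
    # Placeholder for the WordNet semantic term similarity sim(a, b).
    return 1.0 if a == b else 0.0

def content_similarity(s1, s2):
    # Each word is matched against every word of the other sentence to
    # find its best matching score; the summed scores are normalized by
    # the total number of words in the two sentences.
    w1, w2 = s1.lower().split(), s2.lower().split()
    best1 = sum(max(term_sim(a, b) for b in w2) for a in w1)
    best2 = sum(max(term_sim(b, a) for a in w1) for b in w2)
    return (best1 + best2) / (len(w1) + len(w2))
```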
Even prior to the above step, the two sentences are checked to see whether they
conform to the same sentiment or have opposite sentiments, by identifying any
opposites present in them. For example, a sentence “it is easy to ride this path”
when compared with a sentence “the road is tough to drive” will be taken as
sentences with opposite polarities. In such cases, the opposite terms, namely easy
and tough will be removed and the similarity of the sentences will be compared as
described above. The resultant similarity value CS, which lies in the range of [0,1]
will be taken with the opposite sign i.e. with the negative sign. This will help
identify the most contradictory statements.
At the end of this similarity comparison, we get a matrix where every sentence is
matched with every other sentence in the two documents and their content
similarity now lies in the range [-1,1]. The content similarity can now be
interpreted as follows:
CS(S1, S2) < 0 => Sentences S1, S2 are contradictory
CS(S1, S2) = 0 => Sentences S1, S2 are unrelated
CS(S1, S2) > 0 => Sentences S1, S2 are similar
The similarity increases as we move towards 1 and the contradiction increases as
we move towards -1. The similarity/contradiction can be represented visually on a
number line running from -1 (contradiction) through 0 to 1 (similarity).
At this stage, we build an absolute similarity cluster to which all the absolutely
similar sentences (i.e. with CS=1) are added along with the Document id. A similar
absolute contradiction cluster is built for sentences with CS=-1.
As an additional step to present the results in an easily interpretable manner, a
visualization of the similarity measurements between the sentences with a coloring
scheme is suggested where the content similarity value is converted to an RGB
value such that the most contradictory statements are indicated in red, neutral ones
are yellow and similarity increases from yellow to green, with most similar
sentences depicted in dark green. The formula to compute colour is as follows:
[R, G, B] = [255, 255 - 255*|CS|, 0], if CS < 0
          = [255 - 255*CS, 255, 0], if CS > 0
          = [255, 255, 0], if CS = 0
where CS represents the content similarity between the two sentences. This maps
the most contradictory pair (CS = -1) to red, a neutral pair (CS = 0) to yellow and
the most similar pair (CS = 1) to green, fading linearly in between.
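One continuous mapping consistent with the described red–yellow–green scheme can be sketched as follows (an illustrative implementation, not taken verbatim from the paper):

```python
def cs_to_rgb(cs):
    # Map content similarity cs in [-1, 1] to a colour: red for the most
    # contradictory pair, yellow for a neutral pair, green for the most
    # similar pair, fading linearly in between.
    if cs < 0:
        return (255, int(255 * (1 - abs(cs))), 0)   # red -> yellow
    if cs > 0:
        return (int(255 * (1 - cs)), 255, 0)        # yellow -> green
    return (255, 255, 0)                            # neutral: yellow
```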
Every sentence is paired up with every other statement, helping us identify a
sentence’s most similar or contradictory statement very easily.
Once the sentence-level relationships have been measured in detail, the next step is to
consolidate the same to infer about the relationship between the two documents. This is
computed as follows:
DS = (1/n) * Σ_{k=1}^{n} CS(S_k, PS_k)
where DS is the document similarity,
n is the number of sentences in the larger document,
S_k represents a sentence and
PS_k represents its paired sentence. A paired sentence is the most similar or most
contradictory sentence from the other document.
The above summation consolidates the effect of the individual sentences in the two
documents. By virtue of normalization by dividing this summation with n, DS, the
Document Similarity will lie in the range of [-1, 1] as the value of CS lies in the range of
[-1, 1] as shown earlier. This Document Similarity Matrix will be of the following form:
      D1    D2    D3    D4
D1     1
D2     0.8   1
D3    -0.7   0.4   1
D4     0.2   0.1  -0.8   1
The above matrix is symmetric and hence, only the values below the diagonal are shown.
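The consolidation into DS can be sketched as follows (assuming the pairwise sentence CS values have already been computed, and taking the paired sentence to be the one with the largest-magnitude CS in each row):

```python
def document_similarity(cs_matrix):
    # cs_matrix[i][j]: content similarity CS in [-1, 1] between sentence i
    # of the larger document and sentence j of the other document.
    n = len(cs_matrix)
    total = 0.0
    for row in cs_matrix:
        # Paired sentence: the most similar or most contradictory match,
        # i.e. the CS value of largest magnitude in the row.
        total += max(row, key=abs)
    return total / n   # normalization by n keeps DS in [-1, 1]
```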
The relationship between the documents can be graphically represented by placing
each document pair on a number line from -1 to 1 according to its DS value:
(D3,D4) and (D1,D3) lie toward the contradiction end, (D2,D4), (D1,D4) and
(D2,D3) lie near the middle, and (D1,D2) lies toward the similarity end.
Fuzziness is inherent in classifying two documents as contradictory or similar ones.
Hence, we use a fuzzy inference scheme where the following thresholds are set to
interpret the obtained value of Document Similarity DS between two documents:
-1 <= DS < -0.8  => Documents are highly contradictory
-0.8 <= DS < -0.3 => Documents are fairly contradictory
-0.3 <= DS < 0   => Documents are slightly contradictory
DS = 0           => Documents have no relationship
0 < DS <= 0.3    => Documents are slightly similar
0.3 < DS <= 0.8  => Documents are fairly similar
0.8 < DS <= 1    => Documents are highly similar
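The threshold scheme above can be written directly as a small interpretation function (a sketch; the placement of the boundary cases at the exact threshold values is a judgment call):

```python
def interpret_ds(ds):
    # Interpret document similarity DS in [-1, 1] using the fuzzy
    # thresholds described above.
    if ds == 0:
        return "no relationship"
    if ds < 0:
        mag = -ds
        if mag > 0.8:
            return "highly contradictory"
        if mag > 0.3:
            return "fairly contradictory"
        return "slightly contradictory"
    if ds > 0.8:
        return "highly similar"
    if ds > 0.3:
        return "fairly similar"
    return "slightly similar"
```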
The absolute value of DS can be interpreted as the percentage of similarity between any
two documents. Again, we could use the visualization scheme suggested earlier for the
sentence level similarity depiction to visualize the Document Similarity also.
The above scheme is used to compare the document similarity to perform the document
clustering using some standard technique like agglomerative clustering iteratively as
follows:
Using the Document Similarity Matrix shown above, the two most similar documents are
combined to form a cluster and the process is repeated till no more elements are left for
combination. For example
Step 1: The documents D2 and D1 are combined since they have max. similarity of 0.8.
Step 2: In the next step, the average similarity of other documents from D2 and D1 is
computed. For example, D3 to {D1, D2} is -0.15. D4 to {D1, D2} is 0.15. D3 to D4 is -
0.8. Among these three, we find that D4 is similar to {D1, D2}. Hence, D4 is combined.
Step 3: In the next step, when we compare D3, we find that it is contradictory. Hence, we
treat it as a separate cluster.
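The worked example above can be sketched as average-linkage agglomerative clustering over the document similarity matrix (a minimal single-machine illustration; merging stops when the best available average similarity is no longer positive):

```python
def cluster_documents(sim, names):
    # sim: symmetric matrix as a dict of dicts, sim[a][b] in [-1, 1].
    clusters = [[n] for n in names]

    def avg_sim(c1, c2):
        # Average similarity between two clusters (average linkage).
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(sim[a][b] for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        if avg_sim(clusters[i], clusters[j]) <= 0:
            break  # remaining clusters are unrelated or contradictory
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

On the matrix shown earlier this reproduces the three steps: D1 and D2 merge first, D4 joins them, and D3 remains a separate contradictory cluster.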
D. Analysis
The dominant step in the algorithm is the sentence-level clustering step, where
every sentence in a document is compared with every other sentence from all the
other documents. Let Wi denote the total number of words in document Di, and let
there be n documents. The first document is compared with the remaining n-1
documents, the second with n-2, the third with n-3 and so on. In each comparison
between two documents, the number of word comparisons involved is the product
of the word counts of the two documents.
Hence, the number of comparisons
- for the first document will be W1 (W2 + W3 + ... + Wn), where n is the number of
documents;
- for the second document will be W2 (W3 + W4 + ... + Wn), and so on.
In general, for document i, the number of comparisons will be
Wi * Σ_{j=i+1}^{n} Wj
The overall number of comparisons for all documents will be
Σ_{i=1}^{n-1} ( Wi * Σ_{j=i+1}^{n} Wj )
under the domain of Data Quality Mining, which refers to the use of Data
Mining Techniques to enhance the quality of data, especially in the
preprocessing stage. Further, this same preprocessing task can be applied to
several other text mining tasks like summarization etc. This preprocessing
task can be done in parallel for all the documents.
c. With a marginal loss of accuracy, we can approximate the criterion for
formation of the absolutely similar or contradictory clusters. While we have
considered two sentences to be absolutely similar only if their Content
Similarity is =1 and to be absolutely contradictory only if their Content
Similarity is =-1, we can relax this to the extent required, leading to more
sentences being clustered as absolutely similar or contradictory ones. This
leads to an increase in speed, with a corresponding loss of accuracy.
d. Other indirect means can be applied to enhance the turnaround time of the
algorithm. The algorithm can be easily adapted to perform distributed
matching using some distributed programming techniques like Hadoop’s
map-reduce and hence, high performance computing systems can be
exploited to obtain quick results.
E. Conclusion
The experimental results have shown that the algorithm is quite successful in
finding the fully-correct and fully-wrong answers. In most of the cases, it follows
the general trends as indicated by the manual evaluation process. The suggested
algorithm has multiple benefits as indicated earlier for analysis of text documents
as well as for preprocessing text documents for further analysis. Tasks that can be
performed include document summarization, contradiction analysis, and visual
depiction of intra- and inter-document similarities. The algorithm can be further
extended to work in a distributed environment.
F. References
[1] Dhillon, I., et al., 2004. Feature selection and document clustering. Survey of
    Text Mining: Clustering, Classification and Retrieval, Vol. 1, pp. 74-100.
[2] Shahnaz, F., Berry, M.W., Pauca, V.P., 2006. Document clustering using
    nonnegative matrix factorization. Information Processing & Management,
    Volume 42, Issue 2, pp. 373-386.
[3] Liu, H., Motoda, H., 2010. Feature Selection – An Ever Evolving Frontier in
    Data Mining. Journal of Machine Learning Research, Workshop and
    Conference Proceedings.
[4] Paul, M.J., Zhai, C., Girju, R., 2010. Summarizing contrastive viewpoints in
    opinionated text. In Proceedings of the 2010 Conference on Empirical Methods
    in Natural Language Processing (EMNLP '10), Association for Computational
    Linguistics, Stroudsburg, PA, USA, pp. 66-76.
[5] Kim, H.D., Zhai, C., 2009. Generating Comparative Summaries of Contradictory
    Opinions in Text. In Proceedings of CIKM '09, November 2-6, 2009,
    Hong Kong, China.
[6] Achananuparp, P., Hu, X., Shen, X., 2008. The evaluation of sentence
    similarity measures. In DaWaK '08: Proceedings of the 10th International
    Conference on Data Warehousing and Knowledge Discovery, Springer-Verlag,
    pp. 305-316.
[7] Ross, T.J., 2008. Fuzzy Logic with Engineering Applications, Second Edition.
    Wiley India.
[8] Liu, H., 2005. Toward integrating feature selection algorithms for classification
    and clustering. IEEE Transactions on Knowledge and Data Engineering.
[9] Hipp, J., Guntzer, U., et al., 2001. Data Quality Mining – Making a Virtue of
    Necessity. In Proc. of the SIGMOD DMKD Workshop.
[10] Dean, J., 2008. MapReduce: simplified data processing on large clusters.
    Communications of the ACM – 50th anniversary issue: 1958-2008, ACM,
    New York, NY, pp. 19-33. DOI=
    http://doi.acm.org/10.1145/1327452.1327
G. Publication
Based on the above work, a research paper titled “An algorithm for fuzzy-
based Sentence-level Document Clustering for Micro-level Contradiction
Analysis” was written by the principal investigator with the guidance and
co-authorship of Dr. S. Rajalakshmi, Head, CSE Department, SCSVMV University.
The paper was presented at the International Conference on Advances in
Computing, Communications and Informatics held at RMK Engineering College,
Chennai, from 2-4 Aug. 2012, and is indexed in the ACM Digital
Library (http://dl.acm.org/citation.cfm?id=2345413).
The full text of the paper is enclosed as an annexure for detailed reference. The
financial support provided to carry out the research is acknowledged in the
research paper.
ANNEXURES