
Report on

Minor Research Project


on

Design and Development of


Parallel Data Mining Algorithms
using Hadoop

Submitted to

Sri Chandrasekharendra Saraswathi Viswa Maha Vidyalaya


(University established under Section 3 of the UGC Act, 1956)
Enathur, Kanchipuram – 631561, Tamilnadu.

Principal Investigator
R. Vasanth Kumar Mehta
Asst. Professor
Department of Computer Science & Engineering

Acknowledgement

I thank the University authorities for providing encouragement and financial


assistance to carry out this Minor Research Project. I thank the Dean,
Engineering and Technology for his advice. My Research Guide and Head
of the Department Dr. S. Rajalakshmi has provided vital inputs, especially in
writing the research papers. I thank her for the same. My thanks to other
colleagues and technical staff of the CSE Department who have helped me in
carrying out this project.

INDEX

S.No.  Topic

1      Chapter 1 – Introduction

2      Chapter 2 – Algorithm 1

3      Chapter 3 – Algorithm 2

4      Annexures

CHAPTER 1

INTRODUCTION

1. Very large datasets - Challenge in Data Mining

Data Mining is the process of nontrivial extraction of implicit, previously


unknown, and potentially useful information from data. A common feature
of most data mining tasks is that they are resource intensive and operate on
large sets of data. Data sources measuring in gigabytes or terabytes are now
quite common in data mining. For example, WalMart records 20 million
transactions and AT&T produces 275 million call records every day. This
calls for fast data mining algorithms that can mine huge databases in a
reasonable amount of time.

However, despite the many algorithmic improvements proposed in serial algorithms, the large size and dimensionality of many databases make data mining tasks too slow and too large to run on a single-processor machine. There is therefore a growing need to develop efficient parallel data mining algorithms that can run on distributed systems.

2. Distributed Data Mining – A Solution

Hadoop is an open-source, Java-based software platform developed by the Apache Software Foundation. It lets one easily write and run distributed applications on large computer clusters to process vast amounts of data.
Hadoop implements Google’s MapReduce programming model on top of a
distributed file system called the Hadoop Distributed File System (HDFS).
MapReduce divides applications into many small blocks of work. HDFS
creates multiple replicas of data blocks for reliability, placing them on
compute nodes around the cluster. MapReduce can then process the data
where it is located.

Figure 1. Map-Reduce Programming Model.

Dean and Ghemawat, the authors of MapReduce, describe the programming model as follows:

1. The computation takes a set of input key/value pairs, and produces a set of
output key/value pairs. The user of the MapReduce library expresses the
computation as two functions: Map and Reduce.
2. Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs. The MapReduce library groups together all
intermediate values associated with the same intermediate key I and passes
them to the Reduce function.
3. The Reduce function, also written by the user, accepts an intermediate key I
and a set of values for that key. It merges together these values to form a
possibly smaller set of values. Typically, just zero or one output value is
produced per Reduce invocation.
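As a concrete illustration of this model (not part of the project code itself), a minimal word-count job written against the standard Hadoop Java API might look as follows; the class names are illustrative only.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit an intermediate (word, 1) pair for every word in an input line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: all intermediate values for the same word arrive together and are summed,
    // producing one output value per key.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }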

3. Project objectives

The objectives of the project are:

(a) to implement parallel data mining algorithms on Hadoop to facilitate the analysis of massive data sets, such as those generated in the telecom industry, and to produce results such as customer churn details, and

(b) to study the performance of the algorithms and compare them with the sequential implementations.

Two specific areas were chosen for the design and implementation of the parallel
algorithms and analysis:
a. Feature selection using distributed ensemble classifiers for very large
datasets

b. Sentence-level Document clustering for Micro-level Contradiction Analysis

The above two problems were chosen in view of the huge size of the data involved, which is typically difficult to process on a single node and is inherently suited to processing on a cluster of machines.

4. Outcome and deliverables

Two independent algorithms were designed, developed and implemented. Experiments were carried out with specific datasets and the outcomes were recorded. The findings were published in two conferences for validation and acceptance by the wider academic and research community. This report contains a description of the problem statement, the methodology used, the experimental data and the outcomes of both experiments.

CHAPTER 2

ALGORITHM 1

Feature Selection using


Distributed Ensemble Classifiers
for
Very Large Datasets

A. Introduction
Advances in data collection and storage capabilities in recent times have led to an information overload. The number and size of Very Large Datasets are constantly on the rise. Such datasets, in contrast with the smaller, more traditional datasets that have been studied extensively in the past, present new challenges in data analysis. Traditional statistical methods break down partly because of the increase in the number of observations, but mostly because of the increase in the number of variables associated with each observation. The dimension of the data is the number of variables measured on each observation.
While the increasing dimensionality and size of datasets pose a new challenge to data miners, the solution to this challenge is available in two forms: one, in the form of high-performance computing capabilities through grids and clusters, and the other, in the form of data pre-processing techniques like feature selection, which exploit the fact that not all the measured variables are of equal importance for representing the underlying phenomena of interest.

B. Abstract of Proposed Solution


The proposed algorithm blends these two solutions to the grand challenge of mining very large datasets through a process of feature selection implemented on a cluster of systems.

The decision tree is a hierarchical model for supervised learning in which a local region is identified through a sequence of recursive splits. To overcome the drawback of the attribute-independence assumption in the Naïve Bayes method, Hall proposed a method that weights attributes according to the degree to which they depend on the values of other attributes. Hall presented a filter method that sets attribute weights for use with naive Bayes, under the assumption that the weight assigned to a predictive attribute should be inversely related to the degree of dependency it has on other attributes. In the proposed algorithm, this mechanism is adapted to compute the dependencies of the features in a distributed computing environment, namely Hadoop.

C. Description of Algorithm
Random samples are drawn from the training set, and as many sample sets are created as there are data nodes in the Hadoop cluster. Each node in the cluster receives a different sample set and executes the following operations to estimate the degree to which an attribute depends on others:

a. construct an unpruned decision tree from the training data

b. note the minimum depth d at which the attribute is tested in the tree. The
deeper an attribute occurs in a tree, the more dependent it is. The depth of the root
is assumed to be 1 and the dependency D of an attribute can be measured using the
Eq.(1): D = 1 / d (1)

where d is the minimum depth at which the attribute is tested. Attributes with more dependency will have a smaller D value and attributes with lesser dependency will have a larger D value. If an attribute is not tested at all, its dependency is taken as zero.

c. Make a vector of the dependencies of all the attributes. This will be the
output of each node.

The vectors of estimated dependencies of the attributes from each of the nodes are
now combined by averaging them. This process of averaging helps in offsetting
any bias caused by data sampling. This averaged estimated dependency vector is
now an indicator of the relative dependencies of the attributes. At this stage, we
can take the number of features n to be selected / discarded as an input from the
user and accordingly, select either the top n attributes or the bottom n attributes as
per the sequence in which they occur in the averaged estimated dependency vector.
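As an illustrative sketch only (not the project's actual implementation), the dependency computation of Eq.(1) and the combining step could be expressed in plain Java as follows, assuming the minimum test depth of each attribute has already been extracted from the unpruned tree built on each node's sample; the final ranking step is one assumption of how the top n / bottom n selection could be realized.

    import java.util.Arrays;
    import java.util.Comparator;

    public final class DependencyVectors {

        // minDepths[a] is the minimum depth at which attribute a is tested in the
        // unpruned tree built on one node's sample; 0 means it is never tested.
        static double[] dependencyVector(int[] minDepths) {
            double[] d = new double[minDepths.length];
            for (int a = 0; a < minDepths.length; a++) {
                d[a] = (minDepths[a] > 0) ? 1.0 / minDepths[a] : 0.0;   // Eq.(1): D = 1/d
            }
            return d;
        }

        // Average the per-node dependency vectors element-wise to offset sampling bias.
        static double[] average(double[][] perNodeVectors) {
            double[] avg = new double[perNodeVectors[0].length];
            for (double[] v : perNodeVectors) {
                for (int a = 0; a < avg.length; a++) {
                    avg[a] += v[a] / perNodeVectors.length;
                }
            }
            return avg;
        }

        // Attribute indices ordered by averaged dependency (ascending); the top n or
        // bottom n of this ordering can then be selected or discarded.
        static Integer[] rank(double[] avg) {
            Integer[] idx = new Integer[avg.length];
            for (int i = 0; i < avg.length; i++) idx[i] = i;
            Arrays.sort(idx, Comparator.comparingDouble(i -> avg[i]));
            return idx;
        }
    }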

D. Experimental Results
The algorithm was implemented and tested on the Glass Dataset and the Diabetes
Dataset. The experiments were carried out on a Hadoop Cluster with 3 data nodes.

MapReduce programming is used for implementing the Feature Selection algorithm
as follows:

A. In the Map Phase


a. Collect a random sample from the training dataset into memory from a SequenceFile.
b. Build an unpruned decision tree on the sample.
c. Compute the dependency value D for each attribute and generate the Dependency Vector.
d. Write the Dependency Vector to the filesystem.

The SequenceFile is a Hadoop-specific binary file format that supports compression. It is optimized for passing data from the output of one MapReduce job to the input of another.

B. In the Reduce Phase


1) Iterate over each Dependency Vector and calculate the average vector.
2) Save the vector into a SequenceFile.
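A sketch of how the Reduce phase could be written with the Hadoop Java API is given below. The serialization of a dependency vector as comma-separated values in a Text object, and the use of a single common key, are assumptions made only to keep the sketch self-contained; the actual implementation reads and writes SequenceFiles as described above.

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce phase sketch: all per-node dependency vectors arrive under one key
    // and are averaged element-wise to produce the final averaged vector.
    public class AverageVectorReducer extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double[] sum = null;
            int count = 0;
            for (Text value : values) {
                String[] parts = value.toString().split(",");
                if (sum == null) sum = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    sum[i] += Double.parseDouble(parts[i]);
                }
                count++;
            }
            StringBuilder avg = new StringBuilder();
            for (int i = 0; sum != null && i < sum.length; i++) {
                if (i > 0) avg.append(',');
                avg.append(sum[i] / count);   // averaged dependency of attribute i
            }
            context.write(NullWritable.get(), new Text(avg.toString()));
        }
    }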

The Glass Dataset has 9 attributes and 214 instances. On applying the proposed method using the C4.5 decision tree algorithm without pruning, dependency values were obtained on each of the three data nodes. The features were re-ordered based on the estimated dependency, the most insignificant features were removed, and the impact of this feature removal was measured by carrying out classification on the reduced data set.
The above steps were repeated on the Diabetes Dataset.
The detailed results are provided in the enclosed research paper.

E. Conclusion
The results showed that the algorithm, while increasing the average precision and recall, significantly reduces the tree size. Hence, the experimental results demonstrate that the proposed solution can be used to significantly reduce the computing effort for performing data mining on very large data sets, along with an increase in accuracy, as indicated by the higher precision and recall obtained on the reduced dataset.

F. References
[1] Molina, L.C., Belanche, L., Nebot, A., 2002, Attribute Selection Algorithms: A survey and experimental evaluation. Proceedings of the 2nd IEEE KDD, pp 306-313.
[2] Guyon, I., Elisseeff, A., 2003, An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, pp 1157-1182.
[3] Kohavi, R., John, G.H., 1997, Wrappers for feature subset selection. Artificial Intelligence 97, pp 273-324.
[4] Yu, L., Liu, H., 2003, Feature Selection for High-Dimensional Data: A Fast Correlation-based Filter Solution. Proc. Int. Conference, ICML 2003, pp 856-863.
[5] Mark Hall, A decision tree-based attribute weighting filter for naive Bayes,
Knowledge-Based Systems, Volume 20, Issue 2, March 2007, Pages 120-126,
ISSN 0950-705.
[6] George H. John, Pat Langley: Estimating Continuous Distributions in Bayesian
Classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence,
San Mateo, 338-345, 1995.
[7] Jeffrey Dean and Sanjay Ghemawat , MapReduce: Simplified Data Processing
on Large Clusters. Communications of the ACM (2008),
Volume: 51, Issue: 1, Publisher: ACM, Pages: 1-13, ISSN: 00010782
[8] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter
Reutemann, Ian H. Witten (2009); The WEKA Data Mining Software: An
Update; SIGKDD Explorations, Volume 11, Issue 1.
[9] B. German . & Vina Spiehler (1987). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
Information and Computer Science.
[10] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann,
1992.
[11] Michael Kahn, UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
Information and Computer Science.

G. Publication

Based on the above work, a research paper titled "Feature Selection using Distributed Ensemble Classifiers for Very Large Datasets" was written by the principal investigator with the guidance and co-authorship of Dr. S. Rajalakshmi, Head, CSE Department, SCSVMV University. The paper was presented at the 4th International Conference on Electronics Computer Technology (ICECT 2012), held at Kanyakumari from 6-8 April 2012, and is archived in IEEE Xplore and indexed by Ei Compendex and ISI.

The full text of the paper is enclosed as an annexure for detailed reference. The
financial support provided to carry out the research is acknowledged in the
research paper.

CHAPTER 3

ALGORITHM 2

Fuzzy-based
Sentence-level Document Clustering
for
Micro-level Contradiction Analysis

A. Introduction
The general objective of document clustering is to form groups of documents with
similar themes. In discovering this similarity, the usual approach is to develop a
term-document matrix listing all the terms that are present document-wise, and
finding similarity of documents based on this matrix. While dimension reduction is
imperative for efficiently manipulating the massive quantity of data, to be useful,
this lower dimensional representation must be a good approximation of the
original document set given in its full space. This does not hold good for
documents that have contradictory statements within themselves. For instance, in
the healthcare domain, a report may contain the opinion of three doctors, and a
second report may contain the opinions of other doctors. The contradictory opinions of doctors in the two reports may nullify each other's effects and may result in an incorrect clustering of the two documents.

B. Abstract of Proposed Solution


The proposed algorithm, instead of using the whole term-document matrix for comparing documents, uses only a subset of it to perform sentence-level comparisons, thereby identifying micro-level contradictions present in the documents. In this process, we not only cluster the documents correctly; we also identify the contradictory documents, as well as the reason for and the extent of contradiction between documents. A fuzzy-logic-based approach is used to draw conclusions about the similarity of two documents.

C. Description of Algorithm
1. Standard pre-processing operations for data cleaning, dimension reduction
(like stemming, removal of stop words etc.)
2. For each document Di, i=1..n containing words Wi1,Wi2….Wik
a. For each sentence in Di, compare its similarity with every sentence
of Dj, j= i+1..n using a similarity measure and save the result in a
matrix.

b. Depict the sentence similarity visually using the suggested
visualization scheme.
c. Generate a list of absolutely similar sentences and put them in
separate clusters
d. Generate the list of most contradictory sentences and put them in a
separate cluster
e. Evaluate the similarity of the two documents using the fuzzy threshold and output the result as similar documents, contradictory documents or documents with no relationship, with a percentage of similarity/contradiction.
f. Use the lists generated in steps c and d for all the subsequent comparisons between the next set of documents, to avoid repetition of comparisons.
g. Display the sentence similarities between documents
3. Perform document clustering using agglomerative clustering or any other similar technique.
4. Display the document clusters, themes as well as contradictory documents.
Similarity Measurement
Every word in a sentence is matched against every other word in the other sentence
to find its best matching score. Then the sum of all the matching scores is taken, and finally the sum is normalized by the total number of words in the two sentences.
The content similarity can be quantified by the following formula:

CS(S1, S2) = [ Σ(a in S1) max(b in S2) w(a, b) + Σ(b in S2) max(a in S1) w(a, b) ] / (s1 + s2)

where s1 and s2 are the total counts of words in the two sentences and w(a, b) ∈ [0, 1] is a term similarity function, which is defined as follows:

w(a, b) = 1, if a is the same as b
        = sim(a, b) otherwise, where sim(a, b) is the semantic term similarity given by the WordNet Similarity tool.
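The computation of CS can be sketched as follows. The wordSim method is only a stand-in for the WordNet-based semantic similarity tool mentioned above, stubbed out here so that the sketch is self-contained, and the normalization follows the formula as reconstructed above.

    // Sketch of the sentence-level content similarity computation.
    public final class ContentSimilarity {

        // Placeholder for the WordNet semantic similarity tool.
        static double wordSim(String a, String b) {
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }

        // Best matching score of one word against all words of the other sentence.
        static double bestMatch(String word, String[] other) {
            double best = 0.0;
            for (String o : other) best = Math.max(best, wordSim(word, o));
            return best;
        }

        // CS(S1, S2): sum of best-match scores in both directions, normalized
        // by the total number of words in the two sentences.
        static double cs(String[] s1, String[] s2) {
            double sum = 0.0;
            for (String w : s1) sum += bestMatch(w, s2);
            for (String w : s2) sum += bestMatch(w, s1);
            return sum / (s1.length + s2.length);
        }
    }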

Even prior to the above step, the two sentences are checked to see whether they conform to the same sentiment or have opposite sentiments, by identifying any opposite terms present in them. For example, a sentence "it is easy to ride this path"
when compared with a sentence “the road is tough to drive” will be taken as
sentences with opposite polarities. In such cases, the opposite terms, namely easy
and tough will be removed and the similarity of the sentences will be compared as
described above. The resultant similarity value CS, which lies in the range of [0,1]
will be taken with the opposite sign i.e. with the negative sign. This will help
identify the most contradictory statements.
At the end of this similarity comparison, we get a matrix where every sentence is
matched with every other sentence in the two documents and their content
similarity now lies in the range [-1,1]. The content similarity can now be
interpreted as follows:
CS(S1, S2) < 0  =>  Sentences S1, S2 are contradictory
CS(S1, S2) = 0  =>  Sentences S1, S2 are unrelated
CS(S1, S2) > 0  =>  Sentences S1, S2 are similar
The similarity increases as we move towards 1 and the contradiction increases as
we move towards -1. The similarity/contradiction can be represented visually on a scale from -1 (contradiction) through 0 to 1 (similarity).

At this stage, we build an absolute similarity cluster to which all the absolutely
similar sentences (i.e. with CS=1) are added along with the Document id. A similar
absolute contradiction cluster is built for sentences with CS=-1.
As an additional step, to present the results in an easily interpretable manner, a visualization of the similarity measurements between sentences with a colouring scheme is suggested, where the content similarity value is converted to an RGB value such that the most contradictory statements are indicated in red, neutral ones in yellow, and similarity increases from yellow to green, with the most similar sentences depicted in dark green. The formula to compute the colour is as follows:

[R, G, B] = [255*|CS|, 255-255*|CS|, 0], if CS<0
= [255-255*|CS|, 255*|CS|, 0], if CS>0
= [255, 255, 0], if CS=0
Where CS represents the content similarity between the two sentences.
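The colour mapping can be sketched directly from the formula above; the class and method names are illustrative only.

    final class SimilarityColour {
        // Map CS in [-1, 1] to [R, G, B] exactly as in the formula above.
        static int[] toRgb(double cs) {
            double a = Math.abs(cs);
            if (cs < 0) return new int[] {(int) (255 * a), (int) (255 - 255 * a), 0}; // towards red
            if (cs > 0) return new int[] {(int) (255 - 255 * a), (int) (255 * a), 0}; // towards green
            return new int[] {255, 255, 0};                                           // CS = 0: yellow
        }
    }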
Every sentence is paired up with every other statement, helping us identify a
sentence’s most similar or contradictory statement very easily.
Once the sentence-level relationships have been measured in detail, the next step is to
consolidate the same to infer about the relationship between the two documents. This is
computed as follows:


DS = (1/n) Σ(k=1..n) CS(Sk, PSk)

where DS is the document similarity
n is the number of sentences in the larger document
S represents a sentence and
PS represents a paired sentence. A paired sentence is the most similar or most
contradictory sentence from the other document.
The above summation consolidates the effect of the individual sentences in the two
documents. By virtue of normalization by dividing this summation with n, DS, the
Document Similarity will lie in the range of [-1, 1] as the value of CS lies in the range of
[-1, 1] as shown earlier. This Document Similarity Matrix will be of the following form:
        D1     D2     D3     D4
D1     1
D2     0.8    1
D3    -0.7    0.4    1
D4     0.2    0.1   -0.8    1
The above matrix is symmetric and hence, only the values below the diagonal are shown.
The relationship between the documents can be represented graphically by placing each document pair on the scale from -1 to 1 in increasing order of DS: (D3,D4), (D1,D3), (D2,D4), (D1,D4), (D2,D3), (D1,D2).
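A minimal sketch of the consolidation step, assuming the paired CS value for each sentence of the larger document has already been computed, is given below.

    final class DocumentSimilarity {
        // pairedCs[k] holds CS(Sk, PSk) for sentence k of the larger document,
        // where PSk is its most similar or most contradictory paired sentence.
        static double ds(double[] pairedCs) {
            double sum = 0.0;
            for (double cs : pairedCs) sum += cs;
            return sum / pairedCs.length;   // DS stays in [-1, 1] because each CS does
        }
    }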
Fuzziness is inherent in classifying two documents as contradictory or similar ones.
Hence, we use a fuzzy inference scheme where the following thresholds are set to
interpret the obtained value of Document Similarity DS between two documents:
-1 <= DS < -0.8    =>  Documents are highly contradictory
-0.8 <= DS < -0.3  =>  Documents are fairly contradictory
-0.3 <= DS < 0     =>  Documents are slightly contradictory
DS = 0             =>  Documents have no relationship
0 < DS <= 0.3      =>  Documents are slightly similar
0.3 < DS <= 0.8    =>  Documents are fairly similar
0.8 < DS <= 1      =>  Documents are highly similar
The absolute value of DS can be interpreted as the percentage of similarity (or contradiction) between any two documents. Again, the visualization scheme suggested earlier for sentence-level similarity can also be used to visualize the Document Similarity.
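The fuzzy interpretation of DS against these thresholds can be sketched as follows (illustrative only):

    final class FuzzyInterpretation {
        // Interpret a document-similarity value DS in [-1, 1] using the thresholds above.
        static String interpret(double ds) {
            if (ds == 0.0) return "no relationship";
            double a = Math.abs(ds);
            String strength = (a > 0.8) ? "highly" : (a > 0.3) ? "fairly" : "slightly";
            return strength + (ds < 0 ? " contradictory" : " similar");
        }
    }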

The above scheme is used to compare document similarity and to perform the document clustering iteratively, using a standard technique such as agglomerative clustering, as follows:

Using the Document Similarity Matrix shown above, the two most similar documents are
combined to form a cluster and the process is repeated till no more elements are left for
combination. For example

Step 1: The documents D2 and D1 are combined since they have max. similarity of 0.8.

Step 2: In the next step, the average similarity of other documents from D2 and D1 is
computed. For example, D3 to {D1, D2} is -0.15. D4 to {D1, D2} is 0.15. D3 to D4 is -
0.8. Among these three, we find that D4 is similar to {D1, D2}. Hence, D4 is combined.

Step 3: In the next step, when we compare D3, we find that it is contradictory. Hence, we
treat it as a separate cluster.
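The iterative merging described above can be sketched as a simple average-linkage procedure over the Document Similarity Matrix. The merge threshold of 0 and the stopping rule are assumptions chosen to reproduce the D1/D2/D4 example, not a prescribed part of the algorithm.

    import java.util.ArrayList;
    import java.util.List;

    final class AgglomerativeSketch {

        // ds is the symmetric document-similarity matrix with values in [-1, 1].
        // Clusters are merged while the best average similarity between two clusters
        // stays above the threshold; everything else remains a singleton cluster.
        static List<List<Integer>> cluster(double[][] ds, double threshold) {
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < ds.length; i++) {
                clusters.add(new ArrayList<>(List.of(i)));
            }
            while (true) {
                int bestA = -1, bestB = -1;
                double best = threshold;
                for (int a = 0; a < clusters.size(); a++) {
                    for (int b = a + 1; b < clusters.size(); b++) {
                        double s = averageSimilarity(ds, clusters.get(a), clusters.get(b));
                        if (s > best) { best = s; bestA = a; bestB = b; }
                    }
                }
                if (bestA < 0) break;                        // nothing similar enough left
                clusters.get(bestA).addAll(clusters.remove(bestB));
            }
            return clusters;
        }

        static double averageSimilarity(double[][] ds, List<Integer> c1, List<Integer> c2) {
            double sum = 0.0;
            for (int i : c1) for (int j : c2) sum += ds[i][j];
            return sum / (c1.size() * c2.size());
        }
    }

With the matrix above and a threshold of 0, this sketch first merges D1 and D2, then adds D4, and leaves D3 as a separate cluster, matching the steps listed above.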

D. Analysis
The dominant step in the algorithm is the sentence-level comparison step, where every sentence in a document is compared with every sentence from all the other documents. Let there be n documents and let document Di contain a total of Wi words. The first document will be compared with n-1 documents, the second with n-2, the third with n-3 and so on. In each comparison, the number of word comparisons involved is the product of the number of words in the first document and the number of words in the second document.
Hence, the number of comparisons
- for the first document will be W1 (W2 + W3 + ... + Wn)
- for the second document will be W2 (W3 + W4 + ... + Wn)
In general, for document i, the number of comparisons will be
Wi (Wi+1 + Wi+2 + ... + Wn), i.e. Wi * Σ(j=i+1..n) Wj
The overall number of comparisons for all documents will be
Σ(i=1..n-1) [ Wi * Σ(j=i+1..n) Wj ]
where n is the number of documents


While the high number of comparisons can be justified in view of the critical applications to which this algorithm will be applied, we suggest below some steps that can greatly reduce the number of comparisons by optimizing the algorithm, along with a few other factors that will speed up the algorithm:
a. In each stage of comparison in the algorithm, an absolutely-similar-sentences cluster and an absolutely-contradictory-sentences cluster are generated. This means that instead of comparing every sentence in each document with every other sentence, we can compare a sentence with just one representative sentence from the absolutely similar cluster and one representative sentence from the absolutely contradictory cluster, without any loss of information or accuracy. This greatly reduces the number of word comparisons required, especially as the algorithm progresses, since the similarity and contradiction clusters keep growing as more comparisons are completed.
As an example, if we have already found that the sentences "Details are sketchy" and "The amount of information available is hazy" from D1 belong to the same similar-sentence cluster, a new sentence can be compared with either one of them to measure the similarity.
b. An interesting preprocessing task can be performed on every document with
this very algorithm as a feature-reduction step. This can be done by
comparing each document with itself. The absolutely similar cluster and the
absolutely contradictory cluster will have sentence sets that are conveying
the same information redundantly. Hence, we can retain only one
representative from each of the above sentence sets, thereby leading to
feature reduction at sentence level. Such a preprocessing activity would fall

under the domain of Data Quality Mining, which refers to the use of Data
Mining Techniques to enhance the quality of data, especially in the
preprocessing stage. Further, the same preprocessing task can be applied to several other text mining tasks, such as summarization. This preprocessing can be done in parallel for all the documents.
c. With a marginal loss of accuracy, we can approximate the criterion for
formation of the absolutely similar or contradictory clusters. While we have
considered two sentences to be absolutely similar only if their Content
Similarity is =1 and to be absolutely contradictory only if their Content
Similarity is =-1, we can relax this to the extent required, leading to more
sentences being clustered as absolutely similar or contradictory ones. This
leads to an increase in the speed with corresponding loss of accuracy.
d. Other indirect means can be applied to improve the turnaround time of the algorithm. The algorithm can easily be adapted to perform distributed matching using distributed programming techniques such as Hadoop's MapReduce, and hence high-performance computing systems can be exploited to obtain quick results.
E. Conclusion
The experimental results have shown that the algorithm is quite successful in
finding the fully-correct and fully-wrong answers. In most of the cases, it follows
the general trends as indicated by the manual evaluation process. The suggested
algorithm has multiple benefits as indicated earlier for analysis of text documents
as well as for preprocessing text documents for further analysis. Tasks that can be performed include document summarization, contradiction analysis, and visual depiction of intra- and inter-document similarities. The algorithm can be further
extended to work in a distributed environment.

F. References
[1] I Dhillon et al. 2004. Feature selection and document clustering. Survey of
Text Mining: Clustering, classification and retrieval, Vol. 1 . 74-100.
[2] Farial Shahnaz, Michael W. Berry, V. Paul Pauca. 2006. Document Clustering using Nonnegative Matrix Factorization. Information Processing & Management, Volume 42, Issue 2, Pages 373-386.

[3] H. Liu, H. Motoda. 2010. Feature Selection – An ever evolving frontier in Data Mining. Journal of Machine Learning Research, Workshop and Conference Proceedings.
[4] Michael J. Paul, ChengXiang Zhai, and Roxana Girju. 2010. Summarizing
contrastive viewpoints in opinionated text. In Proceedings of the 2010
Conference on Empirical Methods in Natural Language Processing (EMNLP
'10). Association for Computational Linguistics, Stroudsburg, PA, USA, 66-76.
[5] Hyun Duk Kim, Cheng Xiang Zhai, 2009. Generating Comparative Summaries
of Contradictory Opinions in Text. In Proceedings of CIKM’09, November 2–
6, 2009, Hong Kong, China.
[6] P. Achananuparp, X. Hu, and X. Shen. 2008 The evaluation of sentence
similarity measures. DaWaK ’08: Proceedings of the 10th international
conference on Data Warehousing and Knowledge Discovery. 305–316.
Springer Verlag
[7] Timothy J Ross. 2008. Fuzzy Logic with Engineering Applications. Second
Edition. Wiley India.
[8] H. Liu. 2005. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering.
[9] J Hipp, U Guntzer et al. 2001. Data Quality Mining – Making a Virtue of
Necessity. In Proc. of SIGMOD DMKD Workshop.
[10] J.Dean, 2008 MapReduce: simplified data processing on large clusters.
Communications of the ACM - 50th anniversary issue: 1958 - 2008 ACM
Press Frontier Series. ACM, New York, NY, 19-33. DOI=
http://doi.acm.org/10.1145/1327452.1327

G. Publication

Based on the above work, a research paper titled "An algorithm for fuzzy-based Sentence-level Document Clustering for Micro-level Contradiction Analysis" was written by the principal investigator with the guidance and co-authorship of Dr. S. Rajalakshmi, Head, CSE Department, SCSVMV University. The paper was presented at the International Conference on Advances in Computing, Communications and Informatics held at RMK Engineering College, Chennai, from 2-4 Aug. 2012, and is indexed in the ACM Digital Library (http://dl.acm.org/citation.cfm?id=2345413).
The full text of the paper is enclosed as an annexure for detailed reference. The
financial support provided to carry out the research is acknowledged in the
research paper.

ANNEXURES

