Minor Research Project Report
Submitted to
Principal Investigator
R. Vasanth Kumar Mehta
Asst. Professor
Department of Computer Science & Engineering
Acknowledgement
INDEX
1 Introduction
2 Chapter 1 – Algorithm 1
3 Chapter 2 – Algorithm 2
4 Annexures
CHAPTER 1
INTRODUCTION
1. Very large datasets - Challenge in Data Mining
Figure 1. Map-Reduce Programming Model.
1. The computation takes a set of input key/value pairs, and produces a set of
output key/value pairs. The user of the MapReduce library expresses the
computation as two functions: Map and Reduce.
2. Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs. The MapReduce library groups together all
intermediate values associated with the same intermediate key I and passes
them to the Reduce function.
3. The Reduce function, also written by the user, accepts an intermediate key I
and a set of values for that key. It merges together these values to form a
possibly smaller set of values. Typically, just zero or one output value is
produced per Reduce invocation.
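The model described above can be illustrated with a minimal in-memory sketch (plain Python rather than the actual MapReduce library, with word counting as a stand-in computation; in a real deployment the phases run distributed across machines):

```python
from collections import defaultdict

def map_fn(key, value):
    # Map: an input pair (filename, text) -> intermediate (word, 1) pairs.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: merge all values for one intermediate key into one output value.
    return (key, sum(values))

def map_reduce(inputs):
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ik, iv in map_fn(key, value):
            intermediate[ik].append(iv)   # group values by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

counts = map_reduce([("doc1", "big data big cluster"), ("doc2", "big data")])
# counts: {"big": 3, "data": 2, "cluster": 1}
```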
3. Project objectives
(a) To design and develop parallel algorithms for data mining of very large
datasets
(b) To study the performance of the algorithms and compare them with
the sequential implementations
Two specific areas were chosen for the design and implementation of the parallel
algorithms and analysis:
a. Feature selection using distributed ensemble classifiers for very large
datasets
b. Sentence-level Document clustering for Micro-level Contradiction Analysis
The above two problems were chosen in view of the huge volumes of data involved,
which are typically difficult to process on a single node and inherently suited for
processing on a cluster of machines.
Two independent algorithms were designed, implemented, and evaluated.
Experiments were carried out with specific datasets and the outcomes were
recorded. The findings were published in two conferences for validation and
acceptance by the wider academic and research community. This report contains a
description of the problem statements, the methodology used, the experimental
data and the outcomes of both experiments.
CHAPTER 2
ALGORITHM 1
A. Introduction
Advances in data collection and storage capabilities during recent times have led to
an information overload. The number and size of Very Large Datasets is
constantly on the rise. Such datasets, in contrast with smaller, more traditional
datasets that have been studied extensively in the past, present new challenges in
data analysis. Traditional statistical methods break down partly because of the
increase in the number of observations, but mostly because of the increase in the
number of variables associated with each observation. The dimension of the data
is the number of variables that are measured on each observation.
While the increasing dimensions and size of datasets pose a new challenge
to data miners, the solution to this challenge is available in two forms – one, in the
form of high performance computing capabilities through grids and clusters, and
second, in the form of data pre-processing techniques like feature selection that
seek to exploit the fact that not all the measured variables are of equal importance
for representing the underlying phenomena of interest.
The decision tree is a hierarchical model for supervised learning in which the
local region is identified through a sequence of recursive splits. To overcome the
drawback of the attribute-independence assumption in the Naïve Bayes method,
Hall proposed a filter method that assigns each attribute a weight based on the
degree to which it depends on the values of other attributes. The assumption made
is that the weight assigned to a predictive attribute should be inversely related to
the degree of dependency it has on other attributes. In the proposed algorithm, the
above mechanism is adapted to compute the dependencies of the features in a
distributed computing environment, namely Hadoop.
C. Description of Algorithm
Random samples are drawn from the training set, creating as many sample sets as
there are data nodes in the Hadoop cluster. Each node in the
cluster has a different sample set and will execute the following operation for
dependency computation to estimate the degree to which an attribute depends on
others:
a. build an unpruned decision tree from the node's sample set.
b. note the minimum depth d at which the attribute is tested in the tree. The
deeper an attribute occurs in a tree, the more dependent it is. The depth of the root
is assumed to be 1 and the dependency D of an attribute can be measured using the
Eq.(1): D = 1 / d (1)
where d is the minimum depth at which the attribute is tested. Nodes with more
dependency will have a lesser D value and nodes with lesser dependency will have
a larger D value. If an attribute is not tested, we can assume its dependency as
zero.
c. Make a vector of the dependencies of all the attributes. This will be the
output of each node.
The vectors of estimated dependencies of the attributes from each of the nodes are
now combined by averaging them. This process of averaging helps in offsetting
any bias caused by data sampling. This averaged estimated dependency vector is
now an indicator of the relative dependencies of the attributes. At this stage, we
can take the number of features n to be selected / discarded as an input from the
user and accordingly, select either the top n attributes or the bottom n attributes as
per the sequence in which they occur in the averaged estimated dependency vector.
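The averaging and selection steps above can be sketched as follows (a single-machine illustration; the per-node decision-tree induction that produces each dependency vector is omitted, and the sample vectors shown are hypothetical):

```python
def average_dependency(vectors):
    # Average the per-node dependency vectors (D = 1/d per attribute,
    # 0 if never tested) to offset bias caused by data sampling.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_features(avg, k, keep_top=True):
    # Rank attributes by averaged dependency and return the indices of
    # the k attributes to keep (top) or to discard (bottom).
    order = sorted(range(len(avg)), key=lambda i: avg[i], reverse=keep_top)
    return sorted(order[:k])

# Hypothetical dependency vectors from three data nodes (4 attributes).
node_vectors = [
    [1.0, 0.5, 0.0, 0.33],   # node 1: attribute 0 tested at the root
    [1.0, 0.33, 0.5, 0.0],   # node 2
    [0.5, 1.0, 0.0, 0.33],   # node 3
]
avg = average_dependency(node_vectors)
top2 = select_features(avg, 2)   # indices of the two top-ranked attributes
```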
D. Experimental Results
The algorithm was implemented and tested on the Glass Dataset and the Diabetes
Dataset. The experiments were carried out on a Hadoop Cluster with 3 data nodes.
MapReduce programming was used to implement the feature selection algorithm.
The Glass Dataset has 9 attributes and 214 instances. On applying the proposed
method using the C4.5 decision tree algorithm without pruning, dependency values
were obtained on each of the three data nodes. The features were re-ordered based
on the estimated dependency, the most insignificant features were removed, and
the impact of this feature removal was measured by carrying out classification on
the reduced dataset.
The above steps were repeated on the Diabetes Dataset.
The detailed results are provided in the enclosed research paper.
E. Conclusion
The results showed that the algorithm significantly reduces the tree size while
increasing the average precision and recall. Hence, the experimental results
demonstrate that the proposed solution can significantly reduce the computing
effort for performing data mining on very large datasets, along with an increase in
accuracy as indicated by the higher precision and recall values obtained on the
reduced dataset.
F. References
[1] Molina, L.C., Belanche, L., Nebot, A., 2002. Attribute Selection Algorithms: A
    Survey and Experimental Evaluation. Proceedings of 2nd IEEE KDD, pp. 306-313.
[2] Guyon, I., Elisseeff, A., 2003. An Introduction to Variable and Feature
    Selection. Journal of Machine Learning Research 3, pp. 1157-1182.
[3] Kohavi, R., John, G.H., 1997. Wrappers for feature subset selection. Artificial
    Intelligence 97, pp. 273-324.
[4] Yu, L., Liu, H., 2003. Feature Selection for High-Dimensional Data: A Fast
    Correlation-based Filter Solution. Proceedings of ICML 2003, pp. 856-863.
[5] Hall, M., 2007. A decision tree-based attribute weighting filter for naive Bayes.
    Knowledge-Based Systems, Volume 20, Issue 2, March 2007, pp. 120-126.
    ISSN 0950-7051.
[6] John, G.H., Langley, P., 1995. Estimating Continuous Distributions in Bayesian
    Classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence,
    San Mateo, pp. 338-345.
[7] Dean, J., Ghemawat, S., 2008. MapReduce: Simplified Data Processing on Large
    Clusters. Communications of the ACM, Volume 51, Issue 1, pp. 107-113.
    ISSN 0001-0782.
[8] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.,
    2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations,
    Volume 11, Issue 1.
[9] German, B., Spiehler, V., 1987. UCI Machine Learning Repository
    [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
    Information and Computer Science.
[10] Quinlan, J.R., 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann.
[11] Kahn, M. UCI Machine Learning Repository
    [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
    Information and Computer Science.
G. Publication
Based on the above work, a research paper titled “Feature Selection using
Distributed Ensemble Classifiers for Very Large Datasets” was written by
the principal investigator with the guidance and co-authorship of Dr. S.
Rajalakshmi, Head, CSE Department, SCSVMV University. The paper was
presented at the 4th International Conference on Electronics Computer
Technology (ICECT 2012), held at Kanyakumari from 6-8 April 2012, and is
archived in IEEE Xplore and indexed by Ei Compendex and ISI.
The full text of the paper is enclosed as an annexure for detailed reference. The
financial support provided to carry out the research is acknowledged in the
research paper.
CHAPTER 3
ALGORITHM 2
Fuzzy-based
Sentence-level Document Clustering
for
Micro-level Contradiction Analysis
A. Introduction
The general objective of document clustering is to form groups of documents with
similar themes. In discovering this similarity, the usual approach is to develop a
term-document matrix listing all the terms that are present document-wise, and
finding similarity of documents based on this matrix. While dimension reduction is
imperative for efficiently manipulating the massive quantity of data, to be useful,
this lower dimensional representation must be a good approximation of the
original document set given in its full space. This does not hold good for
documents that have contradictory statements within themselves. For instance, in
the healthcare domain, one report may contain the opinions of three doctors, and a
second report may contain the opinions of other doctors. The contradictory
opinions of the doctors in the two reports may nullify each other's effects and
result in a wrong clustering of the two documents.
C. Description of Algorithm
1. Standard pre-processing operations for data cleaning, dimension reduction
(like stemming, removal of stop words etc.)
2. For each document Di, i=1..n containing words Wi1,Wi2….Wik
a. For each sentence in Di, compare its similarity with every sentence
of Dj, j= i+1..n using a similarity measure and save the result in a
matrix.
b. Depict the sentence similarity visually using the suggested
visualization scheme.
c. Generate a list of absolutely similar sentences and put them in
separate clusters
d. Generate the list of most contradictory sentences and put them in a
separate cluster
e. Evaluate the similarity of the two documents using the fuzzy
threshold and classify the pair as similar documents, contradictory
documents or documents with no relationship, with a percentage of
similarity/contradiction.
f. Use the lists generated in steps c and d for all the subsequent
comparisons between the next set of documents, to avoid repetition
of comparisons.
g. Display the sentence similarities between documents
3. Perform document clustering using agglomerative clustering or any other
similar technique.
4. Display the document clusters, themes as well as contradictory documents.
Similarity Measurement
Every word in a sentence is matched against every other word in the other sentence
to find its best matching score. Then the sum of all the matching scores is taken,
and finally the sum is normalized by the total number of words in the two
sentences.
The content similarity can be quantified by the following formula:
CS(S1, S2) = ( Σ_{a ∈ S1} max_{b ∈ S2} w(a, b) + Σ_{b ∈ S2} max_{a ∈ S1} w(a, b) ) / (s1 + s2)
where s1 and s2 are the total counts of words in the two sentences and w(a, b)
∈ [0, 1] is a term similarity function, which is defined as follows:
w(a, b) = 1, if a is the same as b
        = sim(a, b) otherwise, where sim(a, b) is the semantic term similarity
          given by the WordNet Similarity tool.
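The content similarity measure can be sketched as follows (a minimal illustration: the WordNet-based sim(a, b) is replaced here by a stand-in that scores exact matches only, and a real semantic similarity function is assumed for actual use):

```python
def term_sim(a, b):
    # Placeholder for the WordNet semantic term similarity sim(a, b).
    return 1.0 if a == b else 0.0

def content_similarity(s1, s2):
    # Each word is matched against every word of the other sentence to
    # find its best matching score; the summed scores are normalized by
    # the total number of words in the two sentences.
    w1, w2 = s1.lower().split(), s2.lower().split()
    best1 = sum(max(term_sim(a, b) for b in w2) for a in w1)
    best2 = sum(max(term_sim(b, a) for a in w1) for b in w2)
    return (best1 + best2) / (len(w1) + len(w2))
```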
Even prior to the above step, the two sentences are checked to see whether they
conform to the same sentiment or have opposite sentiments, by identifying any
opposites present in them. For example, a sentence “it is easy to ride this path”
when compared with a sentence “the road is tough to drive” will be taken as
sentences with opposite polarities. In such cases, the opposite terms, namely easy
and tough will be removed and the similarity of the sentences will be compared as
described above. The resultant similarity value CS, which lies in the range of [0,1]
will be taken with the opposite sign i.e. with the negative sign. This will help
identify the most contradictory statements.
At the end of this similarity comparison, we get a matrix where every sentence is
matched with every other sentence in the two documents and their content
similarity now lies in the range [-1,1]. The content similarity can now be
interpreted as follows:
CS(S1, S2) < 0 => Sentences S1, S2 are contradictory
CS(S1, S2) = 0 => Sentences S1, S2 are unrelated
CS(S1, S2) > 0 => Sentences S1, S2 are similar
The similarity increases as we move towards 1 and the contradiction increases as
we move towards -1. The similarity/contradiction can be represented visually on a
number line running from -1 (contradiction) through 0 to 1 (similarity).
At this stage, we build an absolute similarity cluster to which all the absolutely
similar sentences (i.e. with CS=1) are added along with the Document id. A similar
absolute contradiction cluster is built for sentences with CS=-1.
As an additional step to present the results in an easily interpretable manner, a
visualization of the similarity measurements between the sentences with a coloring
scheme is suggested where the content similarity value is converted to an RGB
value such that the most contradictory statements are indicated in red, neutral ones
are yellow and similarity increases from yellow to green, with most similar
sentences depicted in dark green. The formula to compute colour is as follows:
[R, G, B] = [255, 255 - 255*|CS|, 0], if CS < 0
          = [255 - 255*CS, 255, 0], if CS > 0
          = [255, 255, 0], if CS = 0
where CS represents the content similarity between the two sentences. This maps
the most contradictory pair (CS = -1) to red, a neutral pair (CS = 0) to yellow and
the most similar pair (CS = 1) to green, fading linearly in between.
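One continuous mapping consistent with the described red–yellow–green scheme can be sketched as follows (an illustrative implementation, not taken verbatim from the paper):

```python
def cs_to_rgb(cs):
    # Map content similarity cs in [-1, 1] to a colour: red for the most
    # contradictory pair, yellow for a neutral pair, green for the most
    # similar pair, fading linearly in between.
    if cs < 0:
        return (255, int(255 * (1 - abs(cs))), 0)   # red -> yellow
    if cs > 0:
        return (int(255 * (1 - cs)), 255, 0)        # yellow -> green
    return (255, 255, 0)                            # neutral: yellow
```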
Every sentence is paired up with every other statement, helping us identify a
sentence’s most similar or contradictory statement very easily.
Once the sentence-level relationships have been measured in detail, the next step is to
consolidate the same to infer about the relationship between the two documents. This is
computed as follows:
DS = (1/n) * Σ_{k=1}^{n} CS(S_k, PS_k)
where DS is the document similarity,
n is the number of sentences in the larger document,
S_k represents a sentence and
PS_k represents its paired sentence. A paired sentence is the most similar or most
contradictory sentence from the other document.
The above summation consolidates the effect of the individual sentences in the two
documents. By virtue of normalization by dividing this summation with n, DS, the
Document Similarity will lie in the range of [-1, 1] as the value of CS lies in the range of
[-1, 1] as shown earlier. This Document Similarity Matrix will be of the following form:
      D1    D2    D3    D4
D1     1
D2     0.8   1
D3    -0.7   0.4   1
D4     0.2   0.1  -0.8   1
The above matrix is symmetric and hence, only the values below the diagonal are shown.
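The consolidation into DS can be sketched as follows (assuming the pairwise sentence CS values have already been computed, and taking the paired sentence to be the one with the largest-magnitude CS in each row):

```python
def document_similarity(cs_matrix):
    # cs_matrix[i][j]: content similarity CS in [-1, 1] between sentence i
    # of the larger document and sentence j of the other document.
    n = len(cs_matrix)
    total = 0.0
    for row in cs_matrix:
        # Paired sentence: the most similar or most contradictory match,
        # i.e. the CS value of largest magnitude in the row.
        total += max(row, key=abs)
    return total / n   # normalization by n keeps DS in [-1, 1]
```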
The relationship between the documents can be graphically represented by placing
each document pair on a number line from -1 to 1 according to its DS value:
(D3,D4) and (D1,D3) lie toward the contradiction end, (D2,D4), (D1,D4) and
(D2,D3) lie near the middle, and (D1,D2) lies toward the similarity end.
Fuzziness is inherent in classifying two documents as contradictory or similar ones.
Hence, we use a fuzzy inference scheme where the following thresholds are set to
interpret the obtained value of Document Similarity DS between two documents:
-1 <= DS < -0.8  => Documents are highly contradictory
-0.8 <= DS < -0.3 => Documents are fairly contradictory
-0.3 <= DS < 0   => Documents are slightly contradictory
DS = 0           => Documents have no relationship
0 < DS <= 0.3    => Documents are slightly similar
0.3 < DS <= 0.8  => Documents are fairly similar
0.8 < DS <= 1    => Documents are highly similar
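The threshold scheme above can be written directly as a small interpretation function (a sketch; the placement of the boundary cases at the exact threshold values is a judgment call):

```python
def interpret_ds(ds):
    # Interpret document similarity DS in [-1, 1] using the fuzzy
    # thresholds described above.
    if ds == 0:
        return "no relationship"
    if ds < 0:
        mag = -ds
        if mag > 0.8:
            return "highly contradictory"
        if mag > 0.3:
            return "fairly contradictory"
        return "slightly contradictory"
    if ds > 0.8:
        return "highly similar"
    if ds > 0.3:
        return "fairly similar"
    return "slightly similar"
```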
The absolute value of DS can be interpreted as the percentage of similarity between any
two documents. Again, we could use the visualization scheme suggested earlier for the
sentence level similarity depiction to visualize the Document Similarity also.
The above scheme is used to compare the document similarity to perform the document
clustering using some standard technique like agglomerative clustering iteratively as
follows:
Using the Document Similarity Matrix shown above, the two most similar documents are
combined to form a cluster and the process is repeated till no more elements are left for
combination. For example
Step 1: The documents D2 and D1 are combined since they have max. similarity of 0.8.
Step 2: In the next step, the average similarity of other documents from D2 and D1 is
computed. For example, D3 to {D1, D2} is -0.15. D4 to {D1, D2} is 0.15. D3 to D4 is -
0.8. Among these three, we find that D4 is similar to {D1, D2}. Hence, D4 is combined.
Step 3: In the next step, when we compare D3, we find that it is contradictory. Hence, we
treat it as a separate cluster.
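The worked example above can be sketched as average-linkage agglomerative clustering over the document similarity matrix (a minimal single-machine illustration; merging stops when the best available average similarity is no longer positive):

```python
def cluster_documents(sim, names):
    # sim: symmetric matrix as a dict of dicts, sim[a][b] in [-1, 1].
    clusters = [[n] for n in names]

    def avg_sim(c1, c2):
        # Average similarity between two clusters (average linkage).
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(sim[a][b] for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        if avg_sim(clusters[i], clusters[j]) <= 0:
            break  # remaining clusters are unrelated or contradictory
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

On the matrix shown earlier this reproduces the three steps: D1 and D2 merge first, D4 joins them, and D3 remains a separate contradictory cluster.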
D. Analysis
The dominant step in the algorithm is the sentence-level clustering step, where
every sentence in a document is compared with every other sentence from all the
other documents. Let Wi denote the total number of words in document Di, and let
there be n documents. The first document is compared with the remaining n-1
documents, the second with n-2, the third with n-3 and so on. In each comparison
between two documents, the number of word comparisons involved is the product
of the word counts of the two documents.
Hence, the number of comparisons
- for the first document will be W1 (W2 + W3 + ... + Wn), where n is the number of
documents;
- for the second document will be W2 (W3 + W4 + ... + Wn), and so on.
In general, for document i, the number of comparisons will be
Wi * Σ_{j=i+1}^{n} Wj
The overall number of comparisons for all documents will be
Σ_{i=1}^{n-1} ( Wi * Σ_{j=i+1}^{n} Wj )
under the domain of Data Quality Mining, which refers to the use of Data
Mining Techniques to enhance the quality of data, especially in the
preprocessing stage. Further, this same preprocessing task can be applied to
several other text mining tasks like summarization etc. This preprocessing
task can be done in parallel for all the documents.
c. With a marginal loss of accuracy, we can approximate the criterion for
formation of the absolutely similar or contradictory clusters. While we have
considered two sentences to be absolutely similar only if their Content
Similarity is =1 and to be absolutely contradictory only if their Content
Similarity is =-1, we can relax this to the extent required, leading to more
sentences being clustered as absolutely similar or contradictory ones. This
leads to an increase in speed, with a corresponding loss of accuracy.
d. Other indirect means can be applied to enhance the turnaround time of the
algorithm. The algorithm can be easily adapted to perform distributed
matching using some distributed programming techniques like Hadoop’s
map-reduce and hence, high performance computing systems can be
exploited to obtain quick results.
E. Conclusion
The experimental results have shown that the algorithm is quite successful in
finding the fully-correct and fully-wrong answers. In most of the cases, it follows
the general trends as indicated by the manual evaluation process. The suggested
algorithm has multiple benefits as indicated earlier for analysis of text documents
as well as for preprocessing text documents for further analysis. Tasks that can be
performed include document summarization, contradiction analysis, and visual
depiction of intra- and inter-document similarities. The algorithm can be further
extended to work in a distributed environment.
F. References
[1] Dhillon, I., et al., 2004. Feature selection and document clustering. Survey of
    Text Mining: Clustering, Classification and Retrieval, Vol. 1, pp. 74-100.
[2] Shahnaz, F., Berry, M.W., Pauca, V.P., 2006. Document clustering using
    nonnegative matrix factorization. Information Processing & Management,
    Volume 42, Issue 2, pp. 373-386.
[3] Liu, H., Motoda, H., 2010. Feature Selection – An Ever Evolving Frontier in
    Data Mining. Journal of Machine Learning Research, Workshop and
    Conference Proceedings.
[4] Paul, M.J., Zhai, C., Girju, R., 2010. Summarizing contrastive viewpoints in
    opinionated text. In Proceedings of the 2010 Conference on Empirical Methods
    in Natural Language Processing (EMNLP '10), Association for Computational
    Linguistics, Stroudsburg, PA, USA, pp. 66-76.
[5] Kim, H.D., Zhai, C., 2009. Generating Comparative Summaries of Contradictory
    Opinions in Text. In Proceedings of CIKM '09, November 2-6, 2009,
    Hong Kong, China.
[6] Achananuparp, P., Hu, X., Shen, X., 2008. The evaluation of sentence
    similarity measures. In DaWaK '08: Proceedings of the 10th International
    Conference on Data Warehousing and Knowledge Discovery, Springer-Verlag,
    pp. 305-316.
[7] Ross, T.J., 2008. Fuzzy Logic with Engineering Applications, Second Edition.
    Wiley India.
[8] Liu, H., 2005. Toward integrating feature selection algorithms for classification
    and clustering. IEEE Transactions on Knowledge and Data Engineering.
[9] Hipp, J., Guntzer, U., et al., 2001. Data Quality Mining – Making a Virtue of
    Necessity. In Proc. of the SIGMOD DMKD Workshop.
[10] Dean, J., 2008. MapReduce: simplified data processing on large clusters.
    Communications of the ACM – 50th anniversary issue: 1958-2008, ACM,
    New York, NY, pp. 19-33. DOI=
    http://doi.acm.org/10.1145/1327452.1327
G. Publication
Based on the above work, a research paper titled “An algorithm for fuzzy-
based Sentence-level Document Clustering for Micro-level Contradiction
Analysis” was written by the principal investigator with the guidance and
co-authorship of Dr. S. Rajalakshmi, Head, CSE Department, SCSVMV University.
The paper was presented at the International Conference on Advances in
Computing, Communications and Informatics held at RMK Engineering College,
Chennai, from 2-4 Aug. 2012, and is indexed in the ACM Digital
Library (http://dl.acm.org/citation.cfm?id=2345413).
The full text of the paper is enclosed as an annexure for detailed reference. The
financial support provided to carry out the research is acknowledged in the
research paper.
ANNEXURES