Bonfring International Journal of Data Mining, Vol. 1, Special Issue, December 2011
ISSN 2250-107X | 2011 Bonfring

Abstract--- Data mining is the process of extracting patterns from data. It is now widely used in fields such as marketing, surveillance, fraud detection, scientific discovery and bioinformatics research. This survey gives particular attention to the classification and clustering approaches of data mining. Clustering in data mining must handle very large datasets with many classes of objects of different types, which imposes distinctive computational requirements on clustering algorithms. Classification, in turn, is a data mining task based on machine learning that is used to predict group membership for data samples. Classification approaches such as decision tree induction, Bayesian networks, the k-nearest neighbor classifier, case-based reasoning, genetic algorithms and fuzzy logic techniques are widely used in many areas. The aim of this survey is to give a wide-ranging evaluation of the different classification and clustering techniques in data mining, and to conclude which clustering and classification techniques are better suited to various fields.

Index Terms--- Data Mining, Clustering, Classification, Knowledge Extraction, Support Vector Machine, K-Means Clustering

I. INTRODUCTION

The main aim of this survey is to present a complete review of the various clustering techniques and classification approaches in data mining. Clustering is the partitioning of data into groups of similar objects. Representing the data by a smaller number of clusters necessarily loses fine detail but achieves generalization: many data objects are represented by a few clusters, so the data are modeled by their clusters. Data modeling places clustering in a historical perspective rooted in mathematics, statistics and numerical analysis.
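Clustering as just described, partitioning data into groups of similar objects around a few representative centers, can be illustrated with a minimal k-means sketch. This is illustrative code written for this survey's discussion, not an implementation from any of the cited works:

```python
# Minimal k-means sketch: represent many data objects by a few clusters.
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated groups of points collapse into two clusters.
pts = np.vstack([np.zeros((10, 2)), np.ones((10, 2)) * 5.0])
labels, cents = kmeans(pts, k=2)
print(sorted(np.bincount(labels).tolist()))  # [10, 10]
```

Each of the twenty data objects is now summarized by one of two cluster centroids, which is the generalization-versus-detail trade-off described above.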
As a result, data mining comprises data collection and management, as well as analysis and prediction. Clustering is unsupervised learning: it works from the observations alone, and no predefined class labels exist for the data points. Cluster analysis is used in a number of applications such as data exploration, image processing and market analysis. It helps in discovering the
distribution of patterns and the correlations among data objects [1].

R. Malathi Ravindran, Research Scholar, Assistant Professor of MCA, NGM College, Pollachi. Dr. N. Nalayini, Associate Professor, Department of Computer Science, NGM College, Pollachi.

Classification is another data mining task, used to assign data to classes effectively. The goal is to predict the value of the class label of a user-specified goal attribute from the values of the other attributes, known as predictive attributes. In the classification process, data mining algorithms can follow three different learning approaches: supervised, unsupervised, or semi-supervised. In supervised learning, the algorithm works with a set of examples whose labels are known; the labels are nominal values in a classification task and numerical values in a regression task. In unsupervised learning, the labels in the dataset are unknown and the algorithm typically groups examples according to the similarity of their attribute values, which characterizes a clustering task. Finally, semi-supervised learning is generally used when a small subset of labeled examples is available together with a large number of unlabeled examples. This work studies a data mining framework that includes both clustering and classification, and provides an analysis of existing classification methods for several areas.

II. SURVEY OF CLUSTERING AND CLASSIFICATION TECHNIQUES

Density-based algorithms in data mining need a metric space, and their usual setting is spatial data clustering (Han et al. 2001; Kolatch 2001). To make the computation practical, index structures such as the R*-tree are built over the data, but classic indices are useful only with reasonably low-dimensional data. The algorithm DENCLUE, which is in fact a combination of density-based clustering and grid-based preprocessing, is less affected by data dimensionality.
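Density-based methods grow clusters outward from regions where points are tightly packed and leave sparse points as noise. The following is a minimal DBSCAN-style sketch of that idea (illustrative only, not the DENCLUE algorithm itself; real systems replace the brute-force neighborhood search with a spatial index such as the R*-tree mentioned above):

```python
# Minimal density-based clustering sketch in the style of DBSCAN.
import numpy as np

def dbscan(points, eps=1.0, min_pts=3):
    n = len(points)
    labels = [-1] * n          # -1 means noise / not yet clustered
    cluster = 0
    # Brute-force eps-neighborhood of every point (a spatial index
    # would make this step sub-quadratic on large spatial datasets).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neigh = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    for i in range(n):
        if labels[i] != -1 or len(neigh[i]) < min_pts:
            continue           # already clustered, or not a core point
        # Grow a new cluster from core point i by expanding neighborhoods.
        labels[i] = cluster
        stack = list(neigh[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neigh[j]) >= min_pts:   # j is also a core point
                    stack.extend(neigh[j])
        cluster += 1
    return labels

pts = np.array([[0, 0], [0.5, 0], [0, 0.5],
                [5, 5], [5.5, 5], [5, 5.5],
                [20, 20]])                 # one isolated point
print(dbscan(pts, eps=1.0, min_pts=3))     # [0, 0, 0, 1, 1, 1, -1]
```

The two dense groups become clusters 0 and 1, while the isolated point stays labeled -1 (noise), which is exactly the behavior that distinguishes density-based algorithms from k-means.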
The most frequent requirement is to bound the number of points per cluster. Unfortunately, the k-means algorithm frequently produces a number of very small (in some implementations, empty) clusters. A modification of the k-means objective function and of the k-means updates that incorporates lower limits on cluster volumes is discussed in Bradley et al. [4]; it uses soft assignments of data points, with coefficients subject to linear programming constraints. Banerjee and Ghosh [5] introduced another modification of k-means: their objective function corresponds to an isotropic Gaussian mixture with widths inversely proportional to the number of points in the clusters, resulting in frequency-sensitive k-means. Strehl and Ghosh [6] converted the balanced clustering task into a graph partitioning problem.

A Thorough Investigation on the Clustering and Classification Techniques in Various Applications
R. Malathi Ravindran and Dr. N. Nalayini

T. Sakamoto et al. [10] performed a phylogenetic analysis of whole-genome sequences from samples obtained from the Arctic and from Japan and Asia, which revealed six distinct clusters in HBV/B. In each HBV genotype C subgroup, several clusters with genomic similarity to one another can be found. Two genotypes, B and C, appear among the 200-plus HBV DNA sequences gathered specifically for that project. While genotype B HBV appears to be a homogeneous group [11], the phylogenetic tree shows that three main clusters are already present in genotype C among the HBV strains collected [12]. Subgrouping of HBV genotype C based on intersubgroup variation of the nucleotide sequence is discussed in [13]. This agrees with earlier phylogenetic investigations of the full-length sequences available in GenBank.
The major reason to discover markers individually from the clusters (subgenotypes) obtained from the clustering analysis is that these subgenotypes exhibit mutations caused by geographical diversity, which are not markers for carcinogenic diagnosis. Cases reporting lifetime Injection Drug Use (IDU) were almost exclusively infected with genotype D, and all 12 cases who reported injecting within 6 months prior to diagnosis were infected with genotype D. It should be noted that, because social desirability or recall biases may cause underreporting of recent IDU [14], the number of cases reporting lifetime IDU may be more indicative of risk than the number reporting recent drug use. Regardless, these findings indicate that IDU is a major route of HBV transmission in BC and that clustering exists based on phylogenetic analysis. To this end, targeted vaccination may be necessary to decrease the transmission of HBV (genotype D) in high-risk populations such as IDU and incarcerated individuals. Rule learning using evolutionary algorithms performs a global search and can handle attribute interactions better than existing classification approaches [15][16]. Moreover, the classification rules produced are simple and easily interpretable by human experts, who regularly reason in a manner very much analogous to such rules. K.B. Xu et al. [17] introduced a weighted Choquet integral based on a fuzzy measure, which acts as an aggregation tool that projects the feature space onto a real axis optimally with respect to an error criterion; the classifying attribute is then analyzed numerically on that axis, making the classification simple. Implementing the classifier requires finding the unknown parameters, namely the values of the fuzzy measure and the weight function; this can be done by running an adaptive genetic algorithm on the training data.
The new classifier was tested by recovering known parameters from a set of artificial training data generated from those parameters, and it also performs well on various real-world data sets. Beyond discriminating classes, the method can learn the scaling and the relative significance indexes of the feature attributes, along with the relationships among them. These parameter values can be used to short-list significant feature attributes and thereby reduce the complexity (dimensionality) of the classification problem. C.C. Chang et al. [18] aim to help users easily apply SVM to their real applications. Their library, LIBSVM, has attained extensive popularity in machine learning and in many other fields, and the SVM classifier is widely used for classification in data mining and elsewhere. Nevertheless, some difficulties remain: solving the SVM optimization problem, theoretical convergence, multi-class classification, probability estimates, and parameter selection. A decision tree [20] is a tree-structured classifier learned by a recursive tree-growing process. Each test corresponding to an attribute is assessed on the training data by means of a test-criterion function, which assigns each test a score based on how well it partitions the data set. The test with the highest score is chosen and placed at the root of the tree. The subtrees of each node are then grown recursively by applying the same algorithm to the instances in each leaf, and the algorithm stops when the current node contains either all positive or all negative instances. Naive Bayes classifiers frequently work much better than expected in several difficult real-world settings. H. Zhang [19] studied the classification performance of naive Bayes.
He shows that the dependence distribution plays a crucial role: how the local dependence of a node is distributed in each class, evenly or unevenly, and how the local dependencies of all nodes work together, consistently (supporting a certain classification) or inconsistently (canceling each other out). Hence, no matter how strong the dependences among attributes are, naive Bayes can still be optimal if the dependences are distributed evenly across classes, or if they cancel each other out. S. Mika et al. [21] introduced a fast training algorithm for the kernel Fisher discriminant classifier. It uses a greedy approximation technique and has empirical scaling behavior that improves upon the state of the art by more than an order of magnitude, thereby rendering the kernel Fisher algorithm a feasible option even for large datasets.
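The recursive tree-growing process described above (score every candidate attribute test, place the best test at the node, and recurse until each node is pure) can be sketched as follows. This is illustrative code written for this discussion; the Gini impurity used as the test-criterion function is one common choice, not necessarily the one used in [20]:

```python
# Sketch of recursive decision-tree induction with a Gini test criterion.

def gini(labels):
    """Impurity of a partition: lower means purer."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def grow_tree(rows, labels):
    # Stop when the node contains all-positive or all-negative instances.
    if len(set(labels)) == 1:
        return labels[0]
    best = None
    # Assess a threshold test on every attribute; keep the best-scoring split.
    for attr in range(len(rows[0])):
        for threshold in sorted({r[attr] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[attr] <= threshold]
            right = [i for i in range(len(rows)) if i not in left]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, threshold, left, right)
    _, attr, threshold, left, right = best
    # Grow the subtrees recursively on the instances in each branch.
    return (attr, threshold,
            grow_tree([rows[i] for i in left], [labels[i] for i in left]),
            grow_tree([rows[i] for i in right], [labels[i] for i in right]))

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, threshold, left, right = tree
        tree = left if row[attr] <= threshold else right
    return tree

# Toy data: the label is "pos" exactly when the first attribute exceeds 2.
rows = [[1, 0], [2, 1], [3, 0], [4, 1]]
labels = ["neg", "neg", "pos", "pos"]
tree = grow_tree(rows, labels)
print(classify(tree, [0, 9]), classify(tree, [9, 0]))  # neg pos
```

On this toy data the scoring step selects the threshold test on the first attribute, which yields two pure leaves in a single split.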
Limitations of Conventional Clustering and Classification Approaches

K-Means Clustering: It is difficult to compare the quality of the clusters produced (e.g., different initial partitions or values of K affect the outcome). The fixed number of clusters makes it hard to predict what K should be, the method does not work well with non-globular clusters, and different initial partitions can result in different final clusters. It is therefore helpful to rerun the algorithm with the same as well as different K values and compare the results.

Fuzzy C-Means Clustering: It computes the neighborhood term in each iteration step, which is very time-consuming.

Support Vector Machine Classifier: It takes a long time for classification, and an important practical question that is not entirely solved is the selection of the kernel function parameters.

K-NN Classifier: The main disadvantage of the KNN algorithm is that it is a lazy learner, i.e., it does not learn anything from the training data and simply uses the training data itself for classification.
III. PROBLEMS AND DIRECTIONS

The aim of the clustering component is to determine whether clusters exist based on the phylogenetic tree analysis. If clusters are identified, each cluster is examined independently, because this reduces the noise produced by related data differences and yields much better classification accuracy. In earlier works the clustering approaches do not combine the clustering result with optimization methods; as a result, more noise remains in the data and accuracy suffers. Classification is an important data mining task used in many areas. A classification model for HCC analysis and prediction should have high sensitivity, appropriate accuracy and high specificity. The model learned should also give a clear indication of the degree of influence of each attribute on the classification goal, and of any interactions among the predictive attributes. In recent years there has been a great deal of research in the field of clustering and classification, and a number of techniques have been proposed. Fuzzy-based clustering techniques provide significant results, with higher clustering accuracy in less clustering time, and Evolutionary Algorithm (EA) based clustering approaches also give significant results. Moreover, swarm intelligence based classification techniques provide higher accuracy in less classification time. Swarm intelligence and neural network based classification algorithms have been widely used in applications such as gene classification and cancer classification. Swarm intelligence approaches include Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC) and Glowworm Swarm Optimization (GSO); neural network based approaches include Artificial Neural Networks (ANN), Fuzzy Neural Networks (FNN) and neuro-fuzzy approaches.
Advantages of Neural Network and Swarm Intelligence Clustering and Classification Approaches

Evolutionary Clustering Algorithm: Provides higher classification accuracy in less classification time, and performs well even on large datasets.

GA Clustering: Provides an optimal solution for the clustering results.

PSO- and ABC-based Classification: Nature-inspired algorithms with a lower error rate; they give optimal classification results for large datasets.
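As a concrete illustration of the swarm intelligence approaches discussed in this section, the following is a minimal Particle Swarm Optimization (PSO) sketch: particles move through the search space pulled toward their own best position and the swarm's best position so far. This is illustrative code only; in the surveyed applications the objective function would be a classification or clustering error over model parameters rather than the toy sphere function minimized here, and the coefficients are common textbook choices, not values from any cited paper:

```python
# Minimal PSO sketch minimizing a toy objective function.
import numpy as np

def pso(objective, dim=2, particles=20, iters=100, seed=1):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, size=(particles, dim))
    vel = np.zeros((particles, dim))
    pbest = pos.copy()                                   # personal bests
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()             # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, particles, dim))
        # Inertia + pull toward personal best + pull toward global best.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest

# Minimize the sphere function; the optimum is at the origin.
best = pso(lambda x: float((x ** 2).sum()))
print("best objective:", float((best ** 2).sum()))
```

The same loop applies unchanged to classifier training: each particle encodes a candidate parameter vector and the objective is the training error, which is how PSO-based classification achieves a global search without gradients.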
IV. CONCLUSION

This work surveys existing classification and clustering methods; their individual characteristics and specific functionality are studied. It motivates a new clustering method that groups the data efficiently and reduces the noisy data within each cluster group. The existing classification methods discussed in this survey achieve good accuracy, but alternative classification methods are still needed to improve the accuracy of the results in various applications. In future work, different evolutionary classification algorithms will be used to improve classification accuracy.

REFERENCES
[1] S. Anitha Elavarasi, J. Akilandeswari and B. Sathiyabhama, "A Survey on Partition Clustering Algorithms", January 2011.
[2] J. Han, M. Kamber and A.K.H. Tung, "Spatial Clustering Methods in Data Mining: A Survey", in H. Miller and J. Han (Eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001.
[3] E. Kolatch, "Clustering Algorithms for Spatial Databases: A Survey", 2001.
[4] P.S. Bradley, K.P. Bennett and A. Demiriz, "Constrained K-Means Clustering", Technical Report MSR-TR-2000-65, Microsoft Research, Redmond, WA, 2000.
[5] A. Banerjee and J. Ghosh, "On Scaling Up Balanced Clustering Algorithms", in Proceedings of the 2nd SIAM ICDM, pp. 333-349, Arlington, VA, 2002.
[6] A. Strehl and J. Ghosh, "A Scalable Approach to Balanced, High-Dimensional Clustering of Market Baskets", in Proceedings of the 17th International Conference on High Performance Computing, Springer LNCS, pp. 525-536, Bangalore, India, 2000.
[7] T. Sakamoto, Y. Tanaka, J. Simonetti, C. Osiowy, M.L. Børresen, A. Koch, F. Kurbanov, M. Sugiyama, G.Y. Minuk, B.J. McMahon, T. Joh and M. Mizokami, "Classification of Hepatitis B Virus Genotype B into 2 Major Types Based on Characterization of a Novel Subgenotype in Arctic Indigenous Populations", J. Infectious Diseases, vol. 196, pp. 1487-1492, 2007.
[8] F. Sugauchi, H. Kumada, H. Sakugawa, M. Komatsu, H. Niitsuma, H. Watanabe, Y. Akahane, H. Tokita, T. Kato, Y. Tanaka, E. Orito, R. Ueda, Y. Miyakawa and M. Mizokami, "Two Subtypes of Genotype B (Ba and Bj) of Hepatitis B Virus in Japan", Clinical Infectious Diseases, vol. 38, pp. 1222-1228, 2004.
[9] H.L.Y. Chan, S.K.W. Tsui, E.Y.T. Ng, P.C.H. Tse, K.S. Leung, K.H. Lee, T. Mok, A. Bartholomeusz, T.C.C. Au and J.J.Y. Song, "Epidemiological and Virological Characteristics of Two Subgroups of Genotype C Hepatitis Virus", J. Infectious Diseases, vol. 191, pp. 2022-2032, 2005.
[10] S.M. Bowyer and J.G.M. Sim, "Relationships within and between Genotypes of Hepatitis B Virus at Points Across the Genome: Footprints of Recombination in Certain Isolates", J. General Virology, vol. 81, pp. 379-392, 2000.
[11] T.E. Perlis, D.C. Des Jarlais, S.R. Friedman et al., "Audio-Computerized Self-Interviewing versus Face-to-Face Interviewing for Research Data Collection at Drug Abuse Treatment Programs", Addiction, vol. 99, pp. 885-896, 2004.
[12] A.A. Freitas, "A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery", in A. Ghosh and S. Tsutsui (Eds.), Advances in Evolutionary Computation, Springer-Verlag, 2002.
[13] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn and A.K. Jain, "Dimensionality Reduction Using Genetic Algorithms", IEEE Trans. Evolutionary Computing, vol. 4, no. 2, pp. 164-171, July 2000.
[14] K.B. Xu, Z.Y. Wang, P.A. Heng and K.S. Leung, "Classification by Nonlinear Integral Projections", IEEE Trans. Fuzzy Systems, vol. 11, no. 2, pp. 187-201, April 2003.
[15] C.C. Chang and C.J. Lin, "LIBSVM: A Library for Support Vector Machines", Software, http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[17] H. Zhang, "The Optimality of Naive Bayes", in Proc. 17th International Florida Artificial Intelligence Research Society (FLAIRS) Conference, 2004.
[18] "Data Mining Tools See5 and C5.0", Software, http://www.rulequest.com/see5-info.html, May 2006.
[19] S. Mika, A.J. Smola and B. Scholkopf, "An Improved Training Algorithm for Fisher Kernel Discriminants", in Proc. Artificial Intelligence and Statistics (AISTATS 2001), T. Jaakkola and T. Richardson (Eds.), pp. 98-104, 2001.