CS402 Mod 3
INTRODUCTION TO CLASSIFICATION

Classification and prediction are two forms of data analysis that extract models describing important data classes or predict future data trends. Such models, called classifiers, predict categorical (discrete, unordered) class labels. Whereas classification predicts categorical labels, prediction models continuous-valued functions.

CLASSIFICATION

Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data). The process is shown for the loan application data of the figure. In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or "learning from" a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute.

PREDICTION

Prediction can be viewed as the construction and use of a model to assess the class of an unlabelled sample, or to assess the value or value ranges of an attribute that a given sample is likely to have. In this view, classification and regression are the two major types of prediction problems: classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values. Conventionally, however, we refer to the use of prediction to predict class labels as classification, and to the use of prediction to predict continuous values (using regression techniques) as prediction. This is the most accepted view in data mining.

ISSUES REGARDING CLASSIFICATION AND PREDICTION

1) Preparing data for classification and prediction
* Data Cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g. by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
* Relevance Analysis: Many of the attributes in the data may be irrelevant to the classification or prediction task. Relevance analysis may be performed on the data with the aim of removing any irrelevant or redundant attributes from the learning process. In machine learning, this step is known as feature selection.
* Data Transformation: The data can be generalized to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income may be generalized to discrete ranges such as low, medium and high. Since generalization compresses the original training data, fewer input/output operations may be involved during learning (a minimal sketch of these preparation steps follows this list).
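As a minimal sketch of these preparation steps, the following Python snippet fills missing values and generalizes income to low/medium/high. The loan-application records and the cut points for the income ranges are hypothetical, chosen only for illustration:

```python
from collections import Counter

# Hypothetical loan-application tuples; None marks a missing value.
records = [
    {"age": 25, "income": 30000, "credit": "fair"},
    {"age": 40, "income": None,  "credit": "excellent"},
    {"age": 33, "income": 78000, "credit": "fair"},
    {"age": 51, "income": 52000, "credit": None},
]

# Data cleaning: replace a missing categorical value with the most
# commonly occurring value for that attribute.
def fill_with_mode(rows, attr):
    mode = Counter(r[attr] for r in rows if r[attr] is not None).most_common(1)[0][0]
    for r in rows:
        if r[attr] is None:
            r[attr] = mode

# Data cleaning: replace a missing continuous value with the mean,
# a simple "most probable value based on statistics".
def fill_with_mean(rows, attr):
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        if r[attr] is None:
            r[attr] = mean

# Data transformation: generalize continuous income to the discrete
# ranges low / medium / high (the cut points are illustrative only).
def discretize_income(value):
    if value < 40000:
        return "low"
    if value < 70000:
        return "medium"
    return "high"

fill_with_mode(records, "credit")
fill_with_mean(records, "income")
for r in records:
    r["income"] = discretize_income(r["income"])
print(records)
```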
2) Comparing Classification Methods
* Predictive accuracy: This refers to the ability of the model to correctly predict the class label of new or previously unseen data.
* Speed: This refers to the computation costs involved in generating and using the model.
* Robustness: This is the ability of the model to make correct predictions given noisy data or data with missing values.
* Scalability: This refers to the ability to construct the model efficiently given large amounts of data.
* Interpretability: This refers to the level of understanding and insight that is provided by the model.

3) Issues in Classification
* Missing Data. Missing data values cause problems during both the training phase and the classification process itself. Missing values in the training data must be handled, since they may produce an inaccurate result. Missing data in a tuple to be classified must be able to be handled by the resulting classification scheme. There are many approaches to handling missing data:
- Ignore the missing data.
- Assume a value for the missing data. This may be determined by using some method to predict what the value could be.
- Assume a special value for the missing data. This means that the value of missing data is taken to be a specific value all of its own.
* Measuring Performance. The table shows two different classification results using two different classification tools; determining which is best depends on the users' interpretation of the problem. The performance of classification algorithms is usually examined by evaluating the accuracy of the classification.

DECISION TREE-BASED ALGORITHMS

The decision tree approach is most useful in classification problems. With this technique, a tree is constructed to model the classification process. Once the tree is built, it is applied to each tuple in the database and results in a classification for that tuple. There are two basic steps in the technique: building the tree and applying the tree to the database. Most research has focused on how to build effective trees, as the application process is straightforward. The approach divides the search space into rectangular regions; a tuple is classified based on the region into which it falls. A definition for a decision tree used in classification is given below. There are alternative definitions; for example, in a binary DT the nodes could be labeled with the predicates themselves and each arc would be labeled with yes or no.

Given a database D = {t1, ..., tn}, where ti = (ti1, ..., tih), and the database schema contains the attributes {A1, A2, ..., Ah}. Also given is a set of classes {C1, ..., Cm}. A decision tree (DT) or classification tree is a tree associated with D that has the following properties:
* Each internal node is labeled with an attribute, Ai.
* Each arc is labeled with a predicate that can be applied to the attribute associated with the parent.
* Each leaf node is labeled with a class, Cj.

Solving the classification problem using decision trees is a two-step process:
1. Decision tree induction: Construct a DT using training data.
2. For each ti ∈ D, apply the DT to determine its class (a sketch of this application step is given below).
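The following minimal Python sketch illustrates the application step under the definition above. The hand-built tree and the attributes income and credit are hypothetical: internal nodes carry an attribute, arcs carry attribute values (equality predicates), and leaves carry class labels.

```python
# A decision tree as nested structures: an internal node is
# ("attribute", {arc_value: subtree, ...}); a leaf is a class label.
# The tree and tuples are hypothetical, for illustration only.
tree = (
    "income",
    {
        "low": "reject",
        "medium": ("credit", {"fair": "reject", "excellent": "approve"}),
        "high": "approve",
    },
)

def classify(node, tuple_x):
    """Walk from the root to a leaf, following the arc whose
    predicate (here, equality on the attribute value) holds."""
    while isinstance(node, tuple):            # internal node
        attribute, branches = node
        node = branches[tuple_x[attribute]]   # follow the matching arc
    return node                               # leaf: the class label Cj

D = [
    {"income": "medium", "credit": "excellent"},
    {"income": "low", "credit": "fair"},
]
for t in D:                # step 2: apply the DT to each tuple ti in D
    print(t, "->", classify(tree, t))
```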
C4.5

The decision tree algorithm C4.5 improves ID3 in the following ways:
* Missing data: When the decision tree is built, missing data are simply ignored. That is, the gain ratio is calculated by looking only at the other records that have a value for that attribute. To classify a record with a missing attribute value, the value for that item can be predicted based on what is known about the attribute values for the other records.
* Continuous data: The basic idea is to divide the data into ranges based on the attribute values for that item that are found in the training sample.
* Pruning: There are two primary pruning strategies proposed in C4.5. With subtree replacement, a subtree is replaced by a leaf node if this replacement results in an error rate close to that of the original tree; subtree replacement works from the bottom of the tree up to the root. With subtree raising, a subtree is raised to replace a node higher in the tree.
* Rules: C4.5 allows classification via either decision trees or rules generated from them. In addition, some techniques to simplify complex rules are proposed. One approach is to replace the left-hand side of a rule by a simpler version if all records in the training set are treated identically. An "otherwise" type of rule can be used to indicate what should be done if no other rules apply.
* Splitting: The ID3 approach favours attributes with many divisions and thus may lead to overfitting. In the extreme, an attribute that has a unique value for each tuple in the training set would be the best, because there would be only one tuple (and thus one class) for each division. An improvement can be made by taking into account the cardinality of each division. This approach uses the GainRatio as opposed to Gain. The GainRatio is defined as

GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Ds|/|D|)

where the split S partitions D into the subsets D1, ..., Ds and H is the entropy function applied to the fractions of tuples falling into each subset.

NAIVE BAYESIAN CLASSIFIER

Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who did early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered "evidence." As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income. Bayes' theorem states that

P(H|X) = P(X|H) P(H) / P(X)

The naive Bayesian classifier works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naive Bayesian classifier predicts that tuple X belongs to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. By Bayes' theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X); since P(X) is constant for all classes, X is assigned to the class Ci for which P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i. A minimal sketch of this decision rule follows.
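In the sketch below, the training tuples are hypothetical. The class priors P(Ci) and the conditional probabilities P(xk|Ci) are estimated from counts, and P(X|Ci) is taken as the product of the P(xk|Ci), i.e. the naive assumption that attributes are conditionally independent given the class:

```python
from collections import Counter, defaultdict

# Hypothetical training set D: (attribute vector X, class label).
D = [
    ({"age": "youth",  "income": "high"},   "no"),
    ({"age": "youth",  "income": "medium"}, "no"),
    ({"age": "middle", "income": "high"},   "yes"),
    ({"age": "senior", "income": "medium"}, "yes"),
    ({"age": "senior", "income": "low"},    "yes"),
]

priors = Counter(label for _, label in D)   # counts for P(Ci)
cond = defaultdict(Counter)                 # counts for P(xk | Ci)
for x, label in D:
    for attr, value in x.items():
        cond[(label, attr)][value] += 1

def predict(x):
    """Return the class Ci maximizing P(X|Ci) P(Ci); P(X) is constant
    across classes, so it is omitted from the comparison."""
    best_class, best_score = None, -1.0
    for ci, count in priors.items():
        score = count / len(D)              # P(Ci)
        for attr, value in x.items():       # P(X|Ci) = prod of P(xk|Ci)
            score *= cond[(ci, attr)][value] / count
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

print(predict({"age": "senior", "income": "high"}))  # -> "yes"
```

In practice, a Laplacian correction (adding one to each count) is commonly applied so that an attribute value that never co-occurs with a class does not force the whole product P(X|Ci) to zero.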
