CS402 Mod 3
INTRODUCTION TO CLASSIFICATION

Classification and prediction are two forms of data analysis that extract models describing important data classes or predict future data trends. Such models, called classifiers, predict categorical (discrete, unordered) class labels. Whereas classification predicts categorical labels, prediction models continuous-valued functions.

CLASSIFICATION

Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data). The process is shown for the loan application data of the figure. In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or "learning from" a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute.

PREDICTION

Prediction can be viewed as the construction and use of a model to assess the class of an unlabelled sample, or to assess the value or value ranges of an attribute that a given sample is likely to have. In this view, classification and regression are the two major types of prediction problems: classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values. Conventionally, however, we refer to the use of prediction to predict class labels as classification, and to the use of prediction to predict continuous values (using regression techniques) as prediction. This is the most accepted view in data mining.

ISSUES REGARDING CLASSIFICATION AND PREDICTION

1) Preparing data for classification and prediction
* Data Cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g. by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
* Relevance Analysis: Many of the attributes in the data may be irrelevant to the classification or prediction task. Relevance analysis may be performed on the data with the aim of removing any irrelevant or redundant attributes from the learning process. In machine learning, this step is known as feature selection.
* Data Transformation: The data can be generalized to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income may be generalized to discrete ranges such as low, medium and high. Since generalization compresses the original training data, fewer input/output operations may be involved during learning (a minimal sketch of these preparation steps follows this list).
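As a minimal sketch of these preparation steps, the following Python snippet fills missing values and generalizes income to low/medium/high. The loan-application records and the cut points for the income ranges are hypothetical, chosen only for illustration:

```python
from collections import Counter

# Hypothetical loan-application tuples; None marks a missing value.
records = [
    {"age": 25, "income": 30000, "credit": "fair"},
    {"age": 40, "income": None,  "credit": "excellent"},
    {"age": 33, "income": 78000, "credit": "fair"},
    {"age": 51, "income": 52000, "credit": None},
]

# Data cleaning: replace a missing categorical value with the most
# commonly occurring value for that attribute.
def fill_with_mode(rows, attr):
    mode = Counter(r[attr] for r in rows if r[attr] is not None).most_common(1)[0][0]
    for r in rows:
        if r[attr] is None:
            r[attr] = mode

# Data cleaning: replace a missing continuous value with the mean,
# a simple "most probable value based on statistics".
def fill_with_mean(rows, attr):
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        if r[attr] is None:
            r[attr] = mean

# Data transformation: generalize continuous income to the discrete
# ranges low / medium / high (the cut points are illustrative only).
def discretize_income(value):
    if value < 40000:
        return "low"
    if value < 70000:
        return "medium"
    return "high"

fill_with_mode(records, "credit")
fill_with_mean(records, "income")
for r in records:
    r["income"] = discretize_income(r["income"])
print(records)
```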
2) Comparing Classification Methods
* Predictive accuracy: This refers to the ability of the model to correctly predict the class label of new or previously unseen data.
* Speed: This refers to the computation costs involved in generating and using the model.
* Robustness: This is the ability of the model to make correct predictions given noisy data or data with missing values.
* Scalability: This refers to the ability to construct the model efficiently given large amounts of data.
* Interpretability: This refers to the level of understanding and insight that is provided by the model.

3) Issues in Classification
* Missing Data. Missing data values cause problems during both the training phase and the classification process itself. Missing values in the training data must be handled, since they may produce an inaccurate result. Missing data in a tuple to be classified must be able to be handled by the resulting classification scheme. There are many approaches to handling missing data:
- Ignore the missing data.
- Assume a value for the missing data. This may be determined by using some method to predict what the value could be.
- Assume a special value for the missing data. This means that the value of missing data is taken to be a specific value all of its own.
* Measuring Performance. The table shows two different classification results using two different classification tools; determining which is best depends on the users' interpretation of the problem. The performance of classification algorithms is usually examined by evaluating the accuracy of the classification.

DECISION TREE-BASED ALGORITHMS

The decision tree approach is most useful in classification problems. With this technique, a tree is constructed to model the classification process. Once the tree is built, it is applied to each tuple in the database and results in a classification for that tuple. There are two basic steps in the technique: building the tree and applying the tree to the database. Most research has focused on how to build effective trees, as the application process is straightforward. The approach divides the search space into rectangular regions; a tuple is classified based on the region into which it falls. A definition for a decision tree used in classification is given below. There are alternative definitions; for example, in a binary DT the nodes could be labeled with the predicates themselves and each arc would be labeled with yes or no.

Given a database D = {t1, ..., tn}, where ti = (ti1, ..., tih), and the database schema contains the attributes {A1, A2, ..., Ah}. Also given is a set of classes {C1, ..., Cm}. A decision tree (DT) or classification tree is a tree associated with D that has the following properties:
* Each internal node is labeled with an attribute, Ai.
* Each arc is labeled with a predicate that can be applied to the attribute associated with the parent.
* Each leaf node is labeled with a class, Cj.

Solving the classification problem using decision trees is a two-step process:
1. Decision tree induction: Construct a DT using training data.
2. For each ti ∈ D, apply the DT to determine its class (a sketch of this application step is given below).
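The following minimal Python sketch illustrates the application step under the definition above. The hand-built tree and the attributes income and credit are hypothetical: internal nodes carry an attribute, arcs carry attribute values (equality predicates), and leaves carry class labels.

```python
# A decision tree as nested structures: an internal node is
# ("attribute", {arc_value: subtree, ...}); a leaf is a class label.
# The tree and tuples are hypothetical, for illustration only.
tree = (
    "income",
    {
        "low": "reject",
        "medium": ("credit", {"fair": "reject", "excellent": "approve"}),
        "high": "approve",
    },
)

def classify(node, tuple_x):
    """Walk from the root to a leaf, following the arc whose
    predicate (here, equality on the attribute value) holds."""
    while isinstance(node, tuple):            # internal node
        attribute, branches = node
        node = branches[tuple_x[attribute]]   # follow the matching arc
    return node                               # leaf: the class label Cj

D = [
    {"income": "medium", "credit": "excellent"},
    {"income": "low", "credit": "fair"},
]
for t in D:                # step 2: apply the DT to each tuple ti in D
    print(t, "->", classify(tree, t))
```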
C4.5

The decision tree algorithm C4.5 improves ID3 in the following ways:
* Missing data: When the decision tree is built, missing data are simply ignored. That is, the gain ratio is calculated by looking only at the other records that have a value for that attribute. To classify a record with a missing attribute value, the value for that item can be predicted based on what is known about the attribute values for the other records.
* Continuous data: The basic idea is to divide the data into ranges based on the attribute values for that item that are found in the training sample.
* Pruning: There are two primary pruning strategies proposed in C4.5. With subtree replacement, a subtree is replaced by a leaf node if this replacement results in an error rate close to that of the original tree; subtree replacement works from the bottom of the tree up to the root. With subtree raising, a subtree is raised to replace a node higher in the tree.
* Rules: C4.5 allows classification via either decision trees or rules generated from them. In addition, some techniques to simplify complex rules are proposed. One approach is to replace the left-hand side of a rule by a simpler version if all records in the training set are treated identically. An "otherwise" type of rule can be used to indicate what should be done if no other rules apply.
* Splitting: The ID3 approach favours attributes with many divisions and thus may lead to overfitting. In the extreme, an attribute that has a unique value for each tuple in the training set would be the best, because there would be only one tuple (and thus one class) for each division. An improvement can be made by taking into account the cardinality of each division. This approach uses the GainRatio as opposed to Gain. The GainRatio is defined as

GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, ..., |Ds|/|D|)

where the split S partitions D into the subsets D1, ..., Ds and H is the entropy function applied to the fractions of tuples falling into each subset.

NAIVE BAYESIAN CLASSIFIER

Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who did early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered "evidence." As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income. Bayes' theorem states that

P(H|X) = P(X|H) P(H) / P(X)

The naive Bayesian classifier works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naive Bayesian classifier predicts that tuple X belongs to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. By Bayes' theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X); since P(X) is constant for all classes, X is assigned to the class Ci for which P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i. A minimal sketch of this decision rule follows.
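In the sketch below, the training tuples are hypothetical. The class priors P(Ci) and the conditional probabilities P(xk|Ci) are estimated from counts, and P(X|Ci) is taken as the product of the P(xk|Ci), i.e. the naive assumption that attributes are conditionally independent given the class:

```python
from collections import Counter, defaultdict

# Hypothetical training set D: (attribute vector X, class label).
D = [
    ({"age": "youth",  "income": "high"},   "no"),
    ({"age": "youth",  "income": "medium"}, "no"),
    ({"age": "middle", "income": "high"},   "yes"),
    ({"age": "senior", "income": "medium"}, "yes"),
    ({"age": "senior", "income": "low"},    "yes"),
]

priors = Counter(label for _, label in D)   # counts for P(Ci)
cond = defaultdict(Counter)                 # counts for P(xk | Ci)
for x, label in D:
    for attr, value in x.items():
        cond[(label, attr)][value] += 1

def predict(x):
    """Return the class Ci maximizing P(X|Ci) P(Ci); P(X) is constant
    across classes, so it is omitted from the comparison."""
    best_class, best_score = None, -1.0
    for ci, count in priors.items():
        score = count / len(D)              # P(Ci)
        for attr, value in x.items():       # P(X|Ci) = prod of P(xk|Ci)
            score *= cond[(ci, attr)][value] / count
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

print(predict({"age": "senior", "income": "high"}))  # -> "yes"
```

In practice, a Laplacian correction (adding one to each count) is commonly applied so that an attribute value that never co-occurs with a class does not force the whole product P(X|Ci) to zero.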
