
Program Name: B.C.A                                  Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2                                  No. of Credits: 03
Contact Hours: 42 Hours                              Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40                       Summative Assessment Marks: 60

Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1 Understand the concepts of data and pre-processing of data.
CO2 Know simple pattern recognition methods.
CO3 Understand the basic concepts of clustering and classification.
CO4 Know the recent trends in Data Science.

Contents (42 Hrs)
Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) vs Data Mining, DBMS vs Data Mining, DM Techniques, Problems, Issues and Challenges in DM, DM Applications. (8 Hrs)
Unit II: Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization. (8 Hrs)
Unit III: Mining Frequent Patterns: Basic Concept - Frequent Item Set Mining Methods - Apriori and Frequent Pattern Growth (FP-Growth) Algorithms - Mining Association Rules. (8 Hrs)
Unit IV: Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k-Nearest Neighbor. Prediction - Accuracy - Precision and Recall. (10 Hrs)
Unit V: Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)


Unit 4

Topics:

Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k-Nearest Neighbor. Prediction - Accuracy - Precision and Recall.

Classification Basic Concepts

Def: Classification is a supervised machine learning method in which the model tries to predict the correct class label for given input data.

Classification is a two-step process:

1. Learning/Training Step:

Here a model is constructed for classification. A classifier model is built by analyzing training data whose class labels are already known. Because the class label of each training tuple is provided, this step is also known as supervised learning. This step can also be viewed as learning a function, y = f(X), that predicts the class label y of a given tuple X from its attribute values. The learned mapping is represented in the form of classification rules, decision trees, or mathematical formulae.

2. Classification/Testing Step:

Here the model that is constructed in the learning step is used to predict class labels for given
data.

The data used in learning step is called “Training data”.

The data used in classification step is called “Testing data”.

Ex: A bank loan officer needs to classify loan applicants as safe or risky (see Figure 8.1).

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.
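
The two-step process can be illustrated with a small, hypothetical loan-screening dataset. The sketch below is only an illustration: the use of scikit-learn, the toy feature values, and the choice of a decision tree as the classifier are assumptions made for this example, not part of the course material.

    # Minimal sketch of the two-step classification process (assumed: scikit-learn, toy loan data).
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Hypothetical tuples: [income (thousands), existing_debt (thousands)]
    X = [[60, 5], [25, 20], [80, 2], [30, 25], [90, 10], [20, 30], [70, 8], [35, 28]]
    y = ["safe", "risky", "safe", "risky", "safe", "risky", "safe", "risky"]

    # Training data is used in the learning step, testing data in the classification step.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

    model = DecisionTreeClassifier()        # Step 1: learning/training
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)          # Step 2: classification/testing
    print("Predicted labels:", y_pred)
    print("Accuracy on test set:", accuracy_score(y_test, y_pred))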


Decision Tree Induction:

History:

• Developed by J. Ross Quinlan in the 1980s
• Named ID3 (Iterative Dichotomiser)
• A later variant, C4.5, is the successor of ID3
• Related work: CART (Classification and Regression Trees)
• These algorithms adopt a greedy approach in which decision trees are constructed in a top-down, recursive, divide-and-conquer manner


Decision Tree Induction is the learning of decision trees from class-labelled training tuples. A decision tree is a tree structure in which each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The attribute values of a tuple X are tested against the decision tree; a path is traced from the root to a leaf to predict the class label.

Advantages:

• DT can be easily converted into classification rules (IF-THEN)
• DT does not require any domain knowledge and is hence easy to construct and interpret
• DT can handle multidimensional data
• DT is simple and fast

Applications:

Medicine, manufacturing and production, finance, astronomy, molecular biology.

Algorithm: Decision Tree Generation

Input: D = a set of training tuples with associated class labels.

Attribute_list = the set of candidate attributes of the tuples.

Attribute_selection_method = a procedure to determine the splitting criterion that best partitions the data tuples into individual classes.

Output: A Decision Tree


Method:

Create a node N;
If the tuples in D are all of the same class C, then
    return N as a leaf node labeled with the class C;
If attribute_list is empty, then
    return N as a leaf node labeled with the majority class in D;
Apply Attribute_selection_method(D, attribute_list) to find the best splitting criterion;
Label node N with the splitting criterion;
If the splitting attribute is discrete-valued and multiway splits are allowed, then
    attribute_list <- attribute_list - splitting attribute;   // remove the splitting attribute
For each outcome j of the splitting criterion              // partition the tuples and grow subtrees
    Let Dj be the set of data tuples in D satisfying outcome j;
    If Dj is empty, then
        attach a leaf labeled with the majority class in D to node N;
    Else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
End for
Return N;
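
A compact Python sketch of the algorithm above is given below, for illustration only. The representation of tuples as dictionaries, the node structure, and the helper names are assumptions; the attribute-selection function is passed in as a parameter, just as Attribute_selection_method is in the pseudocode (an information-gain based choice is sketched under Attribute Selection Measures below).

    # Sketch: top-down, recursive decision tree generation for discrete-valued
    # attributes with multiway splits. D is a list of (tuple_dict, class_label) pairs.
    from collections import Counter

    def majority_class(D):
        # Most frequent class label among the tuples in D
        return Counter(label for _, label in D).most_common(1)[0][0]

    def generate_decision_tree(D, attribute_list, select_attribute):
        labels = {label for _, label in D}
        if len(labels) == 1:                        # all tuples are of the same class C
            return {"leaf": labels.pop()}
        if not attribute_list:                      # attribute list is empty
            return {"leaf": majority_class(D)}
        A = select_attribute(D, attribute_list)     # best splitting criterion
        node = {"attribute": A, "branches": {}}
        remaining = [a for a in attribute_list if a != A]   # remove the splitting attribute
        for value in {x[A] for x, _ in D}:          # one branch per outcome j
            # Outcomes are taken from D itself, so the "Dj is empty" case of the
            # pseudocode cannot arise in this simplified sketch.
            Dj = [(x, label) for x, label in D if x[A] == value]
            node["branches"][value] = generate_decision_tree(Dj, remaining, select_attribute)
        return node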

Splitting Scenarios:

The splitting criterion at a node can take three forms: (a) a multiway split on a discrete-valued attribute A, with one branch for each known value of A; (b) a binary split on a continuous-valued attribute, of the form A <= split_point and A > split_point; and (c) a binary split on a discrete-valued attribute, testing whether A belongs to some subset SA of A's known values.


Attribute Selection Measures:

It is a measure for selecting the splitting criterion that 'best' separates a given data partition D of class-labeled training tuples into individual classes. Such measures are also called splitting rules.

The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.

Popular Attribute Selection Measures:

1. Information gain
2. Gain ratio
3. Gini index

Information Gain:

Based on the work of Claude Shannon on information theory. Information gain is defined as the difference between the original information requirement (based only on the class proportions) and the new requirement (obtained after partitioning on an attribute A). The attribute with the highest information gain is chosen as the splitting attribute for node N. It is used in ID3.


Gain(A) = Info(D) - InfoA(D)

Where,

    Info(D) = -Σ (i=1 to m) pi log2(pi)

    m  = number of distinct classes
    pi = probability that an arbitrary tuple in D belongs to class Ci, estimated as |Ci,D| / |D|

Info(D) is also known as the entropy of D.

The expected information still needed to arrive at an exact classification after partitioning D on attribute A (into v partitions D1, ..., Dv) is:

    InfoA(D) = Σ (j=1 to v) (|Dj| / |D|) * Info(Dj)

Refer to Example 8.1 (textbook worked example).
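
The following sketch computes Info(D) and Gain(A) for a small categorical partition. The representation of D as (tuple, label) pairs and the tiny toy data are assumptions carried over from the earlier sketch.

    # Sketch: entropy (Info) and information gain for a categorical attribute A.
    import math
    from collections import Counter

    def info(D):
        """Info(D) = -sum_i p_i * log2(p_i), the entropy of the class distribution in D."""
        counts = Counter(label for _, label in D)
        total = len(D)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def info_gain(D, A):
        """Gain(A) = Info(D) - Info_A(D)."""
        total = len(D)
        partitions = {}
        for x, label in D:
            partitions.setdefault(x[A], []).append((x, label))
        info_A = sum((len(Dj) / total) * info(Dj) for Dj in partitions.values())
        return info(D) - info_A

    # Tiny hypothetical illustration:
    D = [({"age": "youth"}, "no"), ({"age": "youth"}, "no"),
         ({"age": "senior"}, "yes"), ({"age": "senior"}, "yes")]
    print(info(D))              # 1.0 (two equally likely classes)
    print(info_gain(D, "age"))  # 1.0 (age separates the classes perfectly)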


Gain Ratio:

The information gain measure is biased toward tests with many outcomes (many partitions). That is, it prefers to select attributes having a large number of values. For example, consider an attribute that acts as a unique identifier, such as product ID. A split on product ID would result in a large number of partitions (as many as there are values), each one containing just one tuple. Gain ratio overcomes this bias. The attribute with the maximum gain ratio is selected as the splitting attribute. It is used in C4.5.

Gain ratio is given by the following formula:

    GainRatio(A) = Gain(A) / SplitInfoA(D)

It applies a kind of normalization to information gain, using a "split information" value defined as

    SplitInfoA(D) = -Σ (j=1 to v) (|Dj| / |D|) * log2(|Dj| / |D|)

    v = number of partitions produced by the split on A
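
A short, self-contained sketch of SplitInfo and the gain ratio follows. It takes the partition sizes and a Gain(A) value as inputs; the numbers used below are assumed values for demonstration only.

    # Sketch: split information and gain ratio, given the sizes of the partitions
    # produced by splitting on attribute A and the information gain Gain(A).
    import math

    def split_info(partition_sizes):
        """SplitInfo_A(D) = -sum_j (|Dj|/|D|) * log2(|Dj|/|D|)."""
        total = sum(partition_sizes)
        return -sum((n / total) * math.log2(n / total) for n in partition_sizes if n > 0)

    def gain_ratio(gain_A, partition_sizes):
        """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
        return gain_A / split_info(partition_sizes)

    # Hypothetical illustration: a split into partitions of sizes 4, 6 and 4,
    # with Gain(A) = 0.246 (an assumed value).
    print(split_info([4, 6, 4]))            # about 1.557
    print(gain_ratio(0.246, [4, 6, 4]))     # about 0.158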


Gini Index:

It measures the impurity/uncertainty/randomness of a data set D. It is used in CART. It is given by the following formula:

    Gini(D) = 1 - Σ (i=1 to m) pi^2

    pi = probability that a tuple in D belongs to class Ci, estimated as |Ci,D| / |D|; m = number of classes.
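
A corresponding sketch for the Gini index, using the same (tuple, label) representation of D assumed earlier and hypothetical class counts:

    # Sketch: Gini index of a data partition D, where D is a list of (tuple, label) pairs.
    from collections import Counter

    def gini(D):
        """Gini(D) = 1 - sum_i p_i^2."""
        counts = Counter(label for _, label in D)
        total = len(D)
        return 1 - sum((c / total) ** 2 for c in counts.values())

    # Illustration: 9 tuples of one class and 5 of another (hypothetical counts).
    D = [({}, "yes")] * 9 + [({}, "no")] * 5
    print(gini(D))   # 1 - (9/14)^2 - (5/14)^2 = about 0.459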

Tree Pruning:

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning removes the branches that are not relevant. Pruned trees are smaller and less complex, and thus easier to understand; they also classify faster than unpruned trees.

There are two common approaches to tree pruning:

• Prepruning: a branch is not split further into sub-branches; the split is halted early by deciding, using statistical measures such as information gain or the Gini index, that further splitting is not worthwhile, and the node becomes a leaf.
• Postpruning: branches of the fully grown tree are cut and replaced with leaf nodes. Each leaf is labeled with the most frequent class among the subtree being replaced.


Pruning Algorithms:

• Cost complexity pruning, used in CART (see the sketch after this list).
• Pessimistic pruning, used in C4.5.
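
The sketch below illustrates cost-complexity (post-)pruning using scikit-learn's ccp_alpha parameter. The library choice, the synthetic data, and the way the alpha value is picked are assumptions made only for illustration.

    # Sketch: cost-complexity pruning with scikit-learn's DecisionTreeClassifier.
    # Larger ccp_alpha values prune more aggressively, giving a smaller tree.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Candidate alpha values computed from the training data; pick one and refit.
    alphas = unpruned.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alphas[len(alphas) // 2])
    pruned.fit(X_train, y_train)

    print("Unpruned leaves:", unpruned.get_n_leaves(), "accuracy:", unpruned.score(X_test, y_test))
    print("Pruned leaves:  ", pruned.get_n_leaves(), "accuracy:", pruned.score(X_test, y_test))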

Disadvantages of decision trees:

▪ Suffers from repetition (the same attribute is tested repeatedly along a branch) and replication (duplicate subtrees exist within the tree).



▪ The use of multivariate splits reduces these problems.

Bayes Classification Method

▪ Bayes classifiers are statistical classifiers.
▪ They are based on Bayes' theorem.
▪ They predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
▪ A simple Bayes classifier is known as the Naive Bayes classifier.
▪ The Naive Bayes classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence.

Example of class-conditional independence:
A fruit may be considered to be an apple if it is red, round and 3" in diameter. A Naive Bayes classifier considers each of these features to contribute independently to the

probability that this fruit is an apple, regardless of any possible correlations between the
color, roundness and diameter.

Review of Bayes Theorem:

Named after Thomas Bayes, who worked in the 18th century.

Bayes' theorem is:

    P(H|X) = P(X|H) P(H) / P(X)

H - a hypothesis, such as "the tuple X belongs to a specified class C"

X - a data tuple (measurements on n attributes)

P(H|X) - the posterior probability of H conditioned on X, i.e. the probability that the hypothesis H holds given the observed tuple X.

Ex: the probability that a customer X will buy a computer, given that we know the age and income of the customer.

P(X|H) - the posterior probability of X conditioned on H (the likelihood), i.e. the probability of observing the tuple X given that the hypothesis H holds.

Ex: the probability that a customer X has the observed age and income (say, Rs. 40,000), given that we know the customer will buy a computer.

P(H) - the prior probability of H.

Ex: the probability that any given customer will buy a computer, regardless of the measurements on the attributes.

Naïve Bayesian Classification:

The Naive Bayes classifier works as follows:

1) Let D be a training set of tuples with their associated class labels. Each tuple X is represented by measurements on n attributes.
2) Suppose that there are m classes C1, C2, ..., Cm. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X, i.e.,

    P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)


3) Since P(X) is constant for all classes, the denominator need not be considered; only P(X|Ci) P(Ci) needs to be computed for each class.
4) To predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci, and X is assigned the class for which this product is maximum.

Example: Classifying whether we can play, based on the weather.

Can we play on a sunny day?

Dataset (training):

Weather     Play
Sunny       Yes
Overcast    No
Rainy       No
Sunny       No
Overcast    Yes
Rainy       No
Sunny       Yes
Overcast    No

We check whether P(Yes|Sunny) > P(No|Sunny).

From the training data: P(Sunny) = 3/8, P(Yes) = 3/8, P(No) = 5/8, P(Sunny|Yes) = 2/3, P(Sunny|No) = 1/5.

    P(Yes|Sunny) = P(Sunny|Yes) P(Yes) / P(Sunny) = [(2/3) * (3/8)] / (3/8) = 0.67

    P(No|Sunny)  = P(Sunny|No) P(No) / P(Sunny)   = [(1/5) * (5/8)] / (3/8) = 0.33

Since P(Yes|Sunny) > P(No|Sunny), based on Bayes classification we can play on a sunny day.
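
The same calculation can be reproduced in a few lines of Python by direct frequency counting on the training table above; this is a minimal sketch (no library assumed), not part of the original notes.

    # Sketch: Naive Bayes on the weather/play table, by direct frequency counting.
    data = [("Sunny", "Yes"), ("Overcast", "No"), ("Rainy", "No"), ("Sunny", "No"),
            ("Overcast", "Yes"), ("Rainy", "No"), ("Sunny", "Yes"), ("Overcast", "No")]

    def posterior(play, weather):
        """P(play | weather) = P(weather | play) * P(play) / P(weather)."""
        n = len(data)
        n_play = sum(1 for w, p in data if p == play)
        n_weather = sum(1 for w, p in data if w == weather)
        n_both = sum(1 for w, p in data if w == weather and p == play)
        return (n_both / n_play) * (n_play / n) / (n_weather / n)

    print("P(Yes | Sunny) =", round(posterior("Yes", "Sunny"), 2))   # 0.67
    print("P(No  | Sunny) =", round(posterior("No", "Sunny"), 2))    # 0.33
    # The larger posterior wins, so the prediction for a sunny day is "Yes" (we can play).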

Advantages
- Easy to implement
- Good results are obtained in most cases
Disadvantages
- The assumption of class-conditional independence may cause a loss of accuracy
- In practice, dependencies exist among variables
- E.g., hospital patient data: profile (age, family history, etc.),
  symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)


- Dependencies among these cannot be modeled by a Naive Bayes classifier

Rule Based Classifiers

Rule-based classifiers are used for classification by defining a set of rules that can be used to assign
class labels to new instances of data based on their attribute values. These rules can be created
using expert knowledge of the domain, or they can be learned automatically from a set of labeled
training data. A rule-based classifier uses a set of IF-THEN rules for classification.

An IF-THEN rule is an expression of the form:

IF condition THEN conclusion.

An example is rule R1,

R1: IF age == youth AND student == yes THEN buys computer == yes.

The “IF” part (or left side) of a rule is known as the rule antecedent or precondition.

The “THEN” part (or right side) is the rule consequent.

In the rule antecedent, the condition consists of one or more attribute tests (e.g., age == youth and student == yes) that are logically ANDed.

The rule’s consequent contains a class prediction (in this case, we are predicting whether a
customer will buy a computer).

R1 can also be written as R1: (age = youth) ^ (student = yes) => (buys_computer = yes).

If the condition (i.e., all the attribute tests) in a rule antecedent holds true for a given tuple, we say
that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the
tuple.

If more than one rule applies to a given tuple, a conflict arises as to which rule should be selected. Conflict-resolution strategies such as rule ordering and size ordering must be applied to break the tie.

In rule-based ordering, the rules are organized into a priority list, according to some measure of rule quality such as accuracy, coverage, or size (number of attribute tests in the rule antecedent), or based on advice from domain experts. When rule ordering is used, the rule set is known as a decision list. The class is predicted for the tuple by the highest-priority rule that it satisfies, and any other rule that satisfies the tuple is ignored.

In size-based ordering, the rule with the toughest requirements is chosen to break the tie, where "toughest" means the largest number of attribute tests satisfied in the antecedent part of the rule.


Coverage and Accuracy: measures used to evaluate the quality of rules.

In rule-based classification, coverage is the percentage of records that satisfy the antecedent conditions of a rule:

    Coverage(R) = n1 / n

    where n1 = number of tuples that satisfy the antecedent of R, and n = number of training tuples.

Accuracy is the percentage of records that satisfy the antecedent conditions and also match the consequent value of a rule:

    Accuracy(R) = n2 / n1

    where n2 = number of tuples that satisfy both the antecedent and the consequent of R.
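
A small sketch of both measures, applying the earlier rule R1 to a few hypothetical class-labeled tuples (the tuple values are assumptions made for illustration):

    # Sketch: coverage and accuracy of rule R1:
    #   IF age = youth AND student = yes THEN buys_computer = yes
    D = [  # hypothetical class-labeled tuples
        {"age": "youth",       "student": "yes", "buys_computer": "yes"},
        {"age": "youth",       "student": "yes", "buys_computer": "no"},
        {"age": "youth",       "student": "no",  "buys_computer": "no"},
        {"age": "middle_aged", "student": "yes", "buys_computer": "yes"},
        {"age": "senior",      "student": "no",  "buys_computer": "no"},
    ]

    covered = [t for t in D if t["age"] == "youth" and t["student"] == "yes"]   # antecedent holds
    correct = [t for t in covered if t["buys_computer"] == "yes"]               # consequent also holds

    coverage = len(covered) / len(D)         # n1 / n
    accuracy = len(correct) / len(covered)   # n2 / n1
    print("coverage(R1) =", coverage)        # 2/5 = 0.4
    print("accuracy(R1) =", accuracy)        # 1/2 = 0.5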

Rule Extraction from a Decision Tree

Rules can be generated from a decision tree by creating one rule for each path from the root to a leaf. The attribute tests at the internal nodes along the path become the antecedent (combined with logical AND), and the class label at the leaf node becomes the consequent.

E.g., the path (age = youth) -> (student = yes) -> (buys_computer = yes) gives rule R1 shown earlier.

Rule Induction Using a Sequential Covering Algorithm


Sequential covering is a popular approach for rule-based classification. IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm.

There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more recent RIPPER. The general strategy is as follows: rules are learned one at a time; each time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples.

Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.

Input:
D: a data set of class-labeled tuples;
Att_vals: the set of all attributes and their possible values.


Output: A set of IF-THEN rules.


Method:
(1) Rule_set = {}; // initial set of rules learned is empty
(2) for each class c do
(3)     repeat
(4)         Rule = Learn_One_Rule(D, Att_vals, c);
(5)         remove tuples covered by Rule from D;
(6)         Rule_set = Rule_set + Rule; // add new rule to rule set
(7)     until terminating condition;
(8) endfor
(9) return Rule_set;
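
A highly simplified sketch of this strategy is given below. For brevity, the Learn_One_Rule step is reduced to choosing the single best attribute-value test for the target class (real systems such as RIPPER grow multi-conjunct rules); the data representation, helper names, and the accuracy threshold are assumptions.

    # Simplified sketch of sequential covering: learn one single-test rule at a
    # time for class c, remove the tuples it covers, and repeat.
    def learn_one_rule(D, att_vals, c):
        """Return the (attribute, value) test whose covered tuples are most often of class c."""
        best, best_acc = None, 0.0
        for att, values in att_vals.items():
            for v in values:
                covered = [t for t in D if t[att] == v]
                if not covered:
                    continue
                acc = sum(1 for t in covered if t["class"] == c) / len(covered)
                if acc > best_acc:
                    best, best_acc = (att, v), acc
        return best, best_acc

    def sequential_covering(D, att_vals, c, min_accuracy=0.9):
        rule_set = []
        D = list(D)
        while True:                                     # repeat until terminating condition
            rule, acc = learn_one_rule(D, att_vals, c)
            if rule is None or acc < min_accuracy:
                break
            rule_set.append((rule, c))                  # add new rule to rule set
            att, v = rule
            D = [t for t in D if t[att] != v]           # remove tuples covered by the rule
        return rule_set

    # Example call (hypothetical data):
    # rules = sequential_covering(tuples, {"age": ["youth", "senior"], "student": ["yes", "no"]}, c="yes")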

Rule Pruning
Rules are pruned for the following reason:

The assessment of rule quality is made on the original set of training data. A rule may perform well on the training data but less well on subsequent (unseen) data; that is why rule pruning is required.

A rule is pruned by removing a conjunct (attribute test). A rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples (the pruning set).

FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

    FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R, respectively.

Note: this value increases with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
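For example (hypothetical counts): if a rule R covers 30 positive and 10 negative tuples on the pruning set, FOIL_Prune(R) = (30 - 10) / (30 + 10) = 0.5; if removing one conjunct raises this value, the shorter rule is kept.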

Advantages of Rule-Based Classifiers


• Has characteristics quite similar to decision trees
• As highly expressive as decision trees
• Easy to interpret
• Fast; performance comparable to decision trees
• Can handle redundant attributes

Lazy Learners: K-Nearest Neighbor Classifier Algorithm

The classification methods discussed so far, such as decision tree induction, Bayesian classification, rule-based classification, etc., are all examples of eager learners. Eager learners employ a two-step approach to classification: in the first step they build a classifier model by learning from the training set, and in the second step they use the model to classify unknown tuples and determine their class.


Lazy learning algorithms simply store the training examples and wait until they encounter a new tuple (from the testing dataset); only then do they compare the stored training examples with the new tuple to make a prediction. This type of learning is useful when working with large datasets that have few attributes. Lazy learning is also known as instance-based or memory-based learning.

Examples of lazy learners: K-Nearest Neighbor Classifier, case-based reasoning classifiers.

Advantages of Lazy learners:


• can adapt quickly to new or changing data.
• less affected by outliers compared to eager learning methods.
• Handles complex data distributions and nonlinear relationships

Disadvantages of lazy learners:

• Computationally expensive
• Requires more memory, as the training data must be kept and is processed only at the classification stage

K-Nearest Neighbor Classifier Algorithm

• The k-nearest-neighbour method was first described in the early 1950s.
• It is widely used in the area of pattern recognition.
• The k-Nearest Neighbors (KNN) algorithm is a popular machine learning technique used for classification and regression tasks. It relies on the idea that similar data points tend to have similar labels or values.

The steps for the KNN algorithm are:

1. Assign a value to k.
2. Calculate the distance (e.g., Euclidean distance) between the new data entry and all existing data entries.
3. Arrange the distances in ascending order.
4. Determine the k closest records of the training data set for the new record.
5. Take the majority vote among these k records to classify the data point.

The Euclidean distance between two points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

    dist(X1, X2) = sqrt( Σ (i=1 to n) (x1i - x2i)^2 )
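
A minimal brute-force sketch of the algorithm on hypothetical 2-D points follows (no library assumed; the data and labels are made up for illustration):

    # Sketch: k-nearest-neighbor classification with Euclidean distance and majority vote.
    import math
    from collections import Counter

    def euclidean(x1, x2):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

    def knn_predict(training, query, k=3):
        """training: list of (point, label) pairs; query: point to classify."""
        # Steps 1-2: compute the distance from the query to every training tuple
        distances = [(euclidean(point, query), label) for point, label in training]
        # Steps 3-4: sort by distance and keep the k closest records
        nearest = sorted(distances)[:k]
        # Step 5: majority vote among the k nearest labels
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    # Hypothetical training points: (height_cm, weight_kg) -> size label
    training = [((158, 58), "M"), ((160, 60), "M"), ((163, 61), "M"),
                ((168, 66), "L"), ((170, 68), "L"), ((173, 70), "L")]
    print(knn_predict(training, (161, 61), k=3))   # "M"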



Advantages

• Simple and efficient


• Easy to implement
• No assumptions about the data distribution are required
• No training time required

Disadvantages:

• Computationally expensive
• Accuracy reduces if there is noise in the dataset
• Requires large memory
• The value of k (the number of neighbors) must be chosen carefully

Metrics for Evaluating Classifier Performance (Precision and Recall)

There are four terms we need to know that are the “building blocks” used in computing many
evaluation measures. Understanding them will make it easy to grasp the meaning of the various
measures.

True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier.
E.g., a person with COVID-19 correctly labelled as COVID-19 positive.

True negatives (TN): These are the negative tuples that were correctly labeled by the classifier.

E.g., a person without the COVID-19 virus correctly labelled as COVID-19 negative.

False positives (FP): These are the negative tuples that were incorrectly labeled as positive.


E.g., a person without the COVID-19 virus incorrectly labelled as COVID-19 positive.

False negatives (FN): These are the positive tuples that were mislabeled as negative.

E.g., a person with the COVID-19 virus incorrectly labelled as COVID-19 negative.

Precision and Recall:

Precision and recall are metrics used to evaluate the performance of classification models in machine learning. Precision is the percentage of tuples predicted as positive that are actually positive, i.e. precision = TP / (TP + FP). Recall is the percentage of actual positive tuples that are correctly identified, i.e. recall = TP / (TP + FN).

Here's an example of precision and recall:

• Suppose the test set contains 20 actual positive items and the model labels 10 items as positive; of these predictions, 6 are correct and 4 are wrong.
• The precision is 6/10, i.e. 60%, while the recall is 6/20, i.e. 30%.
Accuracy:

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. In the pattern recognition literature, this is also referred to as the overall recognition rate of the classifier; that is, it reflects how well the classifier recognizes tuples of the various classes:

    Accuracy = (TP + TN) / (P + N)

where P and N are the numbers of positive and negative tuples in the test set.

When we want to assess how well the classifier recognizes the positive tuples and the negative tuples separately, the sensitivity and specificity measures can be used, respectively. Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly identified). These measures are defined as

    Sensitivity = TP / P        Specificity = TN / N


Confusion Matrix:

A confusion matrix represents the prediction summary in matrix form. It shows how many predictions are correct and incorrect per class.

Confusion matrix for binary classification (actual class in rows, predicted class in columns):

                   Predicted Dog    Predicted Cat
    Actual Dog           5                1
    Actual Cat           2                4

Total testing samples: 12
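
Taking "Dog" as the positive class in the matrix above, the four building blocks and the derived measures can be computed directly; a short sketch in plain Python (no library assumed):

    # Sketch: evaluation measures from the Dog/Cat confusion matrix above,
    # treating "Dog" as the positive class.
    TP, FN = 5, 1      # actual Dog predicted as Dog / as Cat
    FP, TN = 2, 4      # actual Cat predicted as Dog / as Cat

    accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 9/12 = 0.75
    precision   = TP / (TP + FP)                    # 5/7  ~= 0.71
    recall      = TP / (TP + FN)                    # 5/6  ~= 0.83  (sensitivity)
    specificity = TN / (TN + FP)                    # 4/6  ~= 0.67

    print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
          f"recall={recall:.2f} specificity={specificity:.2f}")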
