Unit 4 Data Science
Program Name: B.C.A    Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2    No. of Credits: 03
Contact Hours: 42 Hours    Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40    Summative Assessment Marks: 60
Course Outcomes (COs): After the successful completion of the course, the student will be able to:
CO1 Understand the concepts of data and pre-processing of data.
CO2 Know simple pattern recognition methods
CO3 Understand the basic concepts of Clustering and Classification
CO4 Know the recent trends in Data Science
Contents (42 Hrs)
Unit I: Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) vs Data Mining, DBMS vs Data Mining, DM Techniques, Problems, Issues and Challenges in DM, DM Applications. (8 Hrs)
Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization. (8 Hrs)
Mining Frequent Patterns: Basic Concept, Frequent Itemset Mining Methods, Apriori and Frequent Pattern Growth (FP-Growth) Algorithms, Mining Association Rules. (8 Hrs)
Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction, Bayes Classification Methods, Rule-Based Classification, Lazy Learners (or Learning from Your Neighbors), k-Nearest Neighbor, Prediction, Accuracy, Precision and Recall. (10 Hrs)
Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Evaluation of Clustering. (8 Hrs)
Unit 4
Topics:
Def: Classification is a supervised machine learning method in which the model tries to predict the correct class label for given input data.
1. Learning/Training Step:
Here a model is constructed for classification. A classifier model is built by analyzing the data
which are labeled already. Because the class label of each training tuple is provided, this step is
also known as supervised learning. This stage can also be viewed as a function, y=f(x), that can
predict the associated class label ‘y’ of a given tuple x considering attribute values. This mapping
function is represented in the form of classification rules, decision trees or mathematical formula.
2. Classification/Testing Step:
Here the model that is constructed in the learning step is used to predict class labels for given
data.
Ex: A bank loan officer needs to classify a loan applicant as safe or risky (Figure 8.1).
The accuracy of classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier.
Decision Tree Induction:
Decision Tree Induction is the learning of decision trees from class-labelled training tuples.
Decision Tree is a tree structure, where each internal node (non leaf) denotes a test on the attribute,
each branch represents an outcome of the test, and each leaf node holds a class label. The attribute
values of a tuple X are tested against the decision tree, and a path is traced from the root to a leaf to
predict the class label.
Advantages:
Applications:
Method:
Create a node N;
If tuples in D are all of the same class C, then return N as a leaf node labeled with the class C;
If the attribute list is empty, then return N as a leaf node labeled with the majority class in D;
Apply attribute selection method(D, attribute list) to find the best splitting criterion;
For each outcome j of the splitting criterion: // partition the tuples and grow subtrees
    Let Dj be the set of tuples in D satisfying outcome j;
    If Dj is empty, then attach a leaf labeled with the majority class in D to node N;
    Else attach the node returned by decision tree generation(Dj, attribute list) to node N;
End for;
Return N;
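The recursive method above can be sketched in Python. This is a minimal illustration, not a full implementation: the helper names (`select_attribute`, `build_tree`, the toy data) are my own, and information gain is assumed as the attribute selection measure.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy of a class-label partition
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def select_attribute(rows, labels, attributes):
    # Stand-in for "attribute selection method": highest information gain
    def gain(attr):
        n = len(labels)
        remainder = 0.0
        for v in {r[attr] for r in rows}:
            sub = [l for r, l in zip(rows, labels) if r[attr] == v]
            remainder += len(sub) / n * entropy(sub)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def majority_class(labels):
    # Most frequent class label in a partition
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes):
    # All tuples of the same class C -> leaf labeled with C
    if len(set(labels)) == 1:
        return labels[0]
    # Attribute list empty -> leaf labeled with the majority class
    if not attributes:
        return majority_class(labels)
    best = select_attribute(rows, labels, attributes)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for v in {r[best] for r in rows}:          # one branch per outcome
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        if not subset:                          # Dj empty -> majority leaf
            node["branches"][v] = majority_class(labels)
        else:
            sub_rows = [r for r, _ in subset]
            sub_labels = [l for _, l in subset]
            node["branches"][v] = build_tree(sub_rows, sub_labels, remaining)
    return node

# Hypothetical toy data: the class depends only on attribute "a"
rows = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"},
        {"a": "y", "b": "p"}, {"a": "y", "b": "q"}]
labels = ["yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["a", "b"])
print(tree)   # the root splits on "a"; each branch is a pure leaf
```

Because attribute "a" separates the classes perfectly, the tree consists of a single test node with two leaves.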
Attribute Selection Measures:
An attribute selection measure is a heuristic for selecting the splitting criterion that 'best' separates a given data partition D of class-labeled training tuples into individual classes. Such measures are also called splitting rules.
The attribute having the best score under the measure is chosen as the splitting attribute for the given tuples. Three popular measures are:
1. Information gain
2. Gain ratio
3. Gini index
Information Gain:
Based on the work by Claude Shannon on information theory. Information gain is defined
as the difference between the original information requirement and the new requirement after
partitioning. The attribute with the highest information gain is chosen as the splitting attribute
for node N. It is used in ID3.
Info(D) = -Σ(i = 1 to m) pi log2(pi)
where pi is the probability that a tuple in D belongs to class Ci, and m is the number of distinct classes.
How much more information is still needed to arrive at an exact classification after partitioning
on attribute A is given by the expected information requirement:
InfoA(D) = Σ(j = 1 to v) (|Dj| / |D|) * Info(Dj)
The information gain of attribute A is then
Gain(A) = Info(D) - InfoA(D)
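The two quantities above can be computed directly from class counts. A minimal sketch (the function names and toy data are mine, chosen for illustration):

```python
import math
from collections import Counter

def info(labels):
    # Info(D) = -sum pi * log2(pi): the entropy of a class-label partition
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    # Gain(A) = Info(D) - InfoA(D), where InfoA(D) weights the entropy
    # of each partition Dj by |Dj| / |D|
    n = len(labels)
    info_a = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        info_a += len(sub) / n * info(sub)
    return info(labels) - info_a

# Hypothetical attribute that splits 4 tuples into two pure partitions
values = ["low", "low", "high", "high"]
labels = ["yes", "yes", "no", "no"]
print(round(info(labels), 3))               # 50/50 split -> Info(D) = 1.0
print(round(info_gain(values, labels), 3))  # perfect split -> gain = 1.0
```

A pure partition contributes zero entropy, so a perfect split attains the maximum possible gain, Info(D) itself.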
Gain Ratio:
Information gain measure is biased toward test with many outcomes (many partitions).
That is, it prefers to select attributes having a large number of values. For example, consider an
attribute that acts as a unique identifier such as product ID. Split on product ID would result in a
large number of partitions (as many as there are values), each one containing just one tuple. Gain
ratio overcomes this bias. The attribute with the maximum gain ratio is selected as the splitting attribute. It
is used in C4.5.
It applies a kind of normalization to information gain using a split information value defined as
SplitInfoA(D) = -Σ(j = 1 to v) (|Dj| / |D|) * log2(|Dj| / |D|)
where v is the number of partitions of D induced by attribute A. The gain ratio is then defined as
GainRatio(A) = Gain(A) / SplitInfoA(D)
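A short sketch of the normalization step. The gain value here is assumed (e.g., from a perfect binary split); the function name and data are mine:

```python
import math
from collections import Counter

def split_info(values):
    # SplitInfoA(D) = -sum (|Dj|/|D|) * log2(|Dj|/|D|), over the
    # partitions induced by the attribute's values
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

gain = 1.0                              # assumed Gain(A) for illustration
values = ["low", "low", "high", "high"] # an even two-way split
print(gain / split_info(values))        # GainRatio(A); SplitInfo = 1.0 here
```

Note that an attribute with many near-unique values (e.g., a product ID) has a large SplitInfo, which shrinks its gain ratio and counteracts the bias described above.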
Gini Index:
Gini(D) = 1 - Σ(i = 1 to m) pi^2
where pi is the probability that a tuple in D belongs to class Ci, pi = |Ci,D| / |D|, and m is the number of classes.
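The Gini index is a one-liner over the class counts; a minimal sketch with toy labels of my own choosing:

```python
from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum pi^2, where pi = |Ci,D| / |D|
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))  # 1 - (0.5^2 + 0.5^2) = 0.5
print(gini(["yes", "yes", "yes"]))       # pure partition -> 0.0
```

Gini(D) measures impurity: it is 0 for a pure partition and largest when the classes are evenly mixed.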
Tree Pruning:
When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers. Tree pruning removes the branches that are not relevant. Pruned trees are
smaller and less complex, and thus easier to understand. They also classify faster than unpruned trees.
• Pre-pruning: The growth of a branch is halted early (the node is not split into sub-branches),
deciding via statistical measures such as information gain or the Gini index.
• Post-pruning: Branches are cut from a fully grown tree and replaced by leaf nodes. Each leaf
is labeled with the most frequent class among the subtree being replaced.
Pruning Algorithms:
▪ Decision trees can suffer from repetition (the same test repeats along a path in the tree) and replication (duplicate subtrees).
▪ The use of multivariate splits reduces these problems.
Naive Bayes classification assumes class-conditional independence: each attribute contributes
independently to the class probability. For example, a fruit that is red, round and of a certain
diameter may be classified as an apple; each of these features contributes independently to the
probability that this fruit is an apple, regardless of any possible correlations between the
color, roundness and diameter.
Ex: P(Ci|X): Probability that a customer X will buy a computer given that we know the
age and income of the customer.
Ex: P(X|Ci): Probability of observing customer X (e.g., income Rs. 40,000) given that we know the
customer will buy a computer.
Ex: P(Ci): Probability that any given customer will buy a computer, regardless of the
attribute values.
3) Since P(X) is constant for all classes, the denominator is not considered and only
P(X|Ci) P(Ci) needs to be computed for each class.
4) To predict the class label of X, P(X|Ci) P(Ci) is evaluated for every class Ci, and the class with
the maximum P(X|Ci) P(Ci) is assigned as the label.
Dataset (Training)
Weather    Play
Sunny      Yes
Overcast   No
Rainy      No
Sunny      No
Overcast   Yes
Rainy      No
Sunny      Yes
Overcast   No
P(No|Sunny) = [P(Sunny|No) * P(No)] / P(Sunny) = [(1/5) * (5/8)] / (3/8) = 0.125 / 0.375 ≈ 0.33
Similarly, P(Yes|Sunny) = [P(Sunny|Yes) * P(Yes)] / P(Sunny) = [(2/3) * (3/8)] / (3/8) ≈ 0.67,
so the class Yes is predicted for Sunny.
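The posterior computation can be reproduced by counting directly over the training table above. A minimal sketch (the function name is mine):

```python
# The training tuples from the Weather/Play table
data = [("Sunny", "Yes"), ("Overcast", "No"), ("Rainy", "No"),
        ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "No"),
        ("Sunny", "Yes"), ("Overcast", "No")]

def posterior(weather, play):
    # Bayes' theorem: P(play|weather) = P(weather|play) * P(play) / P(weather)
    n = len(data)
    n_play = sum(1 for _, p in data if p == play)
    n_both = sum(1 for w, p in data if w == weather and p == play)
    n_weather = sum(1 for w, _ in data if w == weather)
    return (n_both / n_play) * (n_play / n) / (n_weather / n)

print(round(posterior("Sunny", "No"), 2))   # (1/5 * 5/8) / (3/8) ≈ 0.33
print(round(posterior("Sunny", "Yes"), 2))  # (2/3 * 3/8) / (3/8) ≈ 0.67
```

Since 0.67 > 0.33, the classifier predicts Play = Yes for a Sunny tuple, matching the rule of picking the class with the maximum posterior.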
Advantages
- Easy to implement
- Good results are obtained in most of the cases
Disadvantages
- The assumption of class-conditional independence causes a loss of accuracy
- In practice, dependencies exist among variables
- E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.)
and disease (lung cancer, diabetes, etc.) are not independent of one another
Rule-based classifiers are used for classification by defining a set of rules that can be used to assign
class labels to new instances of data based on their attribute values. These rules can be created
using expert knowledge of the domain, or they can be learned automatically from a set of labeled
training data. A rule-based classifier uses a set of IF-THEN rules for classification.
R1: IF age == youth AND student == yes THEN buys computer == yes.
The “IF” part (or left side) of a rule is known as the rule antecedent or precondition.
In the rule antecedent, the condition consists of one or more attribute tests (e.g., age == youth and
student == yes) that are logically ANDed.
The rule’s consequent contains a class prediction (in this case, we are predicting whether a
customer will buy a computer).
R1 can also be written as R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).
If the condition (i.e., all the attribute tests) in a rule antecedent holds true for a given tuple, we say
that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the
tuple.
If more than one rule applies to a given tuple, a conflict arises as to which rule should be selected.
Conflict resolution strategies such as rule ordering and size ordering must be applied to break the
tie.
In rule-based ordering, the rules are organized based on priority, according to some measure of
rule quality, such as accuracy, coverage, or size (number of attribute tests in the rule antecedent),
or based on advice from domain experts. When rule ordering is used, the rule set is known as a
decision list. Class is predicted for the tuple based on the priority, and any other rule that satisfies
tuple is ignored.
In size-based ordering, the rule with the toughest requirements is chosen, where "toughest" means
the longest rule antecedent (the largest number of attribute tests satisfied, e.g., 5 conditions
rather than 3), and the tie is broken in its favor.
In rule-based classification, coverage is the percentage of records that satisfy the antecedent
conditions of a rule:
Coverage(R) = n1 / n
where n1 = number of tuples that satisfy the antecedent of R, and n = number of training tuples.
Accuracy is the percentage of covered records that also meet the consequent value of
the rule:
Accuracy(R) = n2 / n1
where n2 = number of tuples that satisfy both the antecedent and the consequent of R.
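Both measures can be checked with a few counts. A sketch over hypothetical labelled tuples, using rule R1 from above (the tuples themselves are invented for illustration):

```python
# Hypothetical labelled tuples: (age, student, buys_computer)
data = [("youth", "yes", "yes"), ("youth", "yes", "no"),
        ("youth", "no", "no"), ("senior", "yes", "yes")]

# Rule R1: IF age == youth AND student == yes THEN buys_computer == yes
antecedent = lambda t: t[0] == "youth" and t[1] == "yes"
consequent = lambda t: t[2] == "yes"

n = len(data)
n1 = sum(1 for t in data if antecedent(t))                    # tuples covered by R1
n2 = sum(1 for t in data if antecedent(t) and consequent(t))  # covered and correct

print(n1 / n)   # Coverage(R1) = 2/4 = 0.5
print(n2 / n1)  # Accuracy(R1) = 1/2 = 0.5
```

Here R1 covers two of the four tuples (coverage 50%) and is correct on one of those two (accuracy 50%).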
Rule Pruning
A rule is pruned for the following reason:
The assessment of rule quality is made on the original set of training data. The rule may perform
well on the training data but less well on subsequent data; that is why rule pruning is required.
A rule R is pruned by removing a conjunct (attribute test). R is pruned if the pruned version of R has
greater quality, as assessed on an independent set of tuples (the pruning set).
FOIL is one of the simplest and most effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note: This value increases with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
The classification methods discussed so far, such as decision tree induction, Bayesian classification
and rule-based classification, are all examples of eager learners. Eager learners employ a two-step
approach to classification: in the first step they build a classifier model by learning from the training
set, and in the second step they use the model to determine the class of unknown tuples.
Lazy learning algorithms wait until they encounter a new tuple (from the testing dataset), and only
then compare it against the stored training examples to make a prediction. This type of learning is
useful when working with large datasets that have few attributes. Lazy learning is also known as
instance-based or memory-based learning.
• Computationally expensive, since most of the work is deferred to the classification stage
• Requires more memory, as the training data must be kept available during the classification stage
1. Assign a value to K
2. Calculate the distance (e.g., the Euclidean distance) between the new data entry and all
existing data entries
3. Arrange the distances in ascending order
4. Determine the k-closest records of the training data set for each new record
5. Take the majority vote to classify the data point.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2
= (x21, x22, ..., x2n), is
dist(X1, X2) = sqrt( Σ(i = 1 to n) (x1i - x2i)^2 )
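The five steps above can be sketched compactly; this is a minimal illustration with hypothetical 2-D training points and function names of my own:

```python
import math
from collections import Counter

def euclidean(x1, x2):
    # dist(X1, X2) = sqrt(sum (x1i - x2i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, query, k):
    # train: list of (point, label) pairs
    # Steps 2-3: compute distances to the query and sort ascending
    nearest = sorted(train, key=lambda t: euclidean(t[0], query))
    # Step 4: keep the k closest records; step 5: majority vote
    votes = [label for _, label in nearest[:k]]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical training data: two clusters, classes "A" and "B"
train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(train, (1.5, 1.5), k=3))  # two of three neighbors are "A"
```

A query near the first cluster is assigned class "A" because, among its 3 nearest neighbors, class "A" holds the majority.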
Example:
Advantages
Disadvantages:
• Computationally expensive
• Accuracy reduces if there is noise in the dataset
• Requires large memory
• Need to accurately determine the value of k neighbors
There are four terms we need to know that are the “building blocks” used in computing many
evaluation measures. Understanding them will make it easy to grasp the meaning of the various
measures.
True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier.
E.g., a person with COVID-19 correctly labelled as COVID-19 positive.
True negatives (TN): These are the negative tuples that were correctly labeled by the classifier.
E.g., a person without the COVID-19 virus correctly labelled as COVID-19 negative.
False positives (FP): These are the negative tuples that were incorrectly labeled as positive.
E.g., a person without the COVID-19 virus incorrectly labelled as COVID-19 positive.
False negatives (FN): These are the positive tuples that were mislabeled as negative.
E.g., a person with COVID-19 incorrectly labelled as COVID-19 negative.
Precision and recall are metrics used to evaluate the performance of classification models in
machine learning. Precision is the percentage of positive identifications that are correct, while
recall is the percentage of actual positives that are identified correctly:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
• If you have 20 positive items and the model predicted 10 items as positive, and out of these
10 predictions 4 were wrong and 6 were correct:
• The precision is 6/10, i.e., 60%, while the recall is 6/20, i.e., 30%.
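The arithmetic of that small example can be checked in a few lines (variable names are mine):

```python
total_actual_positive = 20  # all items belonging to the positive class
predicted_positive = 10     # items the model flagged as positive
correct = 6                 # true positives among the 10 predictions

precision = correct / predicted_positive   # fraction of predictions that are right
recall = correct / total_actual_positive   # fraction of actual positives found

print(precision)  # 0.6 -> 60%
print(recall)     # 0.3 -> 30%
```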
Accuracy:
The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier. In the pattern recognition literature, this is also referred to as the overall
recognition rate of the classifier, that is, it reflects how well the classifier recognizes tuples of the
various classes. That is,
Accuracy = (TP + TN) / (P + N)
where P and N are the total numbers of positive and negative test tuples, respectively.
When the classes are imbalanced, the sensitivity and specificity measures can be used instead. Sensitivity is
also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are
correctly identified), while specificity is the true negative rate (i.e., the proportion of negative
tuples that are correctly identified). These measures are defined as
Sensitivity = TP / P
Specificity = TN / N
Confusion Matrix:
A confusion matrix represents the prediction summary in matrix form. It shows how many
predictions are correct and incorrect per class.

                  Predicted
                  Dog   Cat
Actual   Dog       5     1
         Cat       2     4

Total testing samples: 12
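The evaluation measures defined above can be read off this matrix. A sketch taking "Dog" as the positive class (that choice is mine; swapping the classes swaps the roles of the four counts):

```python
# Counts from the confusion matrix above, with "Dog" as the positive class
TP, FN = 5, 1   # actual Dog predicted as Dog / as Cat
FP, TN = 2, 4   # actual Cat predicted as Dog / as Cat

accuracy = (TP + TN) / (TP + TN + FP + FN)  # correct predictions over all 12
precision = TP / (TP + FP)
recall = TP / (TP + FN)                     # sensitivity / true positive rate
specificity = TN / (TN + FP)                # true negative rate

print(round(accuracy, 3))     # 9/12 = 0.75
print(round(precision, 3))    # 5/7 ≈ 0.714
print(round(recall, 3))       # 5/6 ≈ 0.833
print(round(specificity, 3))  # 4/6 ≈ 0.667
```

Of the 12 test samples, 9 lie on the matrix diagonal (correct predictions), which gives the 75% accuracy directly.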