Practical Guide and Concepts: Data Mining
Question 1: Data Cleaning Techniques

Titanic Dataset

The dataset contains information about passengers on the RMS Titanic, which sank on its maiden voyage in 1912. The goal of many analyses of the dataset is to predict whether a passenger survived or not based on various features. A typical task is to build a predictive model that answers the question: "what sorts of people were more likely to survive?"
Data Cleaning Process
Handling Missing Values:
Age: This column often has missing values, so we can fill those with the median age, which is a common method for
numerical data.
Embarked: The 'Embarked' column may have missing values, so we fill those with the most common value (mode).
Cabin: This column usually has many missing values, so we create a new feature (Has_Cabin) to indicate whether a
passenger had a cabin, instead of using the actual cabin data.
Handling Outliers:
Outliers are data points that are significantly different from other observations. We can identify them using the
Interquartile Range (IQR) method, especially for 'Age' and 'Fare', to remove extreme values that might distort the
analysis.
Handling Inconsistent Values:
Gender: Ensure all values are consistent, e.g., making sure 'male' and 'female' are in lowercase and checking for typos.
Embarked: Strip extra spaces and ensure all values are consistent.
Feature Engineering:
PassengerId and Ticket: These columns are identifiers and not useful for modeling, so they are dropped.
Name: We extract the titles from the names (like Mr., Mrs.) and use them as a new feature, which may help with
predictions.
Validation Rules:
Check that 'Age' is between 0 and 100, 'Fare' is non-negative, 'Sex' contains only 'male' or 'female', and 'Embarked'
contains only valid values ('S', 'C', 'Q').
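
A minimal pandas sketch of these cleaning steps, assuming the standard Kaggle Titanic column names (Age, Embarked, Cabin, Sex, Fare, Name, PassengerId, Ticket) and a file called titanic.csv (the file name is an assumption):

import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed file name

# Missing values
df["Age"] = df["Age"].fillna(df["Age"].median())                   # median for a numerical column
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])   # mode for a categorical column
df["Has_Cabin"] = df["Cabin"].notna().astype(int)                  # flag instead of the raw cabin value

# Inconsistent values
df["Sex"] = df["Sex"].str.strip().str.lower()
df["Embarked"] = df["Embarked"].str.strip()

# Outliers in 'Age' and 'Fare' via the IQR rule
for col in ["Age", "Fare"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Feature engineering
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)  # e.g. Mr, Mrs, Miss
df = df.drop(columns=["PassengerId", "Ticket", "Cabin"])

# Validation rules
assert df["Age"].between(0, 100).all()
assert (df["Fare"] >= 0).all()
assert df["Sex"].isin(["male", "female"]).all()
assert df["Embarked"].isin(["S", "C", "Q"]).all()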
1. Missing numerical values in a column?
Fill with the median, which is a common approach for numerical columns with missing data. The median is often preferred over the mean for filling missing values because it is less sensitive to outliers (extreme values), which can skew the mean.
Question 2: Data Preprocessing Techniques

Diabetes Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
Skewed data refers to the situation where the distribution of a dataset is
not symmetrical but instead has a tail that extends more to one side. This
means that the data points are clustered more on one side of the
distribution, with a long tail on the other side.
Log Transformation: To normalize skewed data, log transformation (or
logarithmic transformation) is applied. This makes the data more symmetric
and easier for machine learning models to work with. Log transformation
reduces the impact of extreme values (outliers) by compressing the large
values while maintaining the smaller ones.
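
A minimal NumPy/pandas sketch of a log transformation, assuming a right-skewed column named 'Insulin' in a file called diabetes.csv (both names are assumptions):

import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")  # assumed file name

# log1p computes log(1 + x), which safely handles zero values
df["Insulin_log"] = np.log1p(df["Insulin"])

# Compare skewness before and after the transformation
print("before:", df["Insulin"].skew())
print("after: ", df["Insulin_log"].skew())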
Question 3: Apriori Algorithm

Grocery Transaction Dataset

We create a sample grocery transaction dataset containing different items.

What is it?

The Apriori algorithm is used to find patterns (called association rules) in a set of data, such as transactions in a supermarket or a collection of events. It helps in identifying which items tend to appear together. For example, if you buy bread, you may also be likely to buy butter. The algorithm helps in discovering these types of patterns in data.
Itemset: This is simply a group of items that appear together in a
transaction. For example, in a supermarket, an itemset could be {bread,
butter}.

Frequent Itemset: These are itemsets that appear often in the dataset.
The Apriori algorithm finds these frequent itemsets to understand what
items are commonly bought together.

Association Rule: Once we find frequent itemsets, we can create rules that suggest how one item is related to another. For example, if someone buys bread, they are likely to also buy butter. This would be an association rule: {bread} → {butter}.
Support:
Support tells us how common or frequent an itemset is in the entire dataset.
Example: Imagine you have a list of transactions where people bought different items. You want to know how often the combination of {bread, butter} happens.
If 10 out of 100 customers bought both bread and butter, the support for {bread, butter} is 10/100 = 0.1 (or 10%).
Formula: Support = (Number of transactions containing the itemset) / (Total number of transactions)
Confidence:
Confidence measures how reliable the association rule is. It tells
us how likely one item is to be purchased if the other item is
purchased.
Example: Let’s say the rule is {bread} → {butter} (If you buy
bread, you will also buy butter).
Confidence tells us the probability that if someone buys bread,
they will also buy butter.
If 8 out of 10 customers who bought bread also bought
butter, the confidence of the rule {bread} → {butter} is 8/10
= 0.8 (or 80%).
Formula: Confidence = (Number of transactions containing both items) / (Number of transactions containing the first item)
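
A minimal sketch of running Apriori on a small hand-made grocery dataset, assuming the mlxtend library is installed (the transactions and thresholds below are invented for illustration):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Sample grocery transactions (made up for illustration)
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
    ["bread", "milk"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.4 (appear in at least 40% of transactions)
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Association rules with confidence >= 0.7, e.g. {bread} -> {butter}
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])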
IQR Method (Interquartile Range Method)
The Interquartile Range (IQR) method is a technique used to identify and remove
outliers from a dataset. Outliers are data points that are significantly higher or lower
than the majority of the data.
What is IQR?
The IQR is a measure of statistical dispersion, or how spread out the data is. It is
calculated as the difference between the third quartile (Q3) and the first quartile (Q1) of
the data.
Q1 (First Quartile): The median of the lower half of the data. It is the 25th percentile.
Q3 (Third Quartile): The median of the upper half of the data. It is the 75th percentile.
IQR = Q3 - Q1: The range between the third and first quartiles.
Identifying Outliers with IQR:
Outliers are often defined as data points that fall outside the following range:
Lower Bound = Q1 - 1.5 * IQR
Upper Bound = Q3 + 1.5 * IQR
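
A minimal pandas sketch of the IQR rule applied to a numeric column, assuming a DataFrame df with an 'Age' column (the column name is an assumption):

import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows whose value in `column` lies outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Example usage on an assumed DataFrame df:
# df = remove_outliers_iqr(df, "Age")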
"Classification in Data Mining"

"Classification is a supervised learning technique that categorizes data into predefined


classes"

Real-world Applications:
Email filtering (Spam vs. Not Spam)
Medical diagnosis (Disease present vs. absent)
Credit card approval (Approve vs. Reject)
Image recognition (Cat vs. Dog)

Basic Process:
Training: Algorithm learns from labeled data
Testing: Model predicts labels for new data
Evaluation: Measure how well the model performs
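
A minimal scikit-learn sketch of this train / test / evaluate process, using the library's built-in Iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Training / testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training: the algorithm learns from labeled data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Testing: the model predicts labels for new data
y_pred = model.predict(X_test)

# Evaluation: measure how well the model performs
print("Accuracy:", accuracy_score(y_test, y_pred))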
Underfitting:
The model is too simple for the data.
It doesn’t learn enough from the training data.
Result: Poor performance on both training and testing datasets.
Cause: The model can’t capture the true patterns.

Analogy: You only skim the chapter headings, so you miss important details and fail to answer most questions.
Overfitting:
The model is too complex for the data.
It memorizes the training data, including irrelevant details (noise).
Result: Great performance on training data but poor performance
on new data.
Cause: The model learns patterns that are specific to the training
data but not general enough.

Analogy: You memorize the answers to past quiz questions word-for-word. If the teacher asks something slightly different, you get confused.
Good Learning (Best Fit):
You understand the key concepts and principles, so you can
answer both familiar and new questions confidently.
Underfitting: You draw a straight line that doesn’t capture the shape of the data.

Overfitting: You draw a squiggly line that passes through every point, even outliers.

Good Fit: You draw a smooth curve that captures the general pattern of the data.
Classification Algorithms
"Naive Bayes: Probabilistic Classification"

Core Concept:
Based on Bayes' Theorem: P(A|B) = P(B|A) × P(A) / P(B)
Uses probability to predict class membership
Bayes Theorem calculates the probability of a hypothesis being true given some
observed evidence.
Prior Probability

Definition: The initial probability of an event occurring before any new evidence is
considered.
Think of it as "what we know before looking at any specific details"

Real-Life Example:
In a classroom of 100 students:
- 60 students pass
- 40 students fail

Prior Probabilities:
P(Pass) = 60/100 = 0.6
P(Fail) = 40/100 = 0.4

This is our "starting point" - before looking at any specific student's characteristics
Conditional Probability

Definition: The probability of an event occurring given that another event has already occurred.
Think of it as "what's the chance of A happening if we know B has happened"

Real-Life Example:
Out of the 60 students who passed:
- 45 attended all classes
- 15 missed some classes

Conditional Probability:
P(Attended All|Passed) = 45/60 = 0.75
This reads as "probability of attending all classes given that the student passed"

Similarly, out of the 40 students who failed:
- 10 attended all classes
- 30 missed some classes

P(Attended All|Failed) = 10/40 = 0.25


Bayes' Theorem

Formula: P(A|B) = P(B|A) × P(A) / P(B)

In words: Posterior = (Likelihood × Prior) / Evidence

Real Example - Student Performance:
Question: If a student attended all classes, what's the probability they passed?

We know:
- P(Pass) = 0.6 [Prior]
- P(Attended|Pass) = 0.75 [Likelihood]
- P(Attended) = (45 + 10)/100 = 0.55 [Evidence]

Using Bayes' Theorem:
P(Pass|Attended) = (0.75 × 0.6) / 0.55 ≈ 0.82

This means: 82% chance of passing if a student attended all classes
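
A quick Python sketch of this calculation, using the numbers from the student example above:

# Bayes' theorem: P(Pass | Attended) = P(Attended | Pass) * P(Pass) / P(Attended)
p_pass = 0.6                   # prior
p_attended_given_pass = 0.75   # likelihood
p_attended = 0.55              # evidence

p_pass_given_attended = p_attended_given_pass * p_pass / p_attended
print(round(p_pass_given_attended, 2))  # 0.82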


Movie Genre Prediction
Step 1: Calculate Prior Probabilities
Step 2: Calculate Conditional Probabilities
Step 3: Apply Bayes Theorem
Advantages
Disadvantages
Applications
The "Naive" Assumption:
All features contribute independently to the probability, regardless of any correlations
between features.

Real Example - Movie Classification:

Features:
1. Has explosions
2. Has car chases
3. Has gunfights

Reality:
These features are likely correlated (action movies often have all three).

Naive Bayes Assumes:
Each feature independently contributes to the movie being action/non-action.
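
A minimal scikit-learn sketch of a Naive Bayes classifier on these three binary features; the tiny training set below is invented purely for illustration:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: has_explosions, has_car_chases, has_gunfights (made-up examples)
X = np.array([
    [1, 1, 1],   # action
    [1, 0, 1],   # action
    [0, 1, 1],   # action
    [0, 0, 0],   # non-action
    [0, 1, 0],   # non-action
    [0, 0, 1],   # non-action
])
y = np.array(["action", "action", "action", "non-action", "non-action", "non-action"])

# BernoulliNB treats each binary feature as conditionally independent given the class
model = BernoulliNB()
model.fit(X, y)

# Predict the genre of a new movie with explosions and gunfights but no car chases
print(model.predict([[1, 0, 1]]))
print(model.predict_proba([[1, 0, 1]]))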
DECISION TREES

Basic Structure:
Root node (starting point)
Internal nodes (decisions)
Leaf nodes (final classifications)
Decision Trees: Tree-based Classification

Example: Loan Approval


Splitting Criteria
The goal of a Decision Tree is to split the data into smaller, more homogeneous groups. The question is: how do we decide where to split?

Two main splitting criteria are commonly used:
Gini Impurity
Entropy
Gini Impurity
Entropy and Information Gain
High entropy = The group has lots of different things (very uncertain).

Low entropy = The group has mostly one thing (very certain).
High Information Gain = The split made the data much more certain
(better decision).
Low Information Gain = The split didn’t make much of a difference (not
a great decision).
Imagine a box of fruit. If the box is filled with
90% apples and 10% oranges, we are fairly
certain that if we pick a fruit randomly, it will be
an apple. This box has low entropy.
But if the box has an equal mix of apples and
oranges, we are uncertain about what we'll pick.
This box has high entropy.
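
A short Python sketch of this intuition, computing Shannon entropy for the two fruit boxes described above:

import math

def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)) over non-zero class probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Box with 90% apples, 10% oranges: fairly certain -> low entropy
print(entropy([0.9, 0.1]))   # about 0.47 bits

# Box with a 50/50 mix: very uncertain -> high entropy
print(entropy([0.5, 0.5]))   # 1.0 bit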
The Decision Tree algorithm follows a greedy approach.

It selects the best feature (based on the chosen splitting criterion) at each step to split the data. The algorithm continues to split the data at each node until:

All data points in a node belong to the same class (pure node).
Or a stopping condition is met (e.g., a maximum depth or minimum number of samples per leaf).
Summary
From Code:

max_depth: Maximum depth of the tree (limits how deep the tree can grow).

min_samples_split: Minimum number of samples required to split an internal node.

min_samples_leaf: Minimum number of samples required to be at a leaf node.

random_state: Controls the randomness of the tree-building process, making results reproducible.
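
A minimal scikit-learn sketch showing these parameters on a DecisionTreeClassifier; the specific values and the Iris data are illustrative assumptions, not taken from the original code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(
    max_depth=3,           # limits how deep the tree can grow
    min_samples_split=10,  # minimum samples required to split an internal node
    min_samples_leaf=5,    # minimum samples required at a leaf node
    random_state=42,       # makes the tree-building process reproducible
)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))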
