Practical Guide and Concepts: Data Mining
Question 1: Data Cleaning Techniques

Titanic Dataset

The dataset contains information about passengers on the RMS Titanic, which sank on its maiden voyage in 1912. The goal of many analyses of the dataset is to predict whether a passenger survived or not based on various features. A typical task is to build a predictive model that answers the question: "what sorts of people were more likely to survive?"
Data Cleaning Process
Handling Missing Values:
Age: This column often has missing values, so we can fill those with the median age, which is a common method for
numerical data.
Embarked: The 'Embarked' column may have missing values, so we fill those with the most common value (mode).
Cabin: This column usually has many missing values, so we create a new feature (Has_Cabin) to indicate whether a
passenger had a cabin, instead of using the actual cabin data.
Handling Outliers:
Outliers are data points that are significantly different from other observations. We can identify them using the
Interquartile Range (IQR) method, especially for 'Age' and 'Fare', to remove extreme values that might distort the
analysis.
Handling Inconsistent Values:
Gender: Ensure all values are consistent, e.g., making sure 'male' and 'female' are in lowercase and checking for typos.
Embarked: Strip extra spaces and ensure all values are consistent.
Feature Engineering:
PassengerId and Ticket: These columns are identifiers and not useful for modeling, so they are dropped.
Name: We extract the titles from the names (like Mr., Mrs.) and use them as a new feature, which may help with
predictions.
Validation Rules:
Check that 'Age' is between 0 and 100, 'Fare' is non-negative, 'Sex' contains only 'male' or 'female', and 'Embarked'
contains only valid values ('S', 'C', 'Q').
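
A minimal pandas sketch of these cleaning steps, assuming the standard Kaggle Titanic column names (Age, Embarked, Cabin, Sex, Fare, Name, PassengerId, Ticket) and a file called titanic.csv (the file name is an assumption):

import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed file name

# Missing values
df["Age"] = df["Age"].fillna(df["Age"].median())                   # median for a numerical column
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])   # mode for a categorical column
df["Has_Cabin"] = df["Cabin"].notna().astype(int)                  # flag instead of the raw cabin value

# Inconsistent values
df["Sex"] = df["Sex"].str.strip().str.lower()
df["Embarked"] = df["Embarked"].str.strip()

# Outliers in 'Age' and 'Fare' via the IQR rule
for col in ["Age", "Fare"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Feature engineering
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)  # e.g. Mr, Mrs, Miss
df = df.drop(columns=["PassengerId", "Ticket", "Cabin"])

# Validation rules
assert df["Age"].between(0, 100).all()
assert (df["Fare"] >= 0).all()
assert df["Sex"].isin(["male", "female"]).all()
assert df["Embarked"].isin(["S", "C", "Q"]).all()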
1. Missing numerical values in a column?
Fill with the median, which is a common approach for numerical columns with missing data. The median is often preferred over the mean for filling missing values because it is less sensitive to outliers (extreme values), which can skew the mean.
Question 2: Data Preprocessing Techniques

Diabetes Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
Skewed data refers to the situation where the distribution of a dataset is
not symmetrical but instead has a tail that extends more to one side. This
means that the data points are clustered more on one side of the
distribution, with a long tail on the other side.
Log Transformation: To normalize skewed data, log transformation (or
logarithmic transformation) is applied. This makes the data more symmetric
and easier for machine learning models to work with. Log transformation
reduces the impact of extreme values (outliers) by compressing the large
values while maintaining the smaller ones.
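
A minimal NumPy/pandas sketch of a log transformation, assuming a right-skewed column named 'Insulin' in a file called diabetes.csv (both names are assumptions):

import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")  # assumed file name

# log1p computes log(1 + x), which safely handles zero values
df["Insulin_log"] = np.log1p(df["Insulin"])

# Compare skewness before and after the transformation
print("before:", df["Insulin"].skew())
print("after: ", df["Insulin_log"].skew())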
Question 3: Apriori Algorithm

Grocery Transaction Dataset

We create a sample grocery transaction dataset containing different items.

What is it?

The Apriori algorithm is used to find patterns (called association rules) in a set of data, such as transactions in a supermarket or a collection of events. It helps in identifying which items tend to appear together. For example, if you buy bread, you may also be likely to buy butter. The algorithm helps in discovering these types of patterns in data.
Itemset: This is simply a group of items that appear together in a
transaction. For example, in a supermarket, an itemset could be {bread,
butter}.

Frequent Itemset: These are itemsets that appear often in the dataset.
The Apriori algorithm finds these frequent itemsets to understand what
items are commonly bought together.

Association Rule: Once we find frequent itemsets, we can create rules that suggest how one item is related to another. For example, if someone buys bread, they are likely to also buy butter. This would be an association rule: {bread} → {butter}.
Support:
Support tells us how common or frequent an itemset is in the entire dataset.
Example: Imagine you have a list of transactions where people bought different items. You want to know how often the combination of {bread, butter} happens.
If 10 out of 100 customers bought both bread and butter, the support for {bread, butter} is 10/100 = 0.1 (or 10%).
Formula: Support = (Number of transactions containing the itemset) / (Total number of transactions)
Confidence:
Confidence measures how reliable the association rule is. It tells
us how likely one item is to be purchased if the other item is
purchased.
Example: Let’s say the rule is {bread} → {butter} (If you buy
bread, you will also buy butter).
Confidence tells us the probability that if someone buys bread,
they will also buy butter.
If 8 out of 10 customers who bought bread also bought
butter, the confidence of the rule {bread} → {butter} is 8/10
= 0.8 (or 80%).
Formula: Confidence = (Number of transactions containing both items) / (Number of transactions containing the first item)
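
A minimal sketch of running Apriori on a small hand-made grocery dataset, assuming the mlxtend library is installed (the transactions and thresholds below are invented for illustration):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Sample grocery transactions (made up for illustration)
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
    ["bread", "milk"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.4 (appear in at least 40% of transactions)
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Association rules with confidence >= 0.7, e.g. {bread} -> {butter}
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])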
IQR Method (Interquartile Range Method)
The Interquartile Range (IQR) method is a technique used to identify and remove
outliers from a dataset. Outliers are data points that are significantly higher or lower
than the majority of the data.
What is IQR?
The IQR is a measure of statistical dispersion, or how spread out the data is. It is
calculated as the difference between the third quartile (Q3) and the first quartile (Q1) of
the data.
Q1 (First Quartile): The median of the lower half of the data. It is the 25th percentile.
Q3 (Third Quartile): The median of the upper half of the data. It is the 75th percentile.
IQR = Q3 - Q1: The range between the third and first quartiles.
Identifying Outliers with IQR:
Outliers are often defined as data points that fall outside the following range:
Lower Bound = Q1 - 1.5 * IQR
Upper Bound = Q3 + 1.5 * IQR
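
A minimal pandas sketch of the IQR rule applied to a numeric column, assuming a DataFrame df with an 'Age' column (the column name is an assumption):

import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows whose value in `column` lies outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Example usage on an assumed DataFrame df:
# df = remove_outliers_iqr(df, "Age")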
"Classification in Data Mining"

"Classification is a supervised learning technique that categorizes data into predefined


classes"

Real-world Applications:
Email filtering (Spam vs. Not Spam)
Medical diagnosis (Disease present vs. absent)
Credit card approval (Approve vs. Reject)
Image recognition (Cat vs. Dog)

Basic Process:
Training: Algorithm learns from labeled data
Testing: Model predicts labels for new data
Evaluation: Measure how well the model performs
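
A minimal scikit-learn sketch of this train / test / evaluate process, using the library's built-in Iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Training / testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training: the algorithm learns from labeled data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Testing: the model predicts labels for new data
y_pred = model.predict(X_test)

# Evaluation: measure how well the model performs
print("Accuracy:", accuracy_score(y_test, y_pred))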
Underfitting:
The model is too simple for the data.
It doesn’t learn enough from the training data.
Result: Poor performance on both training and testing datasets.
Cause: The model can’t capture the true patterns.

Analogy: You only skim the chapter headings, so you miss important details and fail to answer most questions.
Overfitting:
The model is too complex for the data.
It memorizes the training data, including irrelevant details (noise).
Result: Great performance on training data but poor performance
on new data.
Cause: The model learns patterns that are specific to the training
data but not general enough.

Analogy: You memorize the answers to past quiz questions word-for-word. If the teacher asks something slightly different, you get confused.
Good Learning (Best Fit):
You understand the key concepts and principles, so you can
answer both familiar and new questions confidently.
Underfitting: You draw a straight line that doesn’t capture the shape of the data.

Overfitting: You draw a squiggly line that passes through every point, even outliers.

Good Fit: You draw a smooth curve that captures the general pattern of the data.
Classification Algorithms
"Naive Bayes: Probabilistic Classification"

Core Concept:
Based on Bayes' Theorem: P(A|B) = P(B|A) × P(A) / P(B)
Uses probability to predict class membership
Bayes Theorem calculates the probability of a hypothesis being true given some
observed evidence.
Prior Probability

Definition: The initial probability of an event occurring before any new evidence is
considered.
Think of it as "what we know before looking at any specific details"

Real-Life Example:
In a classroom of 100 students:
- 60 students pass
- 40 students fail

Prior Probabilities:
P(Pass) = 60/100 = 0.6
P(Fail) = 40/100 = 0.4

This is our "starting point" - before looking at any specific student's characteristics
Conditional Probability

Definition: The probability of an event occurring given that another event has already occurred.
Think of it as "what's the chance of A happening if we know B has happened"

Real-Life Example:
Out of the 60 students who passed:
- 45 attended all classes
- 15 missed some classes

Conditional Probability:
P(Attended All|Passed) = 45/60 = 0.75
This reads as "probability of attending all classes given that the student passed"

Similarly, out of the 40 students who failed:
- 10 attended all classes
- 30 missed some classes

P(Attended All|Failed) = 10/40 = 0.25


Bayes' Theorem

Formula: P(A|B) = P(B|A) × P(A) / P(B)

In words: Posterior = (Likelihood × Prior) / Evidence

Real Example - Student Performance:
Question: If a student attended all classes, what's the probability they passed?

We know:
- P(Pass) = 0.6 [Prior]
- P(Attended|Pass) = 0.75 [Likelihood]
- P(Attended) = (45 + 10)/100 = 0.55 [Evidence]

Using Bayes' Theorem:
P(Pass|Attended) = (0.75 × 0.6) / 0.55 ≈ 0.82

This means: 82% chance of passing if a student attended all classes
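
A quick Python sketch of this calculation, using the numbers from the student example above:

# Bayes' theorem: P(Pass | Attended) = P(Attended | Pass) * P(Pass) / P(Attended)
p_pass = 0.6                   # prior
p_attended_given_pass = 0.75   # likelihood
p_attended = 0.55              # evidence

p_pass_given_attended = p_attended_given_pass * p_pass / p_attended
print(round(p_pass_given_attended, 2))  # 0.82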


Movie Genre Prediction
Step 1: Calculate Prior Probabilities
Step 2: Calculate Conditional Probabilities
Step 3: Apply Bayes Theorem
Advantages
Disadvantages
Applications
The "Naive" Assumption:
All features contribute independently to the probability, regardless of any correlations
between features.

Real Example - Movie Classification:

Features:
1. Has explosions
2. Has car chases
3. Has gunfights

Reality:
These features are likely correlated (action movies often have all three).

Naive Bayes Assumes:
Each feature independently contributes to the movie being action/non-action.
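
A minimal scikit-learn sketch of a Naive Bayes classifier on these three binary features; the tiny training set below is invented purely for illustration:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns: has_explosions, has_car_chases, has_gunfights (made-up examples)
X = np.array([
    [1, 1, 1],   # action
    [1, 0, 1],   # action
    [0, 1, 1],   # action
    [0, 0, 0],   # non-action
    [0, 1, 0],   # non-action
    [0, 0, 1],   # non-action
])
y = np.array(["action", "action", "action", "non-action", "non-action", "non-action"])

# BernoulliNB treats each binary feature as conditionally independent given the class
model = BernoulliNB()
model.fit(X, y)

# Predict the genre of a new movie with explosions and gunfights but no car chases
print(model.predict([[1, 0, 1]]))
print(model.predict_proba([[1, 0, 1]]))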
DECISION TREES

Basic Structure:
Root node (starting point)
Internal nodes (decisions)
Leaf nodes (final classifications)
Decision Trees: Tree-based Classification

Example: Loan Approval


Splitting Criteria
The goal of a Decision Tree is to split the data into smaller, more homogeneous groups. The question is: how do we decide where to split?

Two main splitting criteria are commonly used:
Gini Impurity
Entropy
Gini Impurity
Entropy and Information Gain
High entropy = The group has lots of different things (very uncertain).

Low entropy = The group has mostly one thing (very certain).
High Information Gain = The split made the data much more certain
(better decision).
Low Information Gain = The split didn’t make much of a difference (not
a great decision).
Imagine a box of fruit. If the box is filled with
90% apples and 10% oranges, we are fairly
certain that if we pick a fruit randomly, it will be
an apple. This box has low entropy.
But if the box has an equal mix of apples and
oranges, we are uncertain about what we'll pick.
This box has high entropy.
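
A short Python sketch of this intuition, computing Shannon entropy for the two fruit boxes described above:

import math

def entropy(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)) over non-zero class probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Box with 90% apples, 10% oranges: fairly certain -> low entropy
print(entropy([0.9, 0.1]))   # about 0.47 bits

# Box with a 50/50 mix: very uncertain -> high entropy
print(entropy([0.5, 0.5]))   # 1.0 bit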
The Decision Tree algorithm follows a greedy approach.

It selects the best feature (based on the chosen splitting criterion) at each step to split the data. The algorithm continues to split the data at each node until:

All data points in a node belong to the same class (pure node).
Or a stopping condition is met (e.g., a maximum depth or minimum number of samples per leaf).
Summary
From Code:

max_depth: Maximum depth of the tree (limits how deep the tree can grow).

min_samples_split: Minimum number of samples required to split an internal node.

min_samples_leaf: Minimum number of samples required to be at a leaf node.

random_state: Controls the randomness of the tree-building process, making results reproducible.
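
A minimal scikit-learn sketch showing these parameters on a DecisionTreeClassifier; the specific values and the Iris data are illustrative assumptions, not taken from the original code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(
    max_depth=3,           # limits how deep the tree can grow
    min_samples_split=10,  # minimum samples required to split an internal node
    min_samples_leaf=5,    # minimum samples required at a leaf node
    random_state=42,       # makes the tree-building process reproducible
)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))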
