Practical Guide and Concepts: Data Mining
Titanic Dataset
Diabetes Dataset
Frequent Itemset: An itemset that appears in at least a minimum fraction of the transactions (the support threshold), i.e. an itemset that appears often in the dataset.
The Apriori algorithm finds these frequent itemsets to understand what items are commonly bought together.
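To make this concrete, here is a minimal sketch of frequent-itemset mining with the Apriori implementation in the mlxtend library (assumed to be installed); the transactions and the min_support value are made up for illustration.

# Minimal sketch: mining frequent itemsets with Apriori (mlxtend assumed installed)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Illustrative shopping transactions
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Keep only itemsets that appear in at least 40% of the transactions
frequent_itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
print(frequent_itemsets)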
Real-world Applications:
Email filtering (Spam vs. Not Spam)
Medical diagnosis (Disease present vs. absent)
Credit card approval (Approve vs. Reject)
Image recognition (Cat vs. Dog)
Basic Process:
Training: Algorithm learns from labeled data
Testing: Model predicts labels for new data
Evaluation: Measure how well the model performs
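A minimal sketch of this train / test / evaluate cycle with scikit-learn; the synthetic data and the Naive Bayes model are chosen only for illustration, not tied to the guide's datasets.

# Minimal sketch of the training / testing / evaluation cycle
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic labeled data standing in for a real dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)                          # Training: learn from labeled data
y_pred = model.predict(X_test)                       # Testing: predict labels for new data
print("Accuracy:", accuracy_score(y_test, y_pred))   # Evaluation: measure performance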
Underfitting:
The model is too simple for the data.
It doesn’t learn enough from the training data.
Result: Poor performance on both training and testing datasets.
Cause: The model can’t capture the true patterns.
Overfitting:
The model is too complex and memorizes the training data, including its noise and outliers.
Result: High performance on the training data but poor performance on the testing data.
Analogy: You draw a squiggly line that passes through every point, even outliers.
Good Fit: You draw a smooth curve that captures the general pattern of the data.
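A minimal sketch of how this shows up in practice, assuming scikit-learn and synthetic data; the depth values are illustrative only.

# Minimal sketch: model complexity vs. under- and overfitting
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [1, 4, None]:  # too simple, moderate, unrestricted
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")

# Typically: depth 1 scores poorly on both sets (underfitting), while the
# unrestricted tree scores near 1.0 on training but noticeably lower on testing (overfitting).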
Classification Algorithms
"Naive Bayes: Probabilistic Classification"
Core Concept:
Based on Bayes' Theorem: P(A|B) = P(B|A) × P(A) / P(B)
Uses probability to predict class membership
Bayes' Theorem calculates the probability of a hypothesis being true given some observed evidence.
Prior Probability
Definition: The initial probability of an event occurring before any new evidence is
considered.
Think of it as "what we know before looking at any specific details"
Real-Life Example:
In a classroom of 100 students:
- 60 students pass
- 40 students fail
Prior Probabilities:
P(Pass) = 60/100 = 0.6
P(Fail) = 40/100 = 0.4
This is our "starting point" - before looking at any specific student's characteristics
Conditional Probability
Definition: The probability of an event occurring given that another event has already occurred.
Think of it as "what's the chance of A happening if we know B has happened"
Real-Life Example:
Out of the 60 students who passed:
- 45 attended all classes
- 15 missed some classes
Conditional Probability:
P(Attended All|Passed) = 45/60 = 0.75
This reads as "probability of attending all classes given that the student passed"
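Putting the two together, Bayes' Theorem lets us answer the reverse question: what is the probability that a student passed, given that they attended all classes? The short sketch below reuses the numbers above plus one assumed figure (how many failing students attended all classes), which the example does not give.

# Worked Bayes' Theorem calculation for the classroom example
p_pass = 60 / 100               # prior P(Pass)
p_fail = 40 / 100               # prior P(Fail)
p_att_given_pass = 45 / 60      # conditional P(Attended All | Pass)
p_att_given_fail = 10 / 40      # ASSUMED for illustration: 10 of the 40 failing students attended all classes

# Total probability of attending all classes
p_att = p_att_given_pass * p_pass + p_att_given_fail * p_fail   # 0.45 + 0.10 = 0.55

# Bayes' Theorem: P(Pass | Attended All) = P(Attended All | Pass) x P(Pass) / P(Attended All)
p_pass_given_att = p_att_given_pass * p_pass / p_att
print(round(p_pass_given_att, 3))   # about 0.818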
The "naive" assumption: Naive Bayes treats every feature as independent of the others once the class is known.
Reality: These features are likely correlated (an action movie that has one of them usually has all three).
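A minimal sketch of Naive Bayes in scikit-learn, showing how it turns evidence into class probabilities; the tiny pass/fail feature matrix is invented for illustration.

# Minimal sketch: Naive Bayes predicting class probabilities
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two illustrative features per student: hours studied, classes attended
X = np.array([[2, 5], [1, 3], [8, 14], [7, 15], [9, 13], [3, 4]])
y = np.array([0, 0, 1, 1, 1, 0])   # 0 = fail, 1 = pass

model = GaussianNB()
model.fit(X, y)

# Naive Bayes estimates a probability for each class, then picks the most likely one
print(model.predict_proba([[6, 12]]))   # [[P(fail), P(pass)]] for the new student
print(model.predict([[6, 12]]))         # predicted class label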
Decision Trees: Tree-based Classification
Basic Structure:
Root node (starting point)
Internal nodes (decisions)
Leaf nodes (final classifications)
Entropy measures how mixed (uncertain) a group is: Entropy = -Σ p_i × log2(p_i), where p_i is the proportion of class i in the group.
Low entropy = The group is mostly one thing (very certain).
High entropy = The group is an even mix (very uncertain).
Information Gain = the entropy before a split minus the weighted entropy after it.
High Information Gain = The split made the data much more certain (a better decision).
Low Information Gain = The split didn't make much of a difference (not a great decision).
Imagine a box of fruit. If the box is filled with
90% apples and 10% oranges, we are fairly
certain that if we pick a fruit randomly, it will be
an apple. This box has low entropy.
But if the box has an equal mix of apples and
oranges, we are uncertain about what we'll pick.
This box has high entropy.
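A short sketch that computes the entropy of these two boxes with the formula above.

# Entropy of the two fruit boxes
import math

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.9, 0.1]))   # 90% apples, 10% oranges -> about 0.47 (low entropy)
print(entropy([0.5, 0.5]))   # even mix -> 1.0 (high entropy, the maximum for two classes)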
The Decision Tree algorithm follows a greedy approach: at each node it chooses the split with the highest information gain, without looking ahead to future splits.
max_depth: Maximum depth of the tree (limits how deep the tree can grow).
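A minimal sketch of fitting a depth-limited decision tree with scikit-learn; the Iris dataset and the chosen max_depth are for illustration only.

# Minimal sketch: a depth-limited decision tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# max_depth=3 caps how many questions the tree may ask before reaching a leaf;
# criterion="entropy" makes the splits use the information gain idea described above
tree = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=1)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))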