Machine Learning – I
Common examples:
For regression: Mean Squared Error (MSE)
$L(y, \hat{y}) = (y - \hat{y})^2$
For classification: 0-1 loss (indicates whether the prediction is correct)
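As a quick illustration (not from the original slides), both losses can be written as one-line Python functions:
def squared_error(y_true, y_pred):
    # Squared-error loss for regression
    return (y_true - y_pred) ** 2

def zero_one_loss(y_true, y_pred):
    # 0-1 loss for classification: 0 if the prediction is correct, 1 otherwise
    return 0 if y_true == y_pred else 1

print(squared_error(250000, 240000))  # 100000000
print(zero_one_loss(1, 0))            # 1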
4. Risk Function
To assess the overall performance of a model across all possible inputs, the
expected risk (also called the true risk) is computed: the average loss over
the entire input space.
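Using the same notation as the empirical-risk formula in the next section, the expected risk can be written as
$$R(f) = \mathbb{E}_{(x,y) \sim D}\big[L(y, f(x))\big]$$
where $D$ is the (unknown) distribution that generates the data.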
5. Empirical Risk Minimization (ERM)
Empirical risk minimization is a principle where a model minimizes the
average loss (or risk) over a finite dataset, i.e., the given training set.
Since we cannot compute the true risk (which involves all possible data
points in the universe), we approximate it by minimizing the risk on the
given dataset.
$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$$
where $L(y_i, f(x_i))$ is the loss function measuring the difference between
the true output $y_i$ and the predicted output $f(x_i)$, and $n$ is the
number of training samples.
Visualizing ERM
• Imagine a target (the true function) and a set of darts (models).
• Goal: to throw a dart as close to the center of the target as possible.
• The Target: Represents the true underlying function.
• The Darts: Represent different models, each with its own set of parameters.
• The Bullseye: Represents the optimal model, which minimizes the loss.
If the dataset contains the following three house prices and predicted prices:
House | True Price ($) | Predicted Price ($) | Loss
A | 250,000 | 240,000 | 100,000,000
B | 300,000 | 310,000 | 100,000,000
C | 200,000 | 180,000 | 400,000,000
$$R_{emp}(f) = \frac{1}{3}\left(100{,}000{,}000 + 100{,}000{,}000 + 400{,}000{,}000\right) = 200{,}000{,}000$$
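A short Python sketch (illustrative, not part of the original example) that reproduces this calculation:
# True and predicted house prices from the table above
y_true = [250000, 300000, 200000]
y_pred = [240000, 310000, 180000]

# Empirical risk under squared-error loss: the average of (y - y_hat)^2
losses = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
R_emp = sum(losses) / len(losses)
print(R_emp)  # 200000000.0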
Scenario: Predicting whether a customer will
buy a product based on age and income.
1. Input features (X):
• Age of the customer
• Income of the customer
2. Output (Y):
• Binary classification (0 = No, 1 = Yes)
3. Loss Function:
Let’s use the 0-1 loss function, where the loss is 0 if the prediction is correct
and 1 if it is incorrect:
$$L(y_i, h(x_i)) = \begin{cases} 0 & \text{if } y_i = h(x_i) \\ 1 & \text{if } y_i \neq h(x_i) \end{cases}$$
4. Dataset (D): Suppose the dataset contains 5 customer records:
Age (X1) | Income (X2) | Will Buy (Y)
25 | 30k | 1
35 | 50k | 0
45 | 70k | 0
30 | 40k | 1
40 | 60k | 1
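A minimal Python sketch of the empirical risk under 0-1 loss on this dataset; the age-threshold classifier below is a hypothetical rule used only for illustration:
# Dataset: (age, income in $1000s) and whether the customer will buy
X = [(25, 30), (35, 50), (45, 70), (30, 40), (40, 60)]
y = [1, 0, 0, 1, 1]

# Hypothetical classifier: predict "will buy" (1) if age < 38, else 0
def h(age, income):
    return 1 if age < 38 else 0

# Empirical risk = fraction of misclassified customers
losses = [0 if h(a, inc) == label else 1 for (a, inc), label in zip(X, y)]
print(sum(losses) / len(losses))  # 0.4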
The example data:
Size (sq ft) | Price ($)
1000 | 150,000
1500 | 200,000
2000 | 250,000
• A decision tree model might split size into ranges like:
• If size < 1250, price = $150,000
• If 1250 ≤ size < 1750, price = $200,000
• If size ≥ 1750, price = $250,000
Both models minimize empirical risk with different inductive biases.
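An illustrative scikit-learn sketch (not from the original slides) of fitting a small decision tree to the three size/price points above:
from sklearn.tree import DecisionTreeRegressor

# Size (sq ft) and price ($) from the example data
X = [[1000], [1500], [2000]]
y = [150000, 200000, 250000]

# A depth-2 tree separates the three sizes into piecewise-constant price ranges
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, y)

print(tree.predict([[1100], [1600], [1900]]))  # [150000. 200000. 250000.]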
8. Regularization
• Regularization is a technique used to control overfitting by adding a penalty term to the loss
function.
• It ensures that the model does not become too complex.
• L1 Regularization (Lasso): Adds the sum of absolute values of weights to the loss.
$$\text{Loss}_{L1} = \text{Loss} + \lambda \sum_{j=1}^{p} |w_j|$$
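A minimal scikit-learn sketch of L1 regularization in practice, using made-up data:
from sklearn.linear_model import Lasso

# Toy data: the target depends only on the first feature
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7]]
y = [2, 4, 6, 8, 10]

# alpha controls the strength of the L1 penalty
model = Lasso(alpha=0.1)
model.fit(X, y)

# The L1 penalty shrinks the weight of the irrelevant second feature toward zero
print(model.coef_)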
PAC (Probably Approximately Correct) Learning
A hypothesis $h$ is probably approximately correct if
$$\Pr\big(P(c(x) \neq h(x)) \le \epsilon\big) \ge 1 - \delta$$
where $c$ is the true concept being learned.
Example - Spam email classification
Scenario: Suppose you want to classify whether an email is spam or
not.
• You collect 1,000 emails and train a classifier.
• Your goal is for the classifier to have an error rate of less than 5%
(ϵ=0.05) with 95% confidence (1−δ=0.95).
• If the model achieves this performance on the training set, PAC
learning guarantees that it will perform similarly well on new, unseen
emails, provided enough data is used.
PAC Learning Guarantees: Key Points
• Error Bound (ϵ): The maximum allowed error on unseen data.
• Confidence Level (1−δ): The probability that the hypothesis will meet
the error bound.
• Sample Complexity: The number of samples required to achieve the
desired ϵ and δ.
• Hypothesis Space (H): The set of all possible functions the model can
choose from.
PAC Learnability
• A concept C is PAC-learnable if there exists an algorithm that can,
with high probability (1−δ), output a hypothesis h such that the error
of h is at most ϵ, given a sufficient number of samples.
• Error Definition: The true error of a hypothesis h is:
$$\text{Error}(h) = P_{(x,y) \sim D}\big(h(x) \neq y\big)$$
where $(x, y)$ is drawn from the data distribution $D$, and $h(x) \neq y$
indicates a misclassification.
Sample Complexity
• The number of training examples m required for PAC learning
depends on:
1. The size of the hypothesis space (H).
2. The error tolerance (ϵ).
3. The confidence level (δ).
$$m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$$
Where ∣H∣ is the size of the hypothesis space. Larger hypothesis spaces
require more samples to ensure a good generalization.
Example of PAC Learning (SPAM Classification)
• Hypothesis Space: Suppose you are using a linear classifier. The
hypothesis space H consists of all possible linear decision boundaries.
• Error Tolerance (ϵ): You want the model to classify with less than 5%
error.
• Confidence (1−δ): You want to be 95% confident in the result.
• Using the PAC formula for sample complexity:
• Suppose ∣H∣=10^6. To achieve ϵ=0.05 and δ=0.05:
$$m \ge \frac{1}{0.05}\left(\ln 10^6 + \ln\frac{1}{0.05}\right) \approx 337 \text{ samples}$$
This means you need at least about 337 labeled samples to ensure the
classifier performs within the desired bounds.
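A quick Python check of this bound (illustrative):
import math

# PAC sample-complexity bound: m >= (1/eps) * (ln|H| + ln(1/delta))
H_size = 10**6
eps = 0.05
delta = 0.05

m = (1 / eps) * (math.log(H_size) + math.log(1 / delta))
print(math.ceil(m))  # 337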
Why PAC Learning is Important?
• Theoretical Guarantees: PAC learning provides a framework to
understand the conditions under which a model will generalize well.
• Guidance for Practice: It highlights the trade-off between model
complexity and data requirements.
• Robustness: It allows for the evaluation of different algorithms and
their ability to generalize.
Example 2: Medium Built Person
• Training Set: Height and Weight of m individuals.
• Target: Whether the person is medium built or not.
Building Good Training Sets
1. Data Preprocessing
• Data preprocessing is the critical first step in machine learning, where
raw data is prepared to ensure that it is clean, structured, and ready
for analysis. It includes cleaning the data, transforming it, and
handling inconsistencies.
Why is it Important?
• Ensures better model accuracy and faster convergence.
• Reduces noise and irrelevant information that may mislead the
model.
Steps in Data Preprocessing:
• Removing Duplicates: Identical rows can distort patterns.
• Handling Missing Values: Replace missing data with:
• Mean/Median/Mode (numerical).
• Most frequent value (categorical).
• Outlier Detection: Identify and treat outliers using statistical methods
(e.g., IQR or z-scores).
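An illustrative pandas sketch (made-up data) of two of these steps: dropping duplicates and flagging outliers with the IQR rule:
import pandas as pd

# Made-up data with a duplicate row and one extreme value
df = pd.DataFrame({"Size (sq ft)": [1200, 1200, 1500, 1800, 9000]})

# Remove duplicate rows
df = df.drop_duplicates()

# Flag outliers: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df["Size (sq ft)"].quantile(0.25)
q3 = df["Size (sq ft)"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["Size (sq ft)"] < q1 - 1.5 * iqr) | (df["Size (sq ft)"] > q3 + 1.5 * iqr)]
print(outliers)  # flags the 9000 sq ft row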
Example
Scenario: Predicting house prices using a dataset
with missing values.
Size (sq ft) | Bedrooms | Price ($)
1200 | 3 | 300,000
1500 | NaN | 400,000
NaN | 4 | 500,000
1800 | 4 | NaN
Code for Handling Missing Data:
import pandas as pd
from sklearn.impute import SimpleImputer

# Example dataset with missing values (None becomes NaN in the DataFrame)
data = {
    "Size (sq ft)": [1200, 1500, None, 1800],
    "Bedrooms": [3, None, 4, 4],
    "Price ($)": [300000, 400000, 500000, None]
}
df = pd.DataFrame(data)

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
Code for Encoding Categorical Data:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Categorical data
colors = ['Red', 'Blue', 'Green', 'Blue']

# One-Hot Encoding: each category becomes its own binary column
one_hot_encoder = OneHotEncoder(sparse_output=False)
colors_one_hot = one_hot_encoder.fit_transform([[c] for c in colors])

# Label Encoding: each category is mapped to an integer
label_encoder = LabelEncoder()
colors_label = label_encoder.fit_transform(colors)

print(colors_one_hot)
print(colors_label)
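With this input, LabelEncoder assigns integers in alphabetical order (Blue=0, Green=1, Red=2), so colors_label is [2, 0, 1, 0]; the one-hot matrix has one binary column per colour in the same order.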
# Sample dataset
X = [[1], [2], [3], [4], [5]]
y = [0, 1, 0, 1, 0]
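A common step when building training sets is splitting such a sample into training and test subsets; a minimal sketch assuming scikit-learn's train_test_split (this may not be what the original fragment was leading to):
from sklearn.model_selection import train_test_split

# Sample dataset from above
X = [[1], [2], [3], [4], [5]]
y = [0, 1, 0, 1, 0]

# Hold out 40% of the samples for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
print(X_train, y_train)
print(X_test, y_test)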
Code for Dimensionality Reduction with PCA:
from sklearn.decomposition import PCA

# Example data: four samples with two features each
X = [[1, 2], [3, 4], [5, 6], [7, 8]]

# Project the data onto a single principal component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print("Original Data:\n", X)
print("\nReduced Data:\n", X_reduced)