ML notes
1. Comprehensive Overview
Machine learning (ML) is a subfield of artificial intelligence (AI) that empowers systems to
learn from data, identify patterns, and make decisions with minimal human intervention.
Unlike traditional programming, where behavior is coded explicitly, an ML system improves
its own performance as it is exposed to more data.
Machine learning algorithms are broadly categorized into three fundamental learning
paradigms:
• Supervised Learning:
o This paradigm involves training a model on a labeled dataset, where each
data point is associated with a corresponding output (label).
o The objective is to learn a mapping function that can accurately predict the
output for unseen data.
o Examples: Classification (predicting categorical labels) and Regression
(predicting continuous values).
o Key algorithms: Linear Regression, Logistic Regression, Support Vector
Machines (SVMs), Decision Trees, Random Forests, Neural Networks.
• Unsupervised Learning:
o Here, the model is trained on an unlabeled dataset, where no explicit output
labels are provided.
o The goal is to discover hidden patterns, structures, or relationships within the
data.
o Examples: Clustering (grouping similar data points), Dimensionality Reduction
(reducing the number of features), Association Rule Mining (discovering
relationships between variables).
o Key algorithms: K-Means Clustering, Hierarchical Clustering, Principal
Component Analysis (PCA), Autoencoders.
• Reinforcement Learning:
o This paradigm involves training an agent to interact with an environment and
learn optimal behavior through trial and error.
o The agent receives rewards or penalties for its actions, and the goal is to
maximize the cumulative reward.
o Examples: Game playing (e.g., AlphaGo), Robotics, Autonomous driving.
o Key concepts: Agent, Environment, State, Action, Reward, Policy.
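To make these concepts concrete, the sketch below runs tabular Q-learning (one classic
reinforcement learning algorithm) on a tiny corridor environment. The environment, reward
values, and hyperparameters are invented for illustration, not taken from these notes.

import random

# Hypothetical corridor: states 0..4; action 0 moves left, action 1 moves right.
# Reaching state 4 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)
alpha, gamma, epsilon = 0.1, 0.9, 0.3   # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: one row per state

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy policy: explore with probability epsilon, else act greedily
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        nxt, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt

print([max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)])  # learned policy

After training, the printed policy should choose action 1 (move right) in every state, which
is the behavior that maximizes cumulative reward.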
2. Preparing a Machine Learning Model: A Rigorous Approach
• Data Collection:
o Gathering relevant and representative data is paramount.
o Data quality and quantity significantly impact model performance.
o Consider data sources, sampling methods, and potential biases.
• Data Preprocessing:
o Real-world data is often noisy, incomplete, and inconsistent.
o Preprocessing techniques are essential to clean and transform the data into a
suitable format for ML algorithms.
o Key steps (a preprocessing sketch follows at the end of this section):
▪ Data Cleaning: Handling missing values (imputation, deletion),
removing outliers, correcting inconsistencies.
▪ Data Transformation: Normalization (scaling data to a common
range), Standardization (centering data around zero mean and unit
variance), Encoding categorical variables (one-hot encoding, label
encoding).
• Dimensionality Reduction:
o High-dimensional data can lead to the "curse of dimensionality": as the number
of features grows, data points become sparse and the amount of data needed for
a model to generalize well grows rapidly.
o Dimensionality reduction techniques aim to reduce the number of features
while preserving essential information.
o Principal Component Analysis (PCA):
▪ A linear dimensionality reduction technique that transforms data into
a new coordinate system, where the principal components capture
the maximum variance.
▪ It identifies orthogonal vectors (principal components) that represent
the directions of maximum variance in the data.
▪ PCA can be used for feature extraction and data visualization (a short
code sketch follows at the end of this section).
• Feature Engineering:
o Creating new features from existing ones to improve model performance.
o Involves domain expertise and creativity.
o Examples: Polynomial features, interaction features, extracting features from
text or images.
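To make the preprocessing steps above concrete, here is a minimal sketch using pandas and
scikit-learn; the toy DataFrame, column names, and the choice of mean imputation are
illustrative assumptions.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with a missing value and a categorical column (values invented)
df = pd.DataFrame({"age": [25, None, 47, 31],
                   "income": [40000, 52000, 88000, 61000],
                   "city": ["paris", "lyon", "paris", "nice"]})

preprocess = ColumnTransformer([
    # Numeric columns: fill missing values with the mean, then standardize
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    # Categorical column: one-hot encode into indicator columns
    ("cat", OneHotEncoder(), ["city"]),
])
X = preprocess.fit_transform(df)   # clean numeric matrix, ready for an ML algorithm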
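Likewise, a minimal PCA sketch, assuming scikit-learn; the randomly generated 10-feature
dataset and the choice of 2 components are illustrative.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))         # invented dataset: 200 samples, 10 features

pca = PCA(n_components=2)              # keep the 2 orthogonal directions of largest variance
X_2d = pca.fit_transform(X)            # project the data onto the principal components
print(pca.explained_variance_ratio_)   # fraction of total variance each component retains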
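Finally, a short feature-engineering sketch: scikit-learn's PolynomialFeatures generates the
polynomial and interaction features mentioned above (the sample values are invented).

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                 # one sample with two features, x1 and x2
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))               # [[2, 3, 4, 6, 9]]: x1, x2, x1^2, x1*x2, x2^2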
3. Modelling and Evaluation: Ensuring Model Validity
• Model Selection:
o Choosing an appropriate ML algorithm based on the problem type, data
characteristics, and performance requirements.
o Consider the trade-off between model complexity and interpretability.
• Training a Machine Learning Model:
o Feeding the preprocessed data into the chosen algorithm and adjusting the
model parameters to minimize the error.
o Techniques: Gradient Descent, Stochastic Gradient Descent, Backpropagation
(for neural networks); a gradient descent sketch appears at the end of this
section.
• Underfitting and Overfitting:
o Underfitting: Occurs when the model is too simple to capture the underlying
patterns in the data, giving high error on both training and test data (high bias).
o Overfitting: Occurs when the model is too complex and memorizes the training
data: error is low on the training data but high on the test data, i.e., the
model generalizes poorly to unseen data (high variance).
• Bias: The difference between the average prediction of the model and the correct
value. High-bias models pay very little attention to the training data and
oversimplify.
• Variance: The variability of model predictions for a given data point. High-variance
models pay too much attention to the training data and do not generalize to data
they have not seen before.
• Regularization: Techniques to prevent overfitting by adding a penalty term to the
model's loss function. Examples: L1 regularization (Lasso), L2 regularization (Ridge).
• Cross-Validation: A technique to assess model performance by partitioning the data
into multiple folds and training and evaluating the model on different combinations
of folds.
• Hyperparameter Tuning: Optimizing the model's hyperparameters (parameters that
are not learned from the data) to achieve the best performance.
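To illustrate training by gradient descent, the sketch below fits a one-variable linear model
by repeatedly stepping opposite the gradient of the mean squared error; the data, learning
rate, and iteration count are illustrative assumptions.

import numpy as np

# Invented data roughly following y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=100)

w, b, lr = 0.0, 0.0, 0.01              # parameters and learning rate
for _ in range(2000):
    y_hat = w * x + b
    # Gradients of MSE = mean((y_hat - y)^2) with respect to w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= lr * grad_w                   # step against the gradient
    b -= lr * grad_b
print(w, b)                            # should approach 2 and 1

Stochastic gradient descent follows the same update but estimates the gradient from one
sample (or a small batch) per step instead of the full dataset.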
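A short regularization sketch, assuming scikit-learn; the synthetic dataset and the alpha
penalty strengths are illustrative. It contrasts L2 (Ridge), which shrinks all coefficients,
with L1 (Lasso), which drives many coefficients exactly to zero.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 5 of 50 features actually matter
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: zeroes out uninformative coefficients
print((lasso.coef_ == 0).sum(), "of", len(lasso.coef_), "coefficients zeroed by L1")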
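Cross-validation and hyperparameter tuning, sketched with scikit-learn; the SVC model, the
parameter grid, and the Iris dataset are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: 5 train/test splits, one accuracy score per fold
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores.mean())

# Grid search: try every combination of these hyperparameters, each scored by CV
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)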
4. Performance Evaluation Measures: Quantifying Model Effectiveness
• Classification:
o Accuracy: The proportion of correctly classified instances.
o Precision: The proportion of correctly predicted positive instances out of all
predicted positive instances.
o Recall (Sensitivity): The proportion of correctly predicted positive instances
out of all actual positive instances.
o F1-score: The harmonic mean of precision and recall.
o Confusion Matrix: A table that summarizes the performance of a
classification model.
o AUC-ROC: Area Under the Receiver Operating Characteristic curve.
• Regression:
o Mean Squared Error (MSE): The average squared difference between
predicted and actual values.
o Root Mean Squared Error (RMSE): The square root of MSE.
o Mean Absolute Error (MAE): The average absolute difference between
predicted and actual values.
o R-squared (Coefficient of Determination): The proportion of variance in the
dependent variable that is predictable from the independent variables.
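The sketch below computes the classification and regression metrics above with
scikit-learn; all true and predicted values are invented for illustration.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix,
                             mean_squared_error, mean_absolute_error, r2_score)

# Classification metrics on invented binary labels
y_true, y_pred = [1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))     # correct predictions / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class

# Regression metrics on invented continuous values
y_true_r, y_pred_r = [3.0, 5.0, 2.0], [2.5, 5.5, 2.0]
mse = mean_squared_error(y_true_r, y_pred_r)
print(mse, mse ** 0.5)                            # MSE and RMSE
print(mean_absolute_error(y_true_r, y_pred_r))    # MAE
print(r2_score(y_true_r, y_pred_r))               # R-squared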
Supervised Learning: Classification
1. Introduction
• Core Idea: The aim is to build a model that can accurately predict the output label
for unseen input data.
• Key Components:
o Training Data: Consists of input features (independent variables) and
corresponding output labels (dependent variables).
o Model: A mathematical representation of the relationship between input
features and output labels.
o Learning Algorithm: A procedure that adjusts the model's parameters to
minimize the difference between predicted and actual labels.
o Prediction: Applying the trained model to new, unseen data to generate
output labels.
• Distinction from Regression: Classification deals with predicting categorical labels
(e.g., spam/not spam, cat/dog), while regression deals with predicting continuous
values (e.g., house prices, temperature).
2. Classification Model
A classification model is a function that maps an input vector x to a discrete output label y.
• Mathematical Representation:
o y = f(x; θ), where:
▪ x is the input feature vector.
▪ y is the predicted class label.
▪ θ represents the model parameters.
▪ f() is the classification function.
• Types of Classification:
o Binary Classification: Two possible output labels (e.g., 0 or 1, true or false).
o Multiclass Classification: More than two possible output labels (e.g., cat, dog,
bird).
o Multilabel Classification: Multiple labels can be assigned to a single instance
(e.g., a movie can belong to multiple genres).
• Evaluation Metrics:
o Accuracy: The proportion of correctly classified instances.
o Precision: The proportion of correctly predicted positive instances out of all
predicted positives.
o Recall (Sensitivity): The proportion of correctly predicted positive instances
out of all actual positives.
o F1-score: The harmonic mean of precision and recall.
o Confusion Matrix: A table that summarizes the performance of a
classification model.
4.1. K-Nearest Neighbors (KNN)
• Principle: Classifies a data point based on the majority class among its k nearest
neighbors in the feature space.
• Algorithm:
1. Calculate the distance between the query point and all training points.
2. Select the k nearest neighbors.
3. Assign the query point to the majority class among the k neighbors.
• Advantages: Simple to implement, non-parametric (no assumptions about data
distribution).
• Disadvantages: Computationally expensive for large datasets, sensitive to the choice
of k and distance metric, performs poorly in high-dimensional spaces.
• Considerations: Scaling of features is important.
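A minimal KNN sketch, assuming scikit-learn; per the consideration above, features are
scaled before distances are computed (the Iris dataset and k = 5 are illustrative).

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features so no single feature dominates the distance metric
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)
print(knn.predict(X[:3]))   # majority vote among each point's 5 nearest neighbors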
4.2. Decision Tree
• Principle: Partitions the feature space into regions based on a series of decision
rules.
• Algorithm:
1. Select the best feature and threshold to split the data based on information
gain or Gini impurity.
2. Recursively apply the splitting process to the resulting subsets.
3. Continue until a stopping criterion is met (e.g., maximum depth, minimum
samples per leaf).
• Advantages: Easy to interpret, handles both categorical and numerical features, non-
parametric.
• Disadvantages: Prone to overfitting, sensitive to small variations in the data, can
create complex trees that are difficult to interpret.
• Considerations: Tree pruning is crucial to prevent overfitting.
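A minimal decision-tree sketch, assuming scikit-learn; max_depth acts as a simple
pre-pruning stopping criterion against overfitting (dataset and depth are illustrative).

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Split on Gini impurity; stop once the tree reaches depth 3
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))   # the learned decision rules, one split per line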
4.3. Support Vector Machine (SVM)
• Principle: Finds a hyperplane that maximizes the margin between different classes in
the feature space.
• Algorithm:
1. Map the input data to a high-dimensional feature space using a kernel
function (e.g., linear, polynomial, radial basis function).
2. Find the optimal hyperplane that separates the classes with the largest
margin.
3. Classify new data points based on their position relative to the hyperplane.
• Advantages: Effective in high-dimensional spaces, robust to outliers, versatile due to
different kernel functions.
• Disadvantages: Computationally expensive for large datasets, difficult to interpret,
sensitive to the choice of kernel and hyperparameters.
• Considerations: Kernel selection and hyperparameter tuning are critical.
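A minimal SVM sketch, assuming scikit-learn; the RBF kernel and the C value are illustrative
hyperparameter choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps inputs to a high-dimensional space;
# C trades off margin width against training errors
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))   # accuracy on held-out data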
Supervised Learning: Regression
1. Introduction
Supervised learning regression is a fundamental machine learning paradigm where the goal
is to predict a continuous output variable based on one or more input variables. Unlike
classification, which deals with categorical outputs, regression focuses on quantifying
relationships and estimating numerical values.
• Core Idea: To learn a function that maps input features to a continuous output,
minimizing the error between predicted and actual values.
• Key Components:
o Training Data: Consists of input features (independent variables) and
corresponding continuous output values (dependent variables).
o Model: A mathematical representation of the relationship between input
features and output values.
o Loss Function: A function that quantifies the error between predicted and
actual values (e.g., mean squared error).
o Optimization Algorithm: A procedure that adjusts the model's parameters to
minimize the loss function.
o Prediction: Applying the trained model to new, unseen data to generate
continuous output values.
• Distinction from Classification: Regression predicts continuous values, while
classification predicts categorical labels.
2. Examples of Regression
2.1. Simple Linear Regression
• Principle: Models the relationship between a single input variable and a continuous
output variable using a linear equation.
• Mathematical Representation:
o y = β₀ + β₁x + ε, where:
▪ y is the output variable.
▪ x is the input variable.
▪ β₀ is the intercept.
▪ β₁ is the slope.
▪ ε is the error term.
• Algorithm:
o Estimate the parameters β₀ and β₁ that minimize the sum of squared errors
(least squares method).
• Assumptions:
o Linearity: The relationship between x and y is linear.
o Independence: The error terms are independent.
o Homoscedasticity: The variance of the error terms is constant.
o Normality: The error terms are normally distributed.
• Evaluation Metrics:
o Mean Squared Error (MSE).
o Root Mean Squared Error (RMSE).
o R-squared (coefficient of determination).
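A minimal simple-linear-regression sketch using the closed-form least-squares estimates of
β₀ and β₁; the data are invented to roughly follow y = 3x + 2.

import numpy as np

# Invented data: y ≈ 3x + 2 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=50)
y = 3 * x + 2 + rng.normal(scale=0.3, size=50)

# Least squares: beta1 = cov(x, y) / var(x); beta0 = mean(y) - beta1 * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x
mse = np.mean((y - y_hat) ** 2)                                  # mean squared error
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # R-squared
print(beta0, beta1, mse, r2)   # intercept near 2, slope near 3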
Further Considerations:
• Data Preprocessing, Dimensionality Reduction, and Feature Engineering (Section 2)
apply to regression just as they do to classification.
• Bias-Variance Trade-off: Finding the right balance between bias (error due to
simplistic assumptions) and variance (sensitivity to noise).