
Introduction to Machine Learning: A Comprehensive Overview
Machine learning (ML) is a subfield of artificial intelligence (AI) that empowers systems to
learn from data, identify patterns, and make decisions with minimal human intervention.
Unlike traditional programming, where rules are coded explicitly, ML systems improve their
behavior as they gain experience with more data.

1. Types of Learning: A Taxonomy

Machine learning algorithms are broadly categorized into three fundamental learning
paradigms:

• Supervised Learning:
o This paradigm involves training a model on a labeled dataset, where each
data point is associated with a corresponding output (label).
o The objective is to learn a mapping function that can accurately predict the
output for unseen data.
o Examples: Classification (predicting categorical labels) and Regression
(predicting continuous values).
o Key algorithms: Linear Regression, Logistic Regression, Support Vector
Machines (SVMs), Decision Trees, Random Forests, Neural Networks.
• Unsupervised Learning:
o Here, the model is trained on an unlabeled dataset, where no explicit output
labels are provided.
o The goal is to discover hidden patterns, structures, or relationships within the
data.
o Examples: Clustering (grouping similar data points), Dimensionality Reduction
(reducing the number of features), Association Rule Mining (discovering
relationships between variables).
o Key algorithms: K-Means Clustering, Hierarchical Clustering, Principal
Component Analysis (PCA), Autoencoders.
• Reinforcement Learning:
o This paradigm involves training an agent to interact with an environment and
learn optimal behavior through trial and error.
o The agent receives rewards or penalties for its actions, and the goal is to
maximize the cumulative reward.
o Examples: Game playing (e.g., AlphaGo), Robotics, Autonomous driving.
o Key concepts: Agent, Environment, State, Action, Reward, Policy.
2. Preparing a Machine Learning Model: A Rigorous Approach

The process of building a robust ML model involves several critical steps:

• Data Collection:
o Gathering relevant and representative data is paramount.
o Data quality and quantity significantly impact model performance.
o Consider data sources, sampling methods, and potential biases.
• Data Preprocessing:
o Real-world data is often noisy, incomplete, and inconsistent.
o Preprocessing techniques are essential to clean and transform the data into a
suitable format for ML algorithms.
o Key steps:
▪ Data Cleaning: Handling missing values (imputation, deletion),
removing outliers, correcting inconsistencies.
▪ Data Transformation: Normalization (scaling data to a common
range), Standardization (centering data around zero mean and unit
variance), Encoding categorical variables (one-hot encoding, label
encoding).
• Dimensionality Reduction:
o High-dimensional data can lead to the "curse of dimensionality," where data
becomes sparse and models need far more samples to generalize well as the
number of features grows.
o Dimensionality reduction techniques aim to reduce the number of features
while preserving essential information.
o Principal Component Analysis (PCA):
▪ A linear dimensionality reduction technique that transforms data into
a new coordinate system, where the principal components capture
the maximum variance.
▪ It identifies orthogonal vectors (principal components) that represent
the directions of maximum variance in the data.
▪ PCA can be used for feature extraction and data visualization.
• Feature Engineering:
o Creating new features from existing ones to improve model performance.
o Involves domain expertise and creativity.
o Examples: Polynomial features, interaction features, extracting features from
text or images.
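
To make the scaling and dimensionality-reduction steps above concrete, here is a minimal sketch assuming scikit-learn and NumPy are available; the data is synthetic and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # 100 samples, 5 features
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)    # make one feature redundant

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)        # project onto the top-2 variance directions

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance each component retains
```
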
3. Modelling and Evaluation: Ensuring Model Validity

• Model Selection:
o Choosing an appropriate ML algorithm based on the problem type, data
characteristics, and performance requirements.
o Consider the trade-off between model complexity and interpretability.
• Training a Machine Learning Model:
o Feeding the preprocessed data into the chosen algorithm and adjusting the
model parameters to minimize the error.
o Techniques: Gradient Descent, Stochastic Gradient Descent, Backpropagation
(for neural networks).
• Underfitting and Overfitting:
o Underfitting: Occurs when the model is too simple to capture the underlying
patterns in the data. High bias and high error on both training and test data.
o Overfitting: Occurs when the model is too complex and memorizes the
training data, leading to poor generalization on unseen data. Low bias, low
error on train data, and high error on test data.
o Bias: The difference between the average prediction of our model and the
correct value. High bias models pay very little attention to the training data
and oversimplify the model.
o Variance: The variability of model predictions for a given data point. High
variance models pay a lot of attention to the training data and do not
generalize to data they have not seen before.
• Regularization: Techniques to prevent overfitting by adding a penalty term to the
model's loss function. Examples: L1 regularization (Lasso), L2 regularization (Ridge).
• Cross-Validation: A technique to assess model performance by partitioning the data
into multiple folds and training and evaluating the model on different combinations
of folds.
• Hyperparameter Tuning: Optimizing the model's hyperparameters (parameters that
are not learned from the data) to achieve the best performance.
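
A short sketch of cross-validation, regularization, and hyperparameter tuning working together, assuming scikit-learn; Ridge regression and the alpha grid are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 5-fold cross-validated search over the L2 regularization strength alpha
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)   # chosen alpha and its cross-validated MSE
```
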
4. Performance Evaluation Measures: Quantifying Model Effectiveness

• Classification:
o Accuracy: The proportion of correctly classified instances.
o Precision: The proportion of correctly predicted positive instances out of all
predicted positive instances.
o Recall (Sensitivity): The proportion of correctly predicted positive instances
out of all actual positive instances.
o F1-score: The harmonic mean of precision and recall.
o Confusion Matrix: A table that summarizes the performance of a
classification model.
o AUC-ROC: Area Under the Receiver Operating Characteristic curve.
• Regression:
o Mean Squared Error (MSE): The average squared difference between
predicted and actual values.
o Root Mean Squared Error (RMSE): The square root of MSE.
o Mean Absolute Error (MAE): The average absolute difference between
predicted and actual values.
o R-squared (Coefficient of Determination): The proportion of variance in the
dependent variable that is predictable from the independent variables.
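
A sketch of how these measures might be computed with scikit-learn's metrics module; the predictions below are made-up values purely for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, roc_auc_score,
                             mean_squared_error, mean_absolute_error, r2_score)

# Classification metrics on hypothetical predictions
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.3, 0.7])   # predicted probabilities for class 1

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(roc_auc_score(y_true, y_score))

# Regression metrics on hypothetical predictions
y_true_r = np.array([3.0, 5.0, 7.0])
y_pred_r = np.array([2.5, 5.5, 8.0])
mse = mean_squared_error(y_true_r, y_pred_r)
print(mse, np.sqrt(mse), mean_absolute_error(y_true_r, y_pred_r), r2_score(y_true_r, y_pred_r))
```
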
Supervised Learning: Classification
1. Introduction

Supervised learning is a paradigm in machine learning where an algorithm learns a mapping
function from input features to output labels based on labeled training data. The
"supervised" aspect refers to the presence of a "supervisor" (the labeled data) that guides
the learning process.

• Core Idea: The aim is to build a model that can accurately predict the output label
for unseen input data.
• Key Components:
o Training Data: Consists of input features (independent variables) and
corresponding output labels (dependent variables).
o Model: A mathematical representation of the relationship between input
features and output labels.
o Learning Algorithm: A procedure that adjusts the model's parameters to
minimize the difference between predicted and actual labels.
o Prediction: Applying the trained model to new, unseen data to generate
output labels.
• Distinction from Regression: Classification deals with predicting categorical labels
(e.g., spam/not spam, cat/dog), while regression deals with predicting continuous
values (e.g., house prices, temperature).

2. Examples of Supervised Learning (Classification)

• Email Spam Detection:
o Input features: Words in the email, sender's address, subject line.
o Output label: Spam or not spam.
• Image Classification:
o Input features: Pixel values of an image.
o Output label: Object category (e.g., cat, dog, car).
• Medical Diagnosis:
o Input features: Patient symptoms, test results.
o Output label: Disease presence or absence.
• Credit Risk Assessment:
o Input features: Financial history, credit score.
o Output label: Default or no default.
• Sentiment Analysis:
o Input features: Text from social media, reviews.
o Output label: Positive, negative, or neutral.
3. Classification Model

A classification model is a function that maps an input vector x to a discrete output label y.

• Mathematical Representation:
o y = f(x; θ), where:
▪ x is the input feature vector.
▪ y is the predicted class label.
▪ θ represents the model parameters.
▪ f() is the classification function.
• Types of Classification:
o Binary Classification: Two possible output labels (e.g., 0 or 1, true or false).
o Multiclass Classification: More than two possible output labels (e.g., cat, dog,
bird).
o Multilabel Classification: Multiple labels can be assigned to a single instance
(e.g., a movie can belong to multiple genres).
• Evaluation Metrics:
o Accuracy: The proportion of correctly classified instances.
o Precision: The proportion of correctly predicted positive instances out of all
predicted positives.
o Recall (Sensitivity): The proportion of correctly predicted positive instances
out of all actual positives.
o F1-score: The harmonic mean of precision and recall.
o Confusion Matrix: A table that summarizes the performance of a
classification model.

4. Common Classification Algorithms

4.1. k-Nearest Neighbors (kNN)

• Principle: Classifies a data point based on the majority class among its k nearest
neighbors in the feature space.
• Algorithm:
1. Calculate the distance between the query point and all training points.
2. Select the k nearest neighbors.
3. Assign the query point to the majority class among the k neighbors.
• Advantages: Simple to implement, non-parametric (no assumptions about data
distribution).
• Disadvantages: Computationally expensive for large datasets, sensitive to the choice
of k and distance metric, performs poorly in high-dimensional spaces.
• Considerations: Scaling of features is important.
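
A minimal kNN sketch, assuming scikit-learn and its built-in Iris dataset; note the scaling step before distance computation, per the consideration above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale first, since kNN distances are sensitive to feature ranges
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on held-out data
```
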
4.2. Decision Tree

• Principle: Partitions the feature space into regions based on a series of decision
rules.
• Algorithm:
1. Select the best feature and threshold to split the data based on information
gain or Gini impurity.
2. Recursively apply the splitting process to the resulting subsets.
3. Continue until a stopping criterion is met (e.g., maximum depth, minimum
samples per leaf).
• Advantages: Easy to interpret, handles both categorical and numerical features, non-
parametric.
• Disadvantages: Prone to overfitting, sensitive to small variations in the data, can
create complex trees that are difficult to interpret.
• Considerations: Tree pruning is crucial to prevent overfitting.
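
A minimal decision-tree sketch, assuming scikit-learn; max_depth and min_samples_leaf act as simple pre-pruning, in line with the consideration above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Gini impurity guides each split; depth and leaf-size limits keep the tree small
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X, y)
print(export_text(tree))   # the learned decision rules, one split per line
```
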

4.3. Random Forest

• Principle: An ensemble learning method that combines multiple decision trees to
improve prediction accuracy and reduce overfitting.
• Algorithm:
1. Create multiple bootstrap samples (random samples with replacement) from
the training data.
2. Train a decision tree on each bootstrap sample, using a random subset of
features at each split.
3. Aggregate the predictions of all trees by majority voting.
• Advantages: Robust to overfitting, high accuracy, handles high-dimensional data,
provides feature importance estimates.
• Disadvantages: Less interpretable than single decision trees, computationally
expensive.
• Considerations: Tuning the number of trees and the number of features to consider
at each split is important.
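
A random-forest sketch, assuming scikit-learn and its built-in breast-cancer dataset; n_estimators and max_features are the tuning knobs mentioned above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees; max_features controls the random feature subset per split
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))        # held-out accuracy
print(forest.feature_importances_[:5])     # impurity-based feature importance estimates
```
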

4.4. Support Vector Machines (SVM)

• Principle: Finds a hyperplane that maximizes the margin between different classes in
the feature space.
• Algorithm:
1. Map the input data to a high-dimensional feature space using a kernel
function (e.g., linear, polynomial, radial basis function).
2. Find the optimal hyperplane that separates the classes with the largest
margin.
3. Classify new data points based on their position relative to the hyperplane.
• Advantages: Effective in high-dimensional spaces, robust to outliers, versatile due to
different kernel functions.
• Disadvantages: Computationally expensive for large datasets, difficult to interpret,
sensitive to the choice of kernel and hyperparameters.
• Considerations: Kernel selection and hyperparameter tuning are critical.
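
An SVM sketch with an RBF kernel, assuming scikit-learn; the grid over C and gamma is illustrative rather than a recommended default.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# RBF-kernel SVM; C (margin softness) and gamma (kernel width) are the key hyperparameters
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(svm, {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```
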

4.5. Naïve Bayes Classifier

• Principle: Based on Bayes' theorem, assumes that features are conditionally
independent given the class label.
• Algorithm:
1. Estimate the prior probabilities of each class.
2. Estimate the conditional probabilities of each feature given each class.
3. Apply Bayes' theorem to calculate the posterior probability of each class for a
new data point.
4. Assign the data point to the class with the highest posterior probability.
• Advantages: Simple and efficient, works well with high-dimensional data, performs
well with categorical features.
• Disadvantages: Strong independence assumption may not hold in real-world data,
sensitive to feature dependencies.
• Considerations: Suitable for text classification and spam filtering.
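
A Naïve Bayes sketch for toy spam filtering, assuming scikit-learn; the texts and labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10am tomorrow",
         "free offer click now", "project report attached"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

# Bag-of-words counts + multinomial Naive Bayes, a common spam-filtering baseline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))   # likely [1], i.e. spam
```
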

4.6. XGBoost (Extreme Gradient Boosting)

• Principle: An optimized gradient boosting algorithm that combines multiple weak
learners (decision trees) to create a strong learner.
• Algorithm:
1. Train a decision tree on the residuals (errors) of the previous trees.
2. Add the new tree to the ensemble, weighted by a learning rate.
3. Repeat the process until a stopping criterion is met.
• Advantages: High accuracy, handles missing values, robust to outliers, efficient
implementation.
• Disadvantages: Complex to tune, prone to overfitting if not properly regularized.
• Considerations: Hyperparameter tuning is crucial for optimal performance.
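
A sketch using the xgboost package's scikit-learn-style wrapper (assuming xgboost is installed); the hyperparameter values are illustrative starting points, not tuned settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,      # number of boosted trees
    learning_rate=0.05,    # shrinks each tree's contribution
    max_depth=3,           # depth of each weak learner
    subsample=0.8,         # row subsampling, a simple form of regularization
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```
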

4.7. Ensemble Learning

• Principle: Combines multiple base learners to improve prediction accuracy and
robustness.
• Techniques:
o Bagging (Bootstrap Aggregating): Trains multiple base learners on bootstrap
samples and aggregates their predictions (e.g., Random Forest).
o Boosting: Trains base learners sequentially, with each learner focusing on the
errors of the previous learners (e.g., XGBoost, AdaBoost).
o Stacking: Trains multiple base learners and a meta-learner that combines
their predictions.
• Advantages: Improved accuracy, reduced variance, increased robustness.
• Disadvantages: Increased complexity, potential for overfitting if not properly
regularized.
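
A stacking sketch, assuming scikit-learn; the particular base learners and meta-learner chosen here are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Stacking: base learners' predictions become inputs to a logistic-regression meta-learner
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(stack, X, y, cv=5).mean())   # cross-validated accuracy of the ensemble
```
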
Further Considerations:

• Feature Engineering: Critical for model performance.


• Hyperparameter Tuning: Essential for optimizing model parameters.
• Cross-Validation: Used to evaluate model performance and prevent overfitting.
• Bias-Variance Tradeoff: Balancing model complexity and generalization ability.
• Data Preprocessing: Handling missing values, scaling features, and encoding
categorical variables.
Supervised Learning: Regression
1. Introduction

Supervised learning regression is a fundamental machine learning paradigm where the goal
is to predict a continuous output variable based on one or more input variables. Unlike
classification, which deals with categorical outputs, regression focuses on quantifying
relationships and estimating numerical values.

• Core Idea: To learn a function that maps input features to a continuous output,
minimizing the error between predicted and actual values.
• Key Components:
o Training Data: Consists of input features (independent variables) and
corresponding continuous output values (dependent variables).
o Model: A mathematical representation of the relationship between input
features and output values.
o Loss Function: A function that quantifies the error between predicted and
actual values (e.g., mean squared error).
o Optimization Algorithm: A procedure that adjusts the model's parameters to
minimize the loss function.
o Prediction: Applying the trained model to new, unseen data to generate
continuous output values.
• Distinction from Classification: Regression predicts continuous values, while
classification predicts categorical labels.

2. Examples of Regression

• House Price Prediction:
o Input features: Square footage, number of bedrooms, location, age.
o Output: House price.
• Stock Price Forecasting:
o Input features: Historical stock prices, trading volume, economic indicators.
o Output: Future stock price.
• Sales Prediction:
o Input features: Advertising spending, seasonality, competitor prices.
o Output: Sales volume.
• Temperature Forecasting:
o Input features: Historical temperature data, weather patterns, atmospheric
pressure.
o Output: Future temperature.
• Demand Forecasting:
o Input features: Historical sales data, promotional activities, economic factors.
o Output: Future demand.
3. Common Regression Algorithms

3.1. Simple Linear Regression

• Principle: Models the relationship between a single input variable and a continuous
output variable using a linear equation.
• Mathematical Representation:
o y = β₀ + β₁x + ε, where:
▪ y is the output variable.
▪ x is the input variable.
▪ β₀ is the intercept.
▪ β₁ is the slope.
▪ ε is the error term.
• Algorithm:
o Estimate the parameters β₀ and β₁ that minimize the sum of squared errors
(least squares method).
• Assumptions:
o Linearity: The relationship between x and y is linear.
o Independence: The error terms are independent.
o Homoscedasticity: The variance of the error terms is constant.
o Normality: The error terms are normally distributed.
• Evaluation Metrics:
o Mean Squared Error (MSE).
o Root Mean Squared Error (RMSE).
o R-squared (coefficient of determination).
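
A worked sketch of simple linear regression by least squares using only NumPy, on synthetic data where the true intercept is 2 and the true slope is 3.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)   # y = β₀ + β₁x + ε

# Least-squares estimates: β₁ = cov(x, y) / var(x), β₀ = mean(y) - β₁ * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x
mse = np.mean((y - y_hat) ** 2)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta0, beta1, mse, r2)   # estimates should be close to 2 and 3
```
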

3.2. Multiple Linear Regression

• Principle: Extends simple linear regression to model the relationship between
multiple input variables and a continuous output variable.
• Mathematical Representation:
o y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε, where:
▪ y is the output variable.
▪ x₁, x₂, ..., xₚ are the input variables.
▪ β₀, β₁, β₂, ..., βₚ are the coefficients.
▪ ε is the error term.
• Algorithm:
o Estimate the parameters β₀, β₁, β₂, ..., βₚ that minimize the sum of squared
errors.
• Assumptions:
o Same as simple linear regression, but extended to multiple input variables.
o No multicollinearity: The input variables are not highly correlated.
• Evaluation Metrics:
o MSE, RMSE, Adjusted R-squared.
o F-statistic and p-values for overall model significance.
o t-statistics and p-values for individual coefficients.
• Feature Selection:
o Techniques like forward selection, backward elimination, and stepwise
regression are used to select the most relevant input variables.
o Regularization methods like Ridge and Lasso regression can also be used for
feature selection and to prevent overfitting.
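
A sketch comparing ordinary least squares with Lasso for feature selection, assuming scikit-learn and a synthetic dataset where only a few features actually drive the target.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 10 features, only 3 of which are informative
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=1.0).fit(X_train, y_train)

print(r2_score(y_test, ols.predict(X_test)), r2_score(y_test, lasso.predict(X_test)))
print((lasso.coef_ != 0).sum(), "features kept by Lasso")   # L1 penalty drives weak coefficients to zero
```
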

3.3. Logistic Regression

• Principle: While the name includes "regression," logistic regression is primarily a
classification algorithm. However, it’s included here because it uses a linear model to
predict the probability of a binary outcome.
• Mathematical Representation:
o p(y=1|x) = 1 / (1 + exp(-z)), where:
▪ p(y=1|x) is the probability of the output being 1 given the input x.
▪ z = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ.
▪ exp() is the exponential function.
• Algorithm:
o Estimate the parameters β₀, β₁, β₂, ..., βₚ that maximize the likelihood of
observing the training data.
• Loss Function:
o Log loss (cross-entropy).
• Interpretation:
o The coefficients represent the change in the log-odds of the outcome for a
one-unit change in the input variable.
• Evaluation Metrics:
o Accuracy, precision, recall, F1-score, AUC-ROC.
• Important Note: Although it calculates a probability, the final result of logistic
regression is normally a binary classification, based on a threshold applied to the
probability.
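
A logistic-regression sketch, assuming scikit-learn; it shows the predicted probabilities and the 0.5 threshold that turns them into class labels, as the note above describes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]   # p(y=1 | x) from the sigmoid of the linear score
labels = (proba >= 0.5).astype(int)       # the final class comes from thresholding the probability
print(roc_auc_score(y_test, proba), (labels == y_test).mean())
```
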
3.4. Maximum Likelihood Estimation (MLE)

• Principle: A general statistical method for estimating the parameters of a model by
maximizing the likelihood of observing the training data.
• Application to Regression:
o MLE can be used to estimate the parameters of linear regression, logistic
regression, and other regression models.
o It involves defining a likelihood function that represents the probability of
observing the training data given the model parameters.
o The parameters are then estimated by finding the values that maximize the
likelihood function.
• Advantages:
o Provides consistent and efficient estimators under certain conditions.
o Applicable to a wide range of models.
• Considerations:
o Requires assumptions about the data distribution.
o Can be computationally intensive for complex models.
• Example for Linear Regression:
o Assuming the error terms are normally distributed, the likelihood function for
linear regression can be derived.
o Maximizing this likelihood function leads to the same parameter estimates as
the least squares method.
• Example for Logistic Regression:
o The logistic regression model uses MLE to determine the coefficients that
maximize the likelihood of the observed binary outcomes.
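
A sketch of MLE for logistic regression, assuming NumPy, SciPy, and scikit-learn (for synthetic data): the negative log-likelihood is minimized numerically, which is equivalent to maximizing the likelihood.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X1 = np.hstack([np.ones((len(X), 1)), X])   # prepend a column of ones for the intercept β₀

def neg_log_likelihood(beta):
    z = X1 @ beta                             # linear score β₀ + β₁x₁ + ... + βₚxₚ
    p = 1.0 / (1.0 + np.exp(-z))              # sigmoid gives p(y=1 | x)
    eps = 1e-12                               # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

result = minimize(neg_log_likelihood, x0=np.zeros(X1.shape[1]), method="BFGS")
print(result.x)   # MLE estimates of the coefficients
```
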

Further Considerations:

• Polynomial Regression: Models non-linear relationships by adding polynomial terms
to the linear model.
• Regularization (Ridge, Lasso, Elastic Net): Used to prevent overfitting and improve
generalization.
• Model Evaluation: Crucial for assessing model performance and selecting the best
model.
• Residual Analysis: Examining the residuals (errors) to check the assumptions of the
regression model.
• Feature Engineering: Creating new features from existing ones to improve model
performance.
• Cross-Validation: Used to estimate model performance and prevent overfitting.
• Handling Outliers: Outliers can significantly affect regression models.
• Time Series Regression: Models time-dependent data, accounting for temporal
dependencies.
Miscellaneous:
Introduction to Machine Learning

Different Types of Learning

• Supervised Learning: Learning from labeled data.


• Unsupervised Learning: Identifying patterns from unlabeled data.
• Reinforcement Learning: Learning through rewards and penalties.

Preparing a Machine Learning Model

• Data collection and preprocessing.


• Feature selection and engineering.

Data Preprocessing

• Handling missing values, data normalization, and standardization.

Dimensionality Reduction

• Reducing the number of features while retaining essential information.


• Principal Component Analysis (PCA): Converts correlated features into uncorrelated
principal components.

Feature Engineering

• Creating new features from raw data to improve model performance.

Modeling & Evaluation

• Choosing the right model for the problem.


• Splitting data into training and test sets.

Training a Machine Learning Model

• Optimization techniques like Gradient Descent.


• Hyperparameter tuning.

Underfitting vs. Overfitting

• Underfitting: Model is too simple, leading to poor performance.


• Overfitting: Model is too complex, memorizing noise instead of learning patterns.
Bias-Variance Tradeoff

• Finding the right balance between bias (error due to simplistic assumptions) and
variance (sensitivity to noise).

Performance Evaluation Measures

• Classification: Accuracy, Precision, Recall, F1-score.


• Regression: RMSE, MAE, R-squared.

Supervised Learning - Classification

Introduction

• Predicting categorical labels (e.g., spam detection, disease diagnosis).

Examples of Supervised Learning

• Email spam detection, fraud detection, image classification.

Classification Model

• Mapping input features to categorical outputs.

Common Classification Algorithms

• k-Nearest Neighbors (kNN):
o Stores training instances and predicts based on the majority vote of nearest
neighbors.
• Decision Tree:
o Tree-like structure to make decisions by splitting data.
• Random Forest:
o Ensemble of multiple decision trees to improve accuracy.
• Support Vector Machines (SVM):
o Finds the optimal hyperplane for classification.
• Naïve Bayes Classifier:
o Probabilistic classifier based on Bayes' theorem.
• XGBoost:
o Gradient boosting algorithm optimized for speed and performance.
• Ensemble Learning:
o Combining multiple models (Bagging, Boosting, Stacking) to improve results.
Supervised Learning - Regression

Introduction

• Predicting continuous values (e.g., predicting house prices, stock prices).

Examples of Regression

• Forecasting sales, predicting student grades.

Common Regression Algorithms

• Simple Linear Regression:
o Models the relationship between two variables using a straight line.
• Multiple Linear Regression:
o Extends simple linear regression to multiple independent variables.
• Logistic Regression:
o Used for binary classification (despite the name "regression").
• Maximum Likelihood Estimation (MLE):
o A statistical method to estimate parameters by maximizing the likelihood
function.
