UNIT I Introduction To Machine Learning
Introduction to Machine Learning
KUNAL AHIRE
What is Machine Learning?
Originally developed as a subfield of Artificial Intelligence (AI), one of the goals
behind machine learning was to replace the need for developing computer programs
"manually."
Considering that programs are being developed to automate processes, we can think
of machine learning as the process of "automating automation."
In other words, machine learning lets computers "create" programs themselves (often,
the intent behind these programs is making predictions).
Put differently, machine learning is the process of turning data into programs.
Machine learning is the field of study that gives computers the ability to learn
without being explicitly programmed.
Machine Learning Vs Classic
Programming
Tom Mitchell's description
A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E - Tom Mitchell, Machine Learning
Professor at Carnegie Mellon University
To illustrate this quote with an example, consider the problem of recognizing
handwritten digits:
Task T: classifying handwritten digits from images
Performance measure P: percentage of digits classified correctly
Training experience E: a dataset of digit images with given classifications, e.g., MNIST
Tom Mitchell's description
Classic and Adaptive Machines
Since time immemorial, human beings have built tools and machines to
simplify their work and reduce the overall effort needed to complete many
different tasks.
A machine is immediately considered useful and destined to be continuously
improved if its users can easily understand what tasks can be completed with
less effort or completely automatically.
In the latter case, some intelligence seems to appear next to cogs, wheels, or
axles. So a further step can be added to our evolution list: automatic machines,
built (nowadays we'd say programmed) to accomplish specific goals by
transforming energy into work.
Classic and Adaptive Machines
In the following figure, there's a generic representation of a classical system
that receives some input values, processes them, and produces output results:
Classic and Adaptive Machines
Programmable computers are widespread, flexible, and more and more
powerful instruments; moreover, the diffusion of the internet allowed us to
share software applications and related information with minimal effort.
The word-processing software that I'm using, my email client, a web browser,
and many other common tools running on the same machine are all examples
of such flexibility.
It's undeniable that the IT revolution dramatically changed our lives and
sometimes improved our daily jobs, but without machine learning (and all its
applications), there are still many tasks that seem far outside the computer's domain.
Classic and Adaptive Machines
Spam filtering, Natural Language Processing, visual tracking with a webcam or
a smartphone, and predictive analysis are only a few applications that
revolutionized human-machine interaction and increased our expectations.
In many cases, they transformed our electronic tools into actual cognitive
extensions that are changing the way we interact with many daily situations.
They achieved this goal by filling the gap between human perception, language,
reasoning, and models on the one hand and artificial instruments on the other.
Classic and Adaptive Machines
Here's a schematic representation of an adaptive system:
Machine Learning Cycle
Problem Understanding
Data Collection
Data Preprocessing
Model Selection
Model Building
Model Evaluation
Model Tuning
Deployment
Monitoring
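To make the middle of this cycle concrete, here is a minimal sketch of the preprocessing, model building, and evaluation steps using scikit-learn. The Iris dataset and logistic regression model are illustrative assumptions, not part of the cycle itself.

# Minimal sketch of the data-to-evaluation part of the ML cycle (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: load a ready-made dataset.
X, y = load_iris(return_X_y=True)

# Data preprocessing: hold out a test set and scale the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model selection and building: fit a simple classifier.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation: measure accuracy on unseen data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))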
Applications of Machine Learning
After the field of machine learning was "founded" more than half a century ago, we
can now find applications of machine learning in almost every aspect of our lives.
Popular applications of machine learning include the following:
Email spam detection
Face detection and matching (e.g., iPhone X)
Web search (e.g., DuckDuckGo, Bing, Google)
Sports predictions
Post office (e.g., sorting letters by zip codes)
ATMs (e.g., reading checks)
Applications of Machine Learning
Credit card fraud
Stock predictions
Smart assistants (Apple Siri, Amazon Alexa, . . . )
Product recommendations (e.g., Netflix, Amazon)
Self-driving cars (e.g., Uber, Tesla)
Language translation (Google Translate)
Sentiment analysis
Drug design
Medical diagnoses
Types of machine learning
algorithms
Regardless of whether the learner is a human or machine, the basic learning process is similar.
It can be divided into four interrelated components:
Data storage utilizes observation, memory, and recall to provide a factual basis for further
reasoning.
Abstraction involves the translation of stored data into broader representations and concepts.
Generalization uses abstracted data to create knowledge and inferences that drive action in
new contexts.
Evaluation provides a feedback mechanism to measure the utility of learned knowledge and
inform potential improvements.
Machine learning algorithms are divided into categories according to their purpose.
Types of machine learning
algorithms
Main categories are
• Supervised learning (predictive model, "labeled" data)
• classification (Logistic Regression, Decision Tree, KNN, Random Forest, SVM, Naive Bayes, etc.)
• numeric prediction (Linear Regression, KNN, Gradient Boosting & AdaBoost, etc.)
• Unsupervised learning (descriptive model, "unlabeled" data)
• clustering (K-Means, Mean-shift, DBSCAN, etc.)
• association rule learning (Apriori, Eclat, FP-growth, etc.)
• Reinforcement learning. Using this approach, the machine is trained to make specific decisions: it is
exposed to an environment where it trains itself continually using trial and error, learns from past
experience, and tries to capture the best possible knowledge to make accurate business decisions.
Example of reinforcement learning: the Markov Decision Process.
Types of machine learning
algorithms
Supervised Machine Learning
Supervised learning is when a model is trained on a “labelled dataset.” Labelled
datasets have both input and output parameters.
In supervised learning, algorithms learn to map inputs to their correct outputs.
Both the training and validation datasets are labelled.
Supervised Machine Learning
Example:
Consider a scenario where you have to build an image classifier to differentiate
between cats and dogs.
If you feed the datasets of dogs and cats’ labelled images to the algorithm, the
machine will learn to classify between a dog or a cat from these labeled images.
When we input new dog or cat images that the model has never seen before, it will
use the learned model to predict whether it is a dog or a cat.
This is how supervised learning works; this particular example is an image
classification task.
Supervised Machine Learning
There are two main categories of supervised learning that are mentioned
below:
Classification
Regression
Supervised Machine Learning
Classification
Classification deals with predicting categorical target variables, which
represent discrete classes or labels.
For instance, classifying emails as spam or not spam, or predicting
whether a patient has a high risk of heart disease.
Classification algorithms learn to map the input features to one of the
predefined classes.
Supervised Machine Learning
Here are some classification algorithms:
Logistic Regression
Support Vector Machine
Random Forest
Decision Tree
K-Nearest Neighbors (KNN)
Naive Bayes
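As a brief, hedged illustration of how one of these classifiers is used in practice, the sketch below trains a logistic regression model on scikit-learn's built-in breast cancer dataset; the dataset and library are assumptions of the example, not requirements of classification itself.

# Binary classification sketch: predict malignant vs. benign from numeric features.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)        # could equally be SVM, KNN, etc.
clf.fit(X_train, y_train)                      # learn the mapping from features to classes
print("Accuracy:", clf.score(X_test, y_test))  # fraction of correctly classified samples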
Supervised Machine Learning
Regression
Regression, on the other hand, deals with predicting continuous target
variables, which represent numerical values.
For example, predicting the price of a house based on its size, location,
and amenities, or forecasting the sales of a product.
Regression algorithms learn to map the input features to a continuous
numerical value.
Supervised Machine Learning
Here are some regression algorithms:
Linear Regression
Polynomial Regression
Ridge Regression
Lasso Regression
Decision tree
Random Forest
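As a hedged illustration of regression, the sketch below fits a linear regression model to a tiny synthetic dataset standing in for the house size/price example above; the numbers are made up purely for illustration.

# Regression sketch: predict a continuous price from house size (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[50], [80], [120], [160], [200]])              # size in square metres
prices = np.array([150_000, 230_000, 330_000, 420_000, 510_000]) # made-up prices

reg = LinearRegression().fit(sizes, prices)           # learn a continuous mapping
print("Predicted price for 100 m^2:", reg.predict([[100]])[0])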
Advantages of Supervised
Machine Learning
Supervised Learning models can have high accuracy as they are trained
on labelled data.
The process of decision-making in supervised learning models is often
interpretable.
Pre-trained supervised models can often be reused, which saves time and
resources compared with developing new models from scratch.
Disadvantages of Supervised
Machine Learning
It may struggle with unseen or unexpected patterns that are not present in the
training data.
It can be time-consuming and costly, as it relies on labeled data only.
It may generalize poorly to new data.
Applications of Supervised
Learning
• Image classification: Identify objects, faces, and other features in images.
• Natural language processing: Extract information from text, such as sentiment, entities, and relationships.
• Speech recognition: Convert spoken language into text.
• Recommendation systems: Make personalized recommendations to users.
• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
• Medical diagnosis: Detect diseases and other medical conditions.
• Fraud detection: Identify fraudulent transactions.
• Autonomous vehicles: Recognize and respond to objects in the environment.
• Email spam detection: Classify emails as spam or not spam.
• Quality control in manufacturing: Inspect products for defects.
• Credit scoring: Assess the risk of a borrower defaulting on a loan.
• Gaming: Recognize characters, analyze player behavior, and create NPCs.
• Customer support: Automate customer support tasks.
• Weather forecasting: Make predictions for temperature, precipitation, and other meteorological parameters.
• Sports analytics: Analyze player performance and make predictions.
Unsupervised Machine Learning
Unsupervised Learning
Unsupervised learning is a type of machine learning technique in which an
algorithm discovers patterns and relationships using unlabeled data.
Unlike supervised learning, unsupervised learning doesn’t involve
providing the algorithm with labeled target outputs.
The primary goal of Unsupervised learning is often to discover hidden
patterns, similarities, or clusters within the data, which can then be used for
various purposes, such as data exploration, visualization, dimensionality
reduction, and more.
Unsupervised Machine Learning
Example
Consider that you have a dataset that contains information about the
purchases you made from the shop.
Through clustering, the algorithm can group customers with similar purchasing
behavior, revealing potential customer segments without predefined labels.
This type of information can help businesses target customers as well as
identify outliers.
Unsupervised Machine Learning
Two main categories of unsupervised learning are mentioned
below:
Clustering
Association
Unsupervised Machine Learning
Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for
labeled examples.
Here are some clustering algorithms:
K-Means Clustering algorithm
Mean-shift algorithm
DBSCAN Algorithm
Principal Component Analysis (strictly a dimensionality-reduction technique, but also unsupervised)
Independent Component Analysis (strictly a dimensionality-reduction technique, but also unsupervised)
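As a hedged illustration of clustering, the sketch below runs K-Means on a tiny made-up customer table (visits and average spend per customer); the data, the choice of two clusters, and the features are assumptions of the example.

# Clustering sketch: group customers by purchasing behaviour with K-Means.
import numpy as np
from sklearn.cluster import KMeans

# Illustrative purchase data: [number of visits, average spend] per customer.
customers = np.array([
    [2, 15], [3, 18], [2, 20],        # occasional, low-spend shoppers
    [20, 150], [22, 160], [19, 140],  # frequent, high-spend shoppers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("Cluster labels:", kmeans.labels_)          # group assignment per customer
print("Cluster centres:", kmeans.cluster_centers_)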
Unsupervised Machine Learning
Association
Association rule learning is a technique for discovering relationships between items
in a dataset.
It identifies rules that indicate the presence of one item implies the presence of
another item with a specific probability.
Here are some association rule learning algorithms:
Apriori Algorithm
Eclat
FP-growth Algorithm
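A minimal sketch of the idea behind association rule mining follows, using a hand-rolled support/confidence computation on made-up transactions; libraries such as mlxtend provide full Apriori and FP-growth implementations, so this is only meant to show what a rule's support and confidence mean.

# Association rule sketch: support and confidence of {A} -> {B} rules.
from itertools import combinations

# Toy market-basket transactions (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {item for t in transactions for item in t}
for a, b in combinations(sorted(items), 2):
    both = support({a, b})
    if both > 0:
        # Confidence of {a} -> {b}: how often b appears given that a appears.
        print(f"{{{a}}} -> {{{b}}}: support={both:.2f}, "
              f"confidence={both / support({a}):.2f}")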
Advantages of Unsupervised
Machine Learning
It helps to discover hidden patterns and various relationships within the
data.
Used for tasks such as customer segmentation, anomaly detection, and
data exploration.
It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised
Machine Learning
Without using labels, it may be difficult to predict the quality of the
model’s output.
Cluster Interpretability may not be clear and may not have meaningful
interpretations.
Extracting meaningful features from raw data often requires additional
techniques such as autoencoders and dimensionality reduction.
Applications of Unsupervised
Learning
• Clustering: Group similar data points into clusters.
• Anomaly detection: Identify outliers or anomalies in data.
• Dimensionality reduction: Reduce the dimensionality of data while preserving its essential information.
• Recommendation systems: Suggest products, movies, or content to users based on their historical behavior or preferences.
• Topic modeling: Discover latent topics within a collection of documents.
• Density estimation: Estimate the probability density function of data.
• Image and video compression: Reduce the amount of storage required for multimedia content.
• Exploratory data analysis (EDA): Explore data and gain insights before defining specific tasks.
• Data preprocessing: Help with data preprocessing tasks such as data cleaning, imputation of missing values, and data scaling.
• Market basket analysis: Discover associations between products.
• Genomic data analysis: Identify patterns or group genes with similar expression profiles.
• Image segmentation: Segment images into meaningful regions.
• Community detection in social networks: Identify communities or groups of individuals with similar interests or connections.
• Customer behavior analysis: Uncover patterns and insights for better marketing and product recommendations.
• Content recommendation: Classify and tag content to support recommendations.
Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between
supervised and unsupervised learning, so it uses both labelled and unlabelled data.
It's particularly useful when obtaining labeled data is costly, time-consuming,
or resource-intensive.
This approach is useful when labelling the entire dataset would be expensive
and time-consuming.
Semi-supervised learning is chosen when labeled data requires skills and
relevant resources in order to train or learn from it.
Semi-Supervised Learning
We use these techniques when only a small portion of the data is labeled
and the remaining large portion is unlabeled.
We can use unsupervised techniques to predict labels for the unlabeled portion
and then feed these labels to supervised techniques.
This technique is mostly applicable to image datasets, where usually not all
images are labeled.
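As a hedged sketch of the "label the unlabeled portion, then train on it" idea, the code below uses self-training, where a supervised model iteratively labels the points it is confident about. scikit-learn's SelfTrainingClassifier, the digits dataset, and the 90% label-masking rate are assumptions of the example, not the only way to do semi-supervised learning.

# Semi-supervised sketch: self-training with mostly unlabeled data.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_digits(return_X_y=True)

# Pretend most labels are missing: unlabeled samples are marked with -1.
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1   # keep labels for only ~10% of samples

# The base classifier labels confident unlabeled points and is retrained on them.
model = SelfTrainingClassifier(LogisticRegression(max_iter=5000))
model.fit(X, y_partial)
print("Accuracy against all true labels:", model.score(X, y))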
Semi-Supervised Learning
Example:
Consider that we are building a language translation model; having
labeled translations for every sentence pair can be resource-intensive.
It allows the models to learn from labeled and unlabeled sentence pairs,
making them more accurate.
This technique has led to significant improvements in the quality of
machine translation services.
Advantages of Semi- Supervised
Machine Learning
It leads to better generalization as compared to supervised learning, as
it takes both labeled and unlabeled data.
Can be applied to a wide range of data.
Disadvantages of Semi-
Supervised Machine Learning
Semi-supervised methods can be more complex to implement
compared to other approaches.
It still requires some labeled data that might not always be available or
easy to obtain.
Unrepresentative or noisy unlabeled data can degrade the model's performance.
Applications of Semi-Supervised
Learning
Image Classification and Object Recognition: Improve the accuracy of models by
combining a small set of labeled images with a larger set of unlabeled images.
Natural Language Processing (NLP): Enhance the performance of language models and
classifiers by combining a small set of labeled text data with a vast amount of unlabeled text.
Speech Recognition: Improve the accuracy of speech recognition by leveraging a limited
amount of transcribed speech data and a more extensive set of unlabeled audio.
Recommendation Systems: Improve the accuracy of personalized recommendations by
supplementing a sparse set of user-item interactions (labeled data) with a wealth of unlabeled
user behavior data.
Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a small set of
labeled medical images alongside a larger set of unlabeled images.
Reinforcement Machine Learning
A reinforcement machine learning algorithm is a learning method in which an agent interacts with
an environment by producing actions and discovering errors.
Trial and error and delayed reward are the most relevant characteristics of reinforcement learning.
In this technique, the model keeps improving its performance using reward feedback to learn
the behavior or pattern.
These algorithms are applied to specific problems, e.g., the Google self-driving car, or AlphaGo, where
a bot competes with humans and even with itself to become a better and better Go player.
Each time the agent gathers new experience, it adds it to its knowledge, which becomes its training data.
So, the more it learns, the better trained and more experienced it becomes.
Reinforcement Machine Learning
Here are some of the most common reinforcement learning algorithms:
Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function, which
maps states to actions. The Q-function estimates the expected reward of taking a
particular action in a given state.
SARSA (State-Action-Reward-State-Action): SARSA is another model-free RL
algorithm that learns a Q-function. However, unlike Q-learning, SARSA updates the Q-
function for the action that was actually taken, rather than the optimal action.
Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning.
Deep Q-learning uses a neural network to represent the Q-function, which allows it to
learn complex relationships between states and actions.
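A minimal sketch of the tabular Q-learning update rule follows, on a made-up one-dimensional grid world; the environment, reward, and hyper-parameters (learning rate, discount, exploration rate) are illustrative assumptions.

# Q-learning sketch: tiny 1-D grid world, states 0..4, reward 1 at state 4.
import numpy as np

n_states, n_actions = 5, 2                   # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))          # Q-table: expected return per (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.1        # learning rate, discount, exploration rate
rng = np.random.RandomState(0)

for episode in range(200):
    state = 0
    while state != 4:                        # an episode ends at the goal state
        # Epsilon-greedy action selection (trial and error); act randomly while Q is all zero.
        if rng.rand() < epsilon or not Q[state].any():
            action = rng.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update: move Q towards reward + discounted best future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.round(Q, 2))   # the learned values should prefer "right" in every state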
Reinforcement Machine Learning
Example:
Consider that you are training an AI agent to play a game like chess.
The agent explores different moves and receives positive or negative feedback based
on the outcome.
Reinforcement learning also finds applications where agents learn to perform tasks by
interacting with their surroundings.
Types of Reinforcement Machine Learning
Positive reinforcement
Rewards the agent for taking a desired action.
Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct answer.
Negative reinforcement
Removes an undesirable stimulus to encourage a desired behavior.
Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by completing
a task.
Advantages of Reinforcement
Machine Learning
It supports autonomous decision-making and is well-suited for tasks that require
learning a sequence of decisions, such as robotics and game-playing.
This technique is preferred for achieving long-term results that are otherwise very
difficult to achieve.
It can be used to solve complex problems that cannot be solved by conventional
techniques.
Disadvantages of Reinforcement
Machine Learning
Training Reinforcement Learning agents can be computationally expensive and
time-consuming.
Reinforcement learning is not preferable for solving simple problems.
It needs a lot of data and a lot of computation, which makes it impractical and
costly.
Applications of Reinforcement
Machine Learning
• Game Playing: RL can teach agents to play games, even complex ones.
• Robotics: RL can teach robots to perform tasks autonomously.
• Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
• Recommendation Systems: RL can enhance recommendation algorithms by learning user preferences.
• Healthcare: RL can be used to optimize treatment plans and drug discovery.
• Natural Language Processing (NLP): RL can be used in dialogue systems and chatbots.
• Finance and Trading: RL can be used for algorithmic trading.
• Agriculture: RL can be used to optimize agricultural operations.
• Supply Chain and Inventory Management: RL can be used to optimize supply chain operations.
• Energy Management: RL can be used to optimize energy consumption.
• Game AI: RL can be used to create more intelligent and adaptive NPCs in video games.
• Adaptive Personal Assistants: RL can be used to improve personal assistants.
• Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create immersive and interactive experiences.
• Industrial Control: RL can be used to optimize industrial processes.
• Education: RL can be used to create adaptive learning systems.
Important Elements of Machine
Learning
There are six elements of machine learning :
1. Data
2. Defining a Task
3. Applying Model
4. Calculating Loss
5. Learning Algorithm
6. Evaluation
Important Elements of Machine
Learning
Data (The fossil fuel of machine learning):
Data means information. All types and formats of information.
Today, there is an enormous amount of data produced every second, which can be used
to answer so many questions.
There is text data as well as audio-video data, there is structured data as well as
unstructured data.
One important thing to remember is that it doesn't matter in which format you get the
data; in the end, all the data needs to be encoded as numbers before feeding it to the
computer.
Important Elements of Machine
Learning
A typical dataset required to perform an ML prediction is high-dimensional,
meaning it consists of millions of rows/data points/observations with typically
thousands or even millions of columns/parameters/features.
A dataset may be presented with inputs as well as their corresponding outputs,
which is ideal for performing a supervised learning task by learning the
relationship between input and output; if the dataset doesn't contain any
output corresponding to the inputs, then we can only perform an unsupervised
learning task.
Important Elements of Machine
Learning
Data can be structured (represented in tabular form, e.g., sales data or file
records) or unstructured (e.g., the incoming feed on social media websites).
Important Elements of Machine
Learning
The data that has to be fed to the model for training should be in
machine-readable form i.e. it should be encoded as a number.
E.g. -
1.Certain text data like reviews should be represented in the numerical format
using one-hot encoding.
2.Image data can be represented in RGB format.
3.Video (a collection of frames) can be represented in numerical format.
4.Speech data could be represented in the numerical format using variation in
amplitude.
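As a hedged sketch of point 1 above, the code below one-hot encodes a text category using pandas; the tiny review-sentiment table is made up purely for illustration.

# Encoding sketch: turn a named category into numeric 0/1 columns (one-hot encoding).
import pandas as pd

df = pd.DataFrame({"review_sentiment": ["good", "bad", "neutral", "good"]})

# Each named value becomes its own 0/1 column, so the data is machine-readable.
encoded = pd.get_dummies(df, columns=["review_sentiment"])
print(encoded)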
Important Elements of Machine
Learning
Task ( Setting an objective of the ML project with the curated
dataset):
Based on the procured/curated dataset we can define our task
accordingly.
If we have a labeled training dataset, i.e., it contains inputs (x's) and their
corresponding labels (y's), we can perform supervised
learning (classification/regression); if the training dataset doesn't contain
corresponding labels (y's), we can only perform
unsupervised learning (clustering/generation).
Important Elements of Machine
Learning
Qualitative data
1. Nominal data
2. Ordinal data
Nominal data has no numeric value, but a named value. It is used
for assigning named values to attributes. Nominal values cannot be quantified.
Common examples of nominal data are blood group and gender.
Ordinal data also takes named values, but the values have a natural order
(for example, grades such as low, medium, and high), although the exact
difference between values cannot be measured.
Quantitative data
1. Interval data
2. Ratio data
Interval data is numeric data for which not only the order is known, but
the exact difference between values is also known.
An ideal example of interval data is Celsius temperature.
The difference between each value remains the same in Celsius
temperature.
For example, the difference between 12°C and 18°C is measurable
and is 6°C, just as in the case of the difference between 15.5°C and 21.5°C.
Other examples include date, time, etc.
Quantitative data
For interval data, mathematical operations such as addition and
subtraction are possible.
For that reason, for interval data, the central tendency can be
measured by mean, median, or mode.
Standard deviation can also be calculated.
Quantitative data
However, interval data do not have something called a ‘true zero’ value.
For example, there is nothing called ‘0 temperature’ or ‘no temperature’.
Hence, only addition and subtraction apply to interval data;
ratios cannot be applied.
This means we can say that a temperature of 40°C is equal to a
temperature of 20°C plus a temperature of 20°C.
However, we cannot say that a temperature of 40°C is twice as
hot as a temperature of 20°C.
Quantitative data
Ratio data represents numeric data for which exact value can be
measured. Absolute zero is available for ratio data.
Also, these variables can be added, subtracted, multiplied, or divided.
The central tendency can be measured by mean, median, or mode and
methods of dispersion such as standard deviation.
Examples of ratio data include height, weight, age, salary, etc.
Quantitative data
Figure gives a summarized view of different types of data that we may
find in a typical machine learning problem.
Feature Selection
Feature selection is the process of choosing the most important and relevant features
from your data that contribute the most to predicting the target variable.
In simple terms, it's like picking out the most useful pieces of information from a large
set to make a better and more efficient model.
Imagine you are trying to determine if an apple leaf is healthy or diseased based on
several features such as color, size, texture, and shape.
If some features don't really help in making this determination (like the size might not be
as indicative as the color or texture), you can ignore them.
By focusing only on the most useful features, you can build a model that is simpler,
faster, and often more accurate.
Feature Selection
Feature selection methods in machine learning fall into three groups:
- Filter methods
- Wrapper methods
- Embedded methods
Filter Methods
These methods are generally used while doing the pre-processing step.
These methods select features from the dataset irrespective of the use of
any machine learning algorithm.
In terms of computation, they are very fast and inexpensive and are very
good for removing duplicated, correlated, redundant features but these
methods do not remove multicollinearity.
Each feature is evaluated individually, which can help when features are
useful in isolation, but this falls short when a combination of features
would increase the overall performance of the model.
Filter Methods
Filter Methods Implementation
Some techniques used are:
Information Gain – It is defined as the amount of information provided by the
feature for identifying the target value and measures reduction in the entropy
values. Information gain of each attribute is calculated considering the target
values for feature selection.
Chi-square test — The chi-square (χ²) method is generally used to test the
relationship between categorical variables. It compares the observed values of
different attributes of the dataset with their expected values.
Some techniques used are:
Fisher’s Score – Fisher’s Score selects each feature independently according
to their scores under Fisher criterion leading to a suboptimal set of features. The
larger the Fisher’s score is, the better is the selected feature.
Correlation Coefficient – Pearson’s Correlation Coefficient is a measure of
quantifying the association between the two continuous variables and the
direction of the relationship with its values ranging from -1 to 1.
Variance Threshold – It is an approach where all features are removed
whose variance doesn’t meet the specific threshold. By default, this method
removes features having zero variance. The assumption made using this
method is higher variance features are likely to contain more information.
Some techniques used are:
Mean Absolute Difference (MAD) – This method is similar to variance threshold
method but the difference is there is no square in MAD. This method calculates the mean
absolute difference from the mean value.
Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean (AM)
to that of Geometric mean (GM) for a given feature. Its value ranges from +1 to ∞ as AM
≥ GM for a given feature. Higher dispersion ratio implies a more relevant feature.
Mutual Dependence – This method measures if two variables are mutually dependent,
and thus provides the amount of information obtained for one variable on observing the
other variable. Depending on the presence/absence of a feature, it measures the amount
of information that feature contributes to making the target prediction.
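As a hedged sketch of two of the filter techniques above (variance threshold and the chi-square test), the code below applies them with scikit-learn to the Iris dataset; the dataset, the 0.2 threshold, and k = 2 are illustrative assumptions.

# Filter-method sketch: variance threshold and chi-square feature selection.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Variance Threshold: drop features whose variance is below the chosen cutoff.
vt = VarianceThreshold(threshold=0.2)
X_vt = vt.fit_transform(X)
print("Features kept by variance threshold:", X_vt.shape[1])

# Chi-square test: keep the k features most related to the target.
skb = SelectKBest(score_func=chi2, k=2)
X_chi = skb.fit_transform(X, y)
print("Chi-square scores per feature:", skb.scores_)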
Wrapper methods:
Wrapper methods, also referred to as greedy algorithms, train the algorithm
using a subset of features in an iterative manner.
Based on the conclusions drawn from the previous round of training, features
are added or removed.
Stopping criteria for selecting the best subset are usually pre-defined by the
person training the model such as when the performance of the model decreases
or a specific number of features has been achieved.
The main advantage of wrapper methods over the filter methods is that they
provide an optimal set of features for training the model, thus resulting in better
accuracy than the filter methods but are computationally more expensive.
Wrapper methods:
Wrapper Methods Implementation
Some techniques used are:
Forward selection – This is an iterative approach where we initially
start with an empty set of features and, after each iteration, add the feature
that best improves the model. We stop when the addition of a new variable
no longer improves the performance of the model.
Backward elimination – This is also an iterative approach where we
initially start with all features and, after each iteration, remove the least
significant feature. We stop when no further improvement in the
performance of the model is observed after removing a feature.
Bi-directional elimination – This method uses both forward selection and
backward elimination technique simultaneously to reach one unique solution.
Some techniques used are:
Exhaustive selection – This technique is considered as the brute force
approach for the evaluation of feature subsets. It creates all possible
subsets and builds a learning algorithm for each subset and selects the
subset whose model’s performance is best.
Recursive elimination – This greedy optimization method selects
features by recursively considering smaller and smaller sets of features.
The estimator is trained on an initial set of features, and their importance is
obtained using the feature_importances_ attribute. The least important features
are then removed from the current set of features until we are left with the
required number of features.
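A hedged sketch of recursive feature elimination follows, using scikit-learn's RFE on the Iris dataset; the base estimator and the number of features to keep are illustrative choices.

# Wrapper-method sketch: recursive feature elimination (RFE).
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly drop the least important feature until only the requested number remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print("Selected features mask:", rfe.support_)
print("Feature ranking (1 = selected):", rfe.ranking_)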
Embedded methods:
In embedded methods, the feature selection algorithm is blended as
part of the learning algorithm, thus having its own built-in feature
selection methods.
Embedded methods address the drawbacks of filter and wrapper
methods and merge their advantages.
These methods are faster like those of filter methods and more
accurate than the filter methods and take into consideration a
combination of features as well.
Embedded methods:
Embedded Methods Implementation
Some techniques used are:
Regularization – This method adds a penalty to different parameters of
the machine learning model to avoid over-fitting of the model. This
approach of feature selection uses Lasso (L1 regularization) and Elastic
nets (L1 and L2 regularization). The penalty is applied over the coefficients,
thus bringing down some coefficients to zero. The features having zero
coefficient can be removed from the dataset.
Tree-based methods – Methods such as Random Forest and Gradient
Boosting provide feature importance as a way to select features.
Feature importance tells us which features have more impact on the
target feature.
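A hedged sketch of both embedded approaches (L1 regularization with Lasso, and tree-based importances with a random forest) on a synthetic regression dataset follows; the data generator and hyper-parameters are illustrative assumptions.

# Embedded-method sketch: Lasso coefficients and random forest feature importances.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 10 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# L1 regularization pushes uninformative features to exactly zero coefficients.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso coefficients:", lasso.coef_.round(2))

# Tree-based importance: informative features receive higher importance scores.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("Random forest importances:", forest.feature_importances_.round(2))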
Curse of Dimensionality in
Machine Learning
As the number of features or dimensions in a dataset increases, the amount of data required
to obtain a statistically significant result increases exponentially.
This can lead to issues such as overfitting, increased computation time, and reduced
accuracy of machine learning models this is known as the curse of dimensionality
problems that arise while working with high-dimensional data.
As the number of dimensions increases, the number of possible combinations of features
increases exponentially, which makes it computationally difficult to obtain a representative
sample of the data and expensive to perform tasks such as clustering or
classification.
Additionally, some machine learning algorithms can be sensitive to the number of dimensions,
requiring more data to achieve the same level of accuracy as lower-dimensional data.
Principal Component Analysis
(PCA)
To address the curse of dimensionality, Feature engineering techniques
are used which include feature selection and feature extraction.
Dimensionality reduction is a type of feature extraction technique that
aims to reduce the number of input features while retaining as much of
the original information as possible.
What is Principal Component
Analysis(PCA)?
Principal Component Analysis(PCA) technique was introduced by the
mathematician Karl Pearson in 1901.
It works on the condition that while the data in a higher dimensional
space is mapped to data in a lower dimension space, the variance of the
data in the lower dimensional space should be maximum.
What is Principal Component
Analysis(PCA)?
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal
transformation that converts a set of correlated variables to a set of uncorrelated
variables.
PCA is the most widely used tool in exploratory data analysis and in machine learning for
predictive models.
Moreover, Principal Component Analysis (PCA) is an unsupervised learning technique used
to examine the interrelations among a set of variables.
It is also known as a general factor analysis where regression determines a line of best fit.
The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a
dataset while preserving the most important patterns or relationships between the
variables without any prior knowledge of the target variables.
What is Principal Component
Analysis(PCA)?
Principal Component Analysis (PCA) is used to reduce the
dimensionality of a data set by finding a new set of variables, smaller
than the original set of variables, retaining most of the sample’s
information, and useful for the regression and classification of data.
What is Principal Component
Analysis(PCA)?
Principal Component Analysis (PCA) is a technique for dimensionality reduction
that identifies a set of orthogonal axes, called principal components, that capture
the maximum variance in the data.
The principal components are linear combinations of the original variables in the
dataset and are ordered in decreasing order of importance.
The total variance captured by all the principal components is equal to the total
variance in the original dataset.
The first principal component captures the most variation in the data; the
second principal component captures the maximum variance that is orthogonal
to the first principal component, and so on.
What is Principal Component
Analysis(PCA)?
Principal Component Analysis can be used for a variety of purposes, including data
visualization, feature selection, and data compression.
In data visualization, PCA can be used to plot high-dimensional data in two or three
dimensions, making it easier to interpret.
In feature selection, PCA can be used to identify the most important variables in a
dataset.
In data compression, PCA can be used to reduce the size of a dataset without losing
important information.
In Principal Component Analysis, it is assumed that the information is carried in the
variance of the features, that is, the higher the variation in a feature, the more information
that features carries.
Step-By-Step Explanation of PCA
(Principal Component Analysis)
Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0
and a standard deviation of 1:
Z = (X − μ) / σ
Here,
• μ is the mean of the independent features, μ = {μ1, μ2, ⋯, μm}
• σ is the standard deviation of the independent features, σ = {σ1, σ2, ⋯, σm}
Step 2: Covariance matrix computation
Covariance measures the strength of joint variability between two or more variables,
indicating how much they change in relation to each other. To find the covariance
between two features X1 and X2 we can use the formula:
cov(X1, X2) = Σ (X1i − mean(X1)) (X2i − mean(X2)) / (n − 1)
Step 3: Eigenvalues and eigenvectors of the covariance matrix
For a square matrix A, if
AX = λX
for some scalar value λ, then λ is known as an eigenvalue of matrix A and X is known as
the eigenvector of matrix A for the corresponding eigenvalue. This can be rewritten as:
AX − λX = 0
(A − λI)X = 0
Step-By-Step Explanation of PCA
(Principal Component Analysis)
where I is the identity matrix of the same shape as matrix A. The above condition will be
true only if (A − λI) is non-invertible (i.e., a singular matrix). That means
|A − λI| = 0
From this equation, we can find the eigenvalues λ, and the corresponding eigenvectors
can then be found using the equation
AX = λX
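To tie the steps above to code, here is a hedged sketch that standardizes the data and then lets scikit-learn's PCA perform the covariance and eigen decomposition internally, keeping the top components; the Iris dataset and the choice of two components are illustrative assumptions.

# PCA sketch: standardize, then project onto the top principal components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Step 1: standardization (zero mean, unit variance per feature).
X_std = StandardScaler().fit_transform(X)

# Steps 2-3 (covariance matrix and its eigen decomposition) happen inside PCA;
# keep the two components that capture the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)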
Advantages of Principal
Component Analysis
Dimensionality Reduction: Principal Component Analysis is a popular technique
used for dimensionality reduction, which is the process of reducing the number of
variables in a dataset. By reducing the number of variables, PCA simplifies data
analysis, improves performance, and makes it easier to visualize data.
Feature Selection: Principal Component Analysis can be used for feature selection,
which is the process of selecting the most important variables in a dataset. This is
useful in machine learning, where the number of variables can be very large, and it is
difficult to identify the most important variables.
Data Visualization: Principal Component Analysis can be used for data visualization.
By reducing the number of variables, PCA can plot high-dimensional data in two or three
dimensions, making it easier to interpret.
Advantages of Principal
Component Analysis
Multicollinearity: Principal Component Analysis can be used to deal with multicollinearity, which is a common
problem in a regression analysis where two or more independent variables are highly correlated. PCA can help
identify the underlying structure in the data and create new, uncorrelated variables that can be used in the
regression model.
Noise Reduction: Principal Component Analysis can be used to reduce the noise in data. By removing the
principal components with low variance, which are assumed to represent noise, Principal Component Analysis can
improve the signal-to-noise ratio and make it easier to identify the underlying structure in the data.
Data Compression: Principal Component Analysis can be used for data compression. By representing the data
using a smaller number of principal components, which capture most of the variation in the data, PCA can reduce
the storage requirements and speed up processing.
Outlier Detection: Principal Component Analysis can be used for outlier detection. Outliers are data points
that are significantly different from the other data points in the dataset. Principal Component Analysis can
identify these outliers by looking for data points that are far from the other points in the principal component
space.
Disadvantages of Principal
Component Analysis
Interpretation of Principal Components: The principal components created by Principal
Component Analysis are linear combinations of the original variables, and it is often difficult
to interpret them in terms of the original variables. This can make it difficult to explain the
results of PCA to others.
Data Scaling: Principal Component Analysis is sensitive to the scale of the data. If the data
is not properly scaled, then PCA may not work well. Therefore, it is important to scale the
data before applying Principal Component Analysis.
Information Loss: Principal Component Analysis can result in information loss. While
Principal Component Analysis reduces the number of variables, it can also lead to loss of
information. The degree of information loss depends on the number of principal components
selected. Therefore, it is important to carefully select the number of principal components to
retain.
Disadvantages of Principal
Component Analysis
Non-linear Relationships: Principal Component Analysis assumes that the
relationships between variables are linear. However, if there are non-linear
relationships between variables, Principal Component Analysis may not work well.
Computational Complexity: Computing Principal Component Analysis can be
computationally expensive for large datasets. This is especially true if the number
of variables in the dataset is large.
Overfitting: Principal Component Analysis can sometimes result in overfitting,
which is when the model fits the training data too well and performs poorly on
new data. This can happen if too many principal components are used or if the
model is trained on a small dataset.
Dataset Validation Techniques
Cross-Validation in Machine Learning
Cross-validation is a technique for validating model efficiency by training the model on a subset
of the input data and testing it on a previously unseen subset of the input data.
In machine learning, there is always a need to test the stability of the model; this cannot be
judged from the training dataset alone.
For this purpose, we reserve a particular sample of the dataset, which was not part of the
training dataset.
After that, we test our model on that sample before deployment, and this complete process
comes under cross-validation.
This is something different from the general train-test split.
Dataset Validation Techniques
Hence the basic steps of cross-validations are:
Reserve a subset of the dataset as a validation set.
Provide the training to the model using the training dataset.
Now, evaluate model performance using the validation set. If the model
performs well with the validation set, proceed to the next steps; otherwise,
check for issues.
Methods used for Cross-
Validation
Some common methods are used for cross-validation.
These methods are given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. K-fold cross-validation
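A hedged sketch of k-fold cross-validation with scikit-learn's cross_val_score follows; the dataset, the model, and the choice of k = 5 folds are illustrative assumptions.

# Cross-validation sketch: 5-fold cross-validation of a classifier.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves once as the validation set while the rest is used for training.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))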