ML UNIT-I
Uncover patterns & trends in data: Finding hidden patterns and extracting key insights from
data is the most essential part of Machine Learning. By building predictive models and using
statistical techniques, Machine Learning allows you to dig beneath the surface and
explore the data at a minute scale. Understanding data and extracting patterns
manually will take days, whereas Machine Learning algorithms can perform such
computations in less than a second.
Solve complex problems: From detecting the genes linked to the deadly ALS disease to
building self-driving cars, Machine Learning can be used to solve the most complex
problems.
To give you a better understanding of how important Machine Learning is, let’s list a few Machine Learning applications:
Netflix’s Recommendation Engine: The core of Netflix is its famous recommendation engine. Over 75% of what you watch is recommended by Netflix, and these recommendations are made by implementing Machine Learning.
Amazon’s Alexa: The famous Alexa, which is based on Natural Language Processing and Machine Learning, is an advanced virtual assistant that does more than just play songs from your playlist. It can book you an Uber, connect with other IoT devices in your home, and more.
Google’s Spam Filter: Gmail makes use of Machine Learning to filter out spam messages. It
uses Machine Learning algorithms and Natural Language Processing to analyze emails in real-time and classify them as either spam or non-spam.
Introduction To Machine Learning
The term Machine Learning was first coined by Arthur Samuel in the year 1959, while he was building a checkers-playing program, and it marked a significant milestone for the field.
If you browse through the net about ‘what is Machine Learning’, you’ll get at least 100
different definitions.
However, the very first formal definition was given by Tom M. Mitchell:
“A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P if its performance at tasks in T, as measured by P, improves with
experience E.”
To sum it up: a Machine Learning process begins by feeding the machine lots of data. Using this data, the machine is trained to detect hidden insights and trends. These insights are then used to build a Machine Learning model, by applying an algorithm, in order to solve a problem.
The next topic in this Introduction to Machine Learning blog is the Machine Learning Process.
Machine Learning Process
The Machine Learning process involves building a Predictive model that can be used to find a
solution for a Problem Statement. To understand the Machine Learning process let’s assume
that you have been given a problem that needs to be solved by using Machine Learning.
The problem is to predict the occurrence of rain in your local area by using Machine Learning.
The below steps are followed in a Machine Learning process:
Step 1: Define the objective of the Problem Statement
At this step, we must understand what exactly needs to be predicted. In our case, the
objective is to predict the possibility of rain by studying weather conditions. At this stage, it is
also essential to take mental notes on what kind of data can be used to solve this problem or
the type of approach you must follow to get to the solution.
Step 2: Data Gathering
At this stage, you must be asking questions such as,
What kind of data is needed to solve this problem?
Is the data available?
How can I get the data?
Once you know the type of data that is required, you must understand how you can obtain this data. Data collection can be done manually or by web scraping. However, if you’re a beginner just looking to learn Machine Learning, you don’t have to worry about getting the data: there are thousands of data sets available on the web, so you can simply download one and get going.
Coming back to the problem at hand, the data needed for weather forecasting includes
measures such as humidity level, temperature, pressure, locality, whether or not you live in a
hill station, etc. Such data must be collected and stored for analysis.
Supervised Learning
Supervised learning involves training a model on labeled data. Consider an example where we feed the machine images of Tom and Jerry, and the goal is for the machine to identify and classify the images into two groups (Tom images and Jerry images). The training data set that is fed to the model is labeled, as in, we’re telling the machine, ‘this is how Tom looks and this is Jerry’. By doing so you’re training the machine using labeled data. In Supervised Learning, there is a well-defined training phase done with the help of labeled data.
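As a minimal sketch of this idea in Python (the toy feature vectors, the 0/1 labels standing in for “Tom” and “Jerry”, and the use of scikit-learn’s LogisticRegression are illustrative assumptions, not part of the original example):

# A minimal supervised-learning sketch: hypothetical feature vectors, not real images.
# Labels are known in advance: 0 = "Tom", 1 = "Jerry".
from sklearn.linear_model import LogisticRegression

X_train = [[0.9, 0.2], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8]]   # toy image features
y_train = [0, 0, 1, 1]                                        # labels supplied by us

clf = LogisticRegression()
clf.fit(X_train, y_train)                  # the well-defined training phase on labeled data

print(clf.predict([[0.85, 0.25]]))         # predicts 0 ("Tom") for a Tom-like vector

The key point is that the labels y_train are given to the algorithm during training.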
Unsupervised Learning
Unsupervised learning involves training by using unlabeled data and allowing the model to act
on that information without guidance.
Think of unsupervised learning as a smart kid that learns without any guidance. In this type of Machine Learning, the model is not fed labeled data; it has no clue that ‘this image is Tom and this is Jerry’. It figures out patterns and the differences between Tom and Jerry on its own by taking in tons of data.
For example, it identifies prominent features of Tom, such as pointy ears and bigger size, to understand that these images are of type 1. Similarly, it finds such features in Jerry and knows that those images are of type 2. Therefore, it classifies the images into two different classes without knowing who Tom or Jerry is.
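A comparable sketch without labels (again assuming scikit-learn and made-up feature vectors; the clustering algorithm used here, KMeans, is one possible choice, not the only one):

# A minimal unsupervised-learning sketch: the same kind of feature vectors, but no labels.
from sklearn.cluster import KMeans

X = [[0.9, 0.2], [0.8, 0.3], [0.85, 0.25],
     [0.2, 0.9], [0.1, 0.8], [0.15, 0.85]]     # unlabeled data

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)                 # groups discovered from the data alone

print(labels)                                  # e.g. [0 0 0 1 1 1]: two clusters, unnamed

The model separates the two groups, but it never learns the names “Tom” or “Jerry”.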
Reinforcement Learning
Reinforcement Learning is a part of Machine Learning where an agent is put in an environment and learns to behave in that environment by performing certain actions and observing the rewards it gets from those actions.
This type of Machine Learning is comparatively different.
Imagine that you were dropped off at an isolated island! What would you do?
Panic? Yes, of course, initially we all would. But as time passes by, you will learn how to live
on the island. You will explore the environment, understand the climate condition, the type of
food that grows there, the dangers of the island, etc. This is exactly how Reinforcement Learning works: it involves an Agent (you, stuck on the island) that is put in an unknown environment (the island), where it must learn by observing and performing actions that result in rewards.
Reinforcement Learning is mainly used in advanced Machine Learning areas such as self-driving cars, AlphaGo, etc.
1. Artificial Intelligence, Machine Learning, Deep learning
Deep Learning, Machine Learning, and Artificial Intelligence are the most used terms on the
internet for IT folks. However, all these three technologies are connected with each other. Artificial
Intelligence (AI) can be understood as an umbrella that consists of both Machine learning and
deep learning. Or We can say deep learning and machine learning both are subsets of artificial
intelligence.
As these technologies look similar, many people have the misconception that Deep Learning, Machine Learning, and Artificial Intelligence are all the same. But in reality, although all three technologies are used to build intelligent machines or applications that behave like a human, they differ in their functionality and scope.
These three terms are often used interchangeably, but they do not quite refer to the same things. Let’s understand the fundamental difference between deep learning, machine learning, and Artificial Intelligence.
In short, Artificial Intelligence is a branch of computer science that helps us create smart, intelligent machines. ML is a subfield of AI that teaches machines to learn from data and is used to build AI-driven applications. Deep Learning, in turn, is a sub-branch of ML that trains models on huge amounts of data using complex algorithms, and it mainly works with neural networks.
What is Artificial Intelligence (AI)?
Artificial Intelligence is defined as a field of science and engineering that deals with building intelligent machines or computers that can perform human-like activities.
John McCarthy is widely regarded as the father of Artificial Intelligence. There are some popular definitions of AI, which are as follows:
"AI is defined as the capability of machines to imitate intelligent human behavior."
"A computer system able to perform tasks that normally require human intelligence, such as visual
perception, speech recognition, decision-making, and translation between languages."
What is Deep Learning?
"Deep learning is defined as the subset of machine learning and artificial intelligence that is based
on artificial neural networks". In deep learning, the deep word refers to the number of layers
in a neural network.
Deep Learning is a set of algorithms inspired by the structure and function of the human
brain. It uses a huge amount of structured as well as unstructured data to teach computers
and predicts accurate results. The main difference between machine learning and deep
learning technologies is of presentation of data. Machine
learning uses structured/unstructured data for learning, while deep learning uses neural
networks for learning models.
Even if our model is “AWESOME”, if we feed it garbage data the result will also be garbage (garbage in, garbage out). Our training data must always contain enough relevant features and as few irrelevant ones as possible.
A large part of the credit for a successful machine learning project goes to coming up with a good set of features on which the model is trained (often referred to as feature engineering), which includes feature selection, feature extraction, and creating new features; these are other interesting topics to be covered in upcoming blogs.
4. Nonrepresentative training data:
To make sure that our model generalizes well, we have to make sure that our training data is representative of the new cases that we want to generalize to. If we train our model using a nonrepresentative training set, its predictions won’t be accurate; it will be biased against one class or group.
For example, let us say you are trying to build a model that recognizes the genre of music. One way to build your training set is to search YouTube and use the resulting data. Here we assume that YouTube’s search engine provides representative data, but in reality the search will be biased towards popular artists, and maybe even the artists that are popular in your location (if you live in India, you will mostly get the music of Arijit Singh, Sonu Nigam, and so on). So use representative data during training, so that your model won’t be biased towards one or two classes when it works on test data.
5. Overfitting the Training Data
Overfitting happens when the model is too complex relative to the amount and noisiness of
the training data. The possible solutions are:
• To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model (as illustrated in the sketch below)
• To gather more training data
• To reduce the noise in the training data (e.g., fix data errors and remove outliers)
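As a hedged sketch of the first remedy (constraining the model), assuming scikit-learn and a small synthetic data set; the polynomial degree, alpha, and noise level are arbitrary choices for illustration, and the exact numbers will vary from run to run:

# Sketch: a high-degree polynomial overfits a few noisy points; constraining it with
# ridge regularization (an L2 penalty) usually reduces the error on unseen data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 15)).reshape(-1, 1)       # only 15 noisy training points
y = np.sin(X).ravel() + rng.normal(0, 0.3, 15)
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)          # unseen data from the same curve
y_test = np.sin(X_test).ravel()

unconstrained = make_pipeline(PolynomialFeatures(12), LinearRegression()).fit(X, y)
constrained = make_pipeline(PolynomialFeatures(12), Ridge(alpha=1.0)).fit(X, y)

print("unconstrained test MSE:", mean_squared_error(y_test, unconstrained.predict(X_test)))
print("constrained test MSE  :", mean_squared_error(y_test, constrained.predict(X_test)))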
6. Underfitting the Training Data
Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the
underlying structure of the data. For example, a linear model of life satisfaction is prone to
underfit; reality is just more complex than the model, so its predictions are bound to be
inaccurate, even on the training examples.
The main options to fix this problem are:
• Selecting a more powerful model, with more parameters
• Feeding better features to the learning algorithm (feature engineering), as in the sketch after this list
• Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)
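A hedged sketch of the second option (better features), again assuming scikit-learn and synthetic data chosen purely for illustration:

# Sketch: a plain linear model underfits data with a quadratic relationship; feeding it
# polynomial features (simple feature engineering) lets it capture the structure.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (100, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.1, 100)        # underlying relationship is quadratic

linear = LinearRegression().fit(X, y)                           # too simple: underfits
richer = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)

print("linear model R^2 on training data    :", r2_score(y, linear.predict(X)))
print("with quadratic features, training R^2:", r2_score(y, richer.predict(X)))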
CHAPTER-II
Q) Sampling distribution of an estimator
In statistics, an estimator is a function of the data that is used to estimate an unknown
parameter of the population. The sampling distribution of an estimator refers to the
distribution of the estimator over many different samples drawn from the population.
More specifically, the sampling distribution of an estimator is the probability distribution of the
estimator's values when computed from a large number of random samples of fixed size, taken
from the same population.
For example, suppose we are interested in estimating the mean weight of all the students in a
school. We take a random sample of 100 students from the school and compute the sample
mean weight. We repeat this process many times, each time taking a different random sample
of 100 students. The sampling distribution of the sample mean weight is the distribution of the
mean weight computed from all these different samples.
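A small simulation makes this concrete (a sketch assuming NumPy and a made-up population of 10,000 student weights; the numbers are illustrative only):

# Sketch: approximating the sampling distribution of the sample mean by repeated sampling.
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=55, scale=8, size=10_000)       # hypothetical weights in kg

sample_means = [rng.choice(population, size=100, replace=False).mean()
                for _ in range(5_000)]                      # many samples of 100 students

print("mean of the sample means:", np.mean(sample_means))   # close to the population mean
print("std of the sample means :", np.std(sample_means))    # roughly 8 / sqrt(100) = 0.8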
In machine learning, the concept of an estimator is related to the idea of a model. An
estimator in machine learning is a function or algorithm that is used to estimate the value of
some unknown parameter or function based on a set of observed data. For example, in linear
regression, the estimator is a linear function that estimates the relationship between the input
variables and the output variable.
The sampling distribution of an estimator in machine learning can be thought of as the
distribution of the estimator's performance over many different samples drawn from the same
population. In this context, the "performance" of the estimator refers to its ability to
accurately predict the unknown parameter or function.
For example, suppose we are using a linear regression model to predict housing prices based on
various features of the house (e.g. size, location, number of rooms, etc.). We can generate
multiple random samples of houses from the same population and use each sample to train
and test our model. The sampling distribution of our estimator (i.e. the linear regression
model) would be the distribution of the model's performance (e.g. mean squared error, R-squared, etc.) over all of these different samples.
Understanding the sampling distribution of an estimator in machine learning is important
because it allows us to make statements about the model's expected performance on new,
unseen data. For example, we can use the sampling distribution to construct confidence
intervals for the model's predictions, or to perform hypothesis tests to determine whether the
model is significantly better than a baseline or alternative model.
1. Bootstrap
Bootstrap is a statistical resampling technique that can be used to estimate the sampling
distribution of an estimator in machine learning. It involves repeatedly sampling the original
dataset with replacement to generate a large number of "bootstrap samples", which are then
used to estimate the sampling distribution of the estimator.
To use bootstrap in machine learning, we first train our estimator (e.g. a machine learning
model) on the original dataset. We then generate a large number of bootstrap samples by
randomly sampling the original dataset with replacement. For each bootstrap sample, we train
a new instance of the estimator on the sample, and use it to estimate the parameter or
function of interest. We repeat this process many times to generate a large number of
estimates, which can be used to estimate the sampling distribution of the estimator.
For example, suppose we are using a decision tree to predict whether a customer will buy a
product based on their age, income, and other demographic information. We can use
bootstrap to estimate the sampling distribution of the decision tree's accuracy on new, unseen
data. We first train the decision tree on the original dataset. We then generate a large number
of bootstrap samples by randomly sampling the original dataset with replacement. For each
bootstrap sample, we train a new instance of the decision tree on the sample, and use it to
predict the outcomes of a test set. We repeat this process many times to generate a large
number of estimates of the decision tree's accuracy on new, unseen data. We can then use these
estimates to construct a confidence interval or perform hypothesis testing to make statements
about the decision tree's expected accuracy on new, unseen data.
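A hedged sketch of that procedure, assuming scikit-learn and a synthetic classification data set in place of real customer records (the number of bootstrap replicates and all other settings are illustrative):

# Sketch: bootstrapping the accuracy of a decision tree to get a confidence interval.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for b in range(200):                                        # 200 bootstrap replicates
    X_b, y_b = resample(X_train, y_train, random_state=b)   # sample with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X_b, y_b)
    scores.append(tree.score(X_test, y_test))               # accuracy on held-out data

lower, upper = np.percentile(scores, [2.5, 97.5])
print(f"mean accuracy: {np.mean(scores):.3f}, 95% interval: [{lower:.3f}, {upper:.3f}]")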
Bootstrap can be a powerful technique for estimating the sampling distribution of an
estimator, especially when the distribution is difficult or impossible to calculate analytically.
However, it can be computationally intensive, especially for large datasets or complex
models.
2. Large sample theory for the MLE
In machine learning, the maximum likelihood estimator (MLE) is a commonly used method for
estimating the parameters of a probabilistic model. Large sample theory is a branch of
statistics that studies the behavior of estimators as the sample size becomes very large.
The large sample theory for the MLE in machine learning is based on the idea that, as the sample
size increases, the distribution of the MLE becomes approximately normal (i.e. follows a normal
distribution). This result is known as the central limit theorem.
More specifically, under certain assumptions, as the sample size n approaches infinity, the
distribution of the MLE becomes approximately normal with mean equal to the true parameter
value and variance equal to the inverse of the Fisher information matrix evaluated at the true
parameter value. The Fisher information matrix is a measure of how much information the
data contains about the parameter.
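In symbols, this is the standard statement of the result (a sketch in LaTeX notation; θ* denotes the true parameter value and I_n the Fisher information of the full sample, symbols not used elsewhere in these notes):

\hat{\theta}_{\mathrm{MLE}} \;\approx\; \mathcal{N}\!\left(\theta^{*},\, I_n(\theta^{*})^{-1}\right)
\quad \text{for large } n,
\qquad \text{where} \quad
I_n(\theta) \;=\; -\,\mathbb{E}\!\left[\nabla^{2}_{\theta} \log p(\mathcal{D} \mid \theta)\right].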
This result has important implications for machine learning, as it allows us to make
statements about the expected performance of the MLE as the sample size increases. For
example, we can use the central limit theorem to construct confidence intervals for the MLE's
estimates, or to perform hypothesis tests to determine whether the estimates are significantly
different from a hypothesized value.
However, it is important to note that the large sample theory for the MLE assumes that
certain conditions are met, such as that the model is correctly specified and that the data are
independent and identically distributed. Violations of these assumptions can lead to biased or
inconsistent estimates, even in large samples. Therefore, it is important to carefully consider the
assumptions underlying the MLE and the large sample theory before applying them in
practice.
Q) Empirical Risk Minimization
Empirical Risk Minimization (ERM) is a principle in machine learning that involves finding a model that minimizes the average loss over the training data. The empirical risk of a model f on a dataset of n examples is defined as
R_emp(f) = (1/n) Σ L(yi, f(xi)), summed over i = 1, ..., n,
where n is the size of the dataset and L(yi, f(xi)) is the loss function that measures the discrepancy between the model's prediction f(xi) and the true output yi for each example i.
ERM involves finding the model that minimizes the empirical risk over the training data. This
is typically done by choosing a parametric form for the model (e.g. a neural network, decision
tree, linear regression, etc.) and then searching for the values of the parameters that
minimize the empirical risk. This process is often called "training" the model, and typically
involves using an optimization algorithm (e.g. stochastic gradient descent) to iteratively
update the parameters to reduce the empirical risk.
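A minimal sketch of this training loop, assuming NumPy, a linear model with squared loss, and plain (full-batch) gradient descent on synthetic data; the learning rate and iteration count are arbitrary illustrative choices:

# Sketch: minimizing the empirical risk (mean squared error) of a linear model by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, 200)

w = np.zeros(3)                                  # model parameters to be learned
lr = 0.1                                         # learning rate
for _ in range(500):
    residual = X @ w - y
    grad = (2 / len(y)) * X.T @ residual         # gradient of the empirical risk
    w -= lr * grad                               # step that reduces the empirical risk

print("estimated weights:", w)                   # should end up close to [1.5, -2.0, 0.5]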
One important consideration when using ERM is overfitting, which occurs when the model
becomes too complex and begins to fit the noise in the training data rather than the
underlying signal. Regularization techniques, such as L1 or L2 regularization, are often used
to prevent overfitting by adding a penalty term to the empirical risk that discourages the
model from using overly complex parameter values.
Overall, ERM is a powerful and widely used approach for training machine learning models,
but it is important to carefully consider the choice of loss function, regularization, and other
hyperparameters to avoid overfitting and ensure good generalization performance on unseen
data.
1. Regularized risk minimization
Regularized Risk Minimization (RRM) is a principle in machine learning that involves finding a
model that minimizes a regularized version of the expected risk on unseen data. It is similar to
Empirical Risk Minimization (ERM), but with the addition of a regularization term that
penalizes complex models and encourages simpler ones.
The regularized risk of a model is defined as
R_reg(f) = E[L(Y, f(X))] + λ Ω(f),
where E[L(Y, f(X))] is the expected loss of the model on unseen data, Ω(f) is a complexity measure of the model, and λ is a regularization parameter that controls the trade-off between the loss and complexity terms.
The regularized risk can be minimized by finding the model that achieves the optimal trade-off between the loss and complexity terms.
RRM is a powerful approach for training machine learning models that can help prevent
overfitting and improve generalization performance on unseen data. However, it is important
to choose an appropriate regularization technique and regularization parameter to balance the
trade-off between model complexity and performance.
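Continuing the earlier gradient-descent sketch, the only change under regularized risk minimization is the extra penalty term in the objective and its gradient (again a sketch with an arbitrary λ, not a recommended setting):

# Sketch: minimizing (1/n) * sum (y_i - w.x_i)^2 + lam * ||w||^2 (an L2-regularized risk).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 200)

lam = 0.1                                        # regularization parameter (lambda)
w = np.zeros(3)
for _ in range(500):
    grad = (2 / len(y)) * X.T @ (X @ w - y) + 2 * lam * w    # loss gradient + penalty gradient
    w -= 0.1 * grad

print("regularized weights:", w)                 # slightly shrunk toward zero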
2. Structural risk minimization
Structural Risk Minimization (SRM) is a principle in machine learning that aims to find a model
that minimizes the expected risk on unseen data while also controlling the complexity of the
model. It is similar to Empirical Risk Minimization (ERM) and Regularized Risk Minimization
(RRM), but with a focus on model selection and choosing the optimal level of complexity.
The basic idea behind SRM is to balance the bias-variance trade-off in model selection. A
model with low complexity (e.g. a linear model) may have low variance but high bias, while a
complex model (e.g. a deep neural network) may have low bias but high variance. SRM aims
to find the optimal level of complexity that minimizes both the bias and variance of the model.
One way to achieve this is to use a two-stage approach for model selection. In the first stage,
a family of models with varying complexity is generated, such as a set of neural networks with
different numbers of layers. In the second stage, the optimal model is chosen from the family
based on its performance on a validation set or through cross-validation.
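A hedged sketch of this two-stage idea, using polynomial degree as the complexity knob and a held-out validation set (scikit-learn and the synthetic data are illustrative assumptions):

# Sketch: generate a family of models of increasing complexity, then pick the one that
# performs best on a validation set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 300)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for degree in range(1, 10):                                   # the model family
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    scores[degree] = model.score(X_val, y_val)                # validation R^2

best = max(scores, key=scores.get)
print("chosen complexity (degree):", best)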
Another approach is to use a model selection criterion that penalizes both the loss and
complexity terms, such as the Bayesian Information Criterion (BIC) or the Akaike Information
Criterion (AIC). These criteria balance the trade-off between goodness of fit and model
complexity and can help to avoid overfitting.
SRM is an important principle in machine learning that can help to improve the generalization
performance of models by controlling their complexity. However, it can be computationally
expensive to search over a large family of models, and there is often a trade-off between
model complexity and performance that must be carefully balanced.
3. Estimating the risk using cross validation
Cross-validation is a widely used technique in machine learning for estimating the
generalization performance of a model and selecting its hyperparameters. One common
application of cross-validation is to estimate the risk of a model, which is the expected loss on
unseen data.
The basic idea behind cross-validation is to partition the available data into several subsets or
"folds". For example, in k-fold cross-validation, the data is divided into k non-overlapping folds
of equal size. Then, the model is trained on k-1 folds and evaluated on the remaining fold,
with the process repeated k times so that
each fold is used for evaluation once. The average performance of the model across the k
folds is then used as an estimate of its generalization performance.
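A short sketch of exactly this procedure with k = 5, assuming scikit-learn and a synthetic classification data set:

# Sketch: estimating a model's risk with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("per-fold accuracy:", scores)
print("estimated generalization accuracy:", scores.mean())   # average over the 5 folds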
To estimate the risk of a model using cross-validation, the data is first randomly divided into
training and testing sets. The training set is used to train the model with a specific set of
hyperparameters, while the testing set is used to evaluate its performance. However, since
there is only one testing set, the estimate of the risk may be biased and have high variance.
Cross-validation helps to mitigate this issue by using multiple testing sets and averaging the
results. By repeatedly training and evaluating the model on different subsets of the data, cross-validation provides a more robust estimate of the model's performance and helps to reduce the
variance of the estimate.
The choice of the number of folds (k) and the method for partitioning the data can affect the
performance and computational efficiency of cross-validation. For example, leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation where k is equal to the number of samples; it can be computationally expensive, and it provides a nearly unbiased (though often high-variance) estimate of the risk. On the other hand, stratified sampling can be used to ensure that the partitions of
the data are representative of the class distributions.
Overall, cross-validation is a powerful technique for estimating the risk of a model and
selecting its hyperparameters, and is widely used in machine learning for evaluating and
comparing different models.
4. Upper bounding the risk using statistical learning theory
Statistical learning theory is a framework in machine learning that provides theoretical
bounds on the generalization performance of models. These bounds can be used to estimate
the risk of a model and to guide the selection of hyperparameters.
The main idea behind statistical learning theory is to use the concept of empirical risk
minimization (ERM) to bound the expected risk of a model. ERM is a principle in machine
learning that aims to find a model that minimizes the empirical risk on the training data, which is
the average loss over the samples in the training set. The expected risk, on the other hand, is
the average loss over all possible data sets drawn from the underlying distribution.
The key insight of statistical learning theory is that the expected risk can be upper-bounded
by the empirical risk plus a term that depends on the complexity of the model and the size of
the data set. This term, known as the "generalization error", quantifies the difference between
the expected risk and the empirical risk and is often used as a measure of the model's
generalization performance.
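One classical bound of this kind, stated here as a sketch in LaTeX notation for a finite hypothesis class H with the 0-1 loss (R(h) is the expected risk, R̂_n(h) the empirical risk on n training examples, and δ a chosen confidence level; these symbols are not defined elsewhere in these notes):

\text{With probability at least } 1 - \delta:\qquad
R(h) \;\le\; \hat{R}_n(h) \;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2n}}
\qquad \text{for all } h \in \mathcal{H}.

The square-root term plays the role of the generalization error described above: it shrinks as the data set grows and grows with the size (complexity) of the hypothesis class.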
There are different types of bounds in statistical learning theory, such as the Rademacher
complexity bound, the VC-dimension bound, and the uniform convergence bound. These
bounds provide different levels of generality and tightness, depending on the assumptions
made about the underlying distribution and the complexity of the model.
In practice, statistical learning theory can be used to estimate the risk of a model by
evaluating its empirical risk on the training data and using the generalization error term to
compute an upper bound on the expected risk. This can help to guide the selection of
hyperparameters and to avoid overfitting, which occurs when the model fits the noise in the
training data and fails to generalize to new data.
Overall, statistical learning theory provides a powerful tool for understanding the
generalization performance of models and for guiding the design and evaluation of machine
learning algorithms.
5. Surrogate loss functions
A surrogate loss function in machine learning is a differentiable function that is used in place
of the true loss function when it is difficult or impossible to optimize directly. The surrogate
loss function is designed to approximate the behavior of the true loss function in a way that
makes it easier to optimize the model parameters.
In many machine learning tasks, the true loss function is non-differentiable or has other
properties that make it difficult to optimize directly. For example, the 0-1 loss function, which measures the number of misclassifications, is non-differentiable and discontinuous. Similarly, the hinge loss function, which is commonly used in support vector machines, is non-differentiable at the hinge point (where the margin equals one).
To overcome these challenges, surrogate loss functions are used to approximate the true loss
function while preserving desirable properties, such as differentiability and convexity. Surrogate
loss functions can be derived by making certain assumptions about the relationship between
the model outputs and the true labels, and by designing a function that reflects these
assumptions.
One common example of a surrogate loss function is the cross-entropy loss function, which is
used for binary classification and measures the distance between the predicted probabilities
and the true labels. The cross-entropy loss function is differentiable and convex, and is often
used as a surrogate for the 0-1 loss function, which is non-differentiable.
Another example is the hinge loss function, which is used for binary classification with support vector machines. The hinge loss function is non-differentiable at the hinge point, but it can be approximated by a differentiable function, such as the smoothed hinge loss function, which is convex and easier to optimize.
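A small sketch comparing the 0-1 loss with two common surrogates as a function of the margin m = y·f(x) (label y in {-1, +1}); the use of NumPy and the specific margin values are illustrative assumptions:

# Sketch: the 0-1 loss and two differentiable-friendly surrogates evaluated at several margins.
import numpy as np

def zero_one(m):
    return (m <= 0).astype(float)          # step function: non-differentiable, discontinuous

def hinge(m):
    return np.maximum(0.0, 1.0 - m)        # convex, with a kink at m = 1

def logistic(m):
    return np.log1p(np.exp(-m))            # smooth, convex surrogate (log-loss style)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, loss in [("0-1", zero_one), ("hinge", hinge), ("logistic", logistic)]:
    print(f"{name:10s}", np.round(loss(margins), 3))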
Surrogate loss functions can also be used in multi-class classification, regression, and other
machine learning tasks, where the true loss function may be difficult to optimize directly. The
choice of the surrogate loss function can have a significant impact on the performance of the
model and the speed of convergence, and is an important consideration in the design and
evaluation of machine learning algorithms.