Introduction to Machine Learning
TY (CSE)
Machine Learning
Prof. S.L.Birajdar
Unit 1
Introduction to Machine Learning
Contents
1.1 Definition of Machine Learning
1.4 Evaluation
1.7 Overfitting
1.1 Definition of Machine Learning
"Machine learning is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed."
OR
"A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E."
1.2 Types of Machine Learning
a) Supervised Learning:
Supervised machine learning algorithms are designed to learn by example. The name
“supervised” learning originates from the idea that training this type of algorithm is like
having a teacher supervise the whole process.
When training a supervised learning algorithm, the training data will consist of inputs paired
with the correct outputs. During training, the algorithm will search for patterns in the data
that correlate with the desired outputs. After training, a supervised learning algorithm will
take in new, unseen inputs and determine which label they should be classified under, based
on the prior training data.
Fig. Supervised Learning.
The objective of a supervised learning model is to predict the correct label for newly
presented input data. In its most basic form, a supervised learning algorithm can be written
simply as:
Y = f(x)
where Y is the predicted output, determined by a mapping function f that assigns a class
to an input value x. The function used to connect input features to a predicted output is
created by the machine learning model during training.
1. Classification:
Fig. Classification.
During training, a classification algorithm will be given data points with an assigned
category. The job of a classification algorithm is to then take an input value and assign it a
class, or category, that it fits into based on the training data provided.
The most common example of classification is determining if an email is spam or not. With
two classes to choose from (spam, or not spam), this problem is called a binary classification
problem. The algorithm will be given training data with emails that are both spam and not
spam. The model will find the features within the data that correlate to either class and create
the mapping function mentioned earlier: Y=f(x). Then, when provided with an unseen email,
the model will use this function to determine whether or not the email is spam.
Classification problems can be solved with numerous algorithms. Which algorithm you
choose to use depends on the data and the situation.
Here are a few popular classification algorithms:
• Linear Classifiers
• Support Vector Machines
• Decision Trees
• K-Nearest Neighbor
• Random Forest
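As a hedged illustration, here is a minimal sketch of the spam example above using one of these algorithms, K-Nearest Neighbor, assuming scikit-learn is available; the tiny feature matrix (e.g. [number of links, count of the word "free"]) is invented purely for illustration:

# A minimal sketch of binary classification (spam vs. not spam).
# Assumes scikit-learn is installed; the feature values below are invented.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[8, 5], [7, 6], [0, 1], [1, 0], [9, 4], [0, 0]]  # input features
y_train = [1, 1, 0, 0, 1, 0]                                # 1 = spam, 0 = not spam

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)        # learn the mapping Y = f(x) from examples

print(model.predict([[6, 5]]))     # likely [1] (spam)
print(model.predict([[0, 1]]))     # likely [0] (not spam)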
2. Regression:
Regression is a predictive statistical process where the model attempts to find the important
relationship between dependent and independent variables. The goal of a regression
algorithm is to predict a continuous number such as sales, income, and test scores. The
equation for basic linear regression can be written as follows:
y = w[0] * x[0] + w[1] * x[1] + … + w[n] * x[n] + b
where x[i] are the features of the data and w[i] and b are parameters which are
developed during training. For simple linear regression models with only one feature in the
data, the formula looks like this:
y = w * x + b
Where w is the slope, x is the single feature and b is the y-intercept. Familiar? For simple
regression problems such as this, the model's predictions are represented by the line of best
fit. For models using two features, a plane is used. Finally, for a model using more
than two features, a hyperplane is used. Imagine we want to determine a student's test
grade based on how many hours they studied the week of the test. Let's say the plotted data
with a line of best fit looks like this:
Fig. Regression.
There is a clear positive correlation between hours studied (independent variable) and the
student's final test score (dependent variable). A line of best fit can be drawn through the
data points to show the model's predictions when given a new input. Say we wanted to know
how well a student would do with five hours of studying. We can use the line of best fit to
predict the test score based on other students' performances.
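A minimal sketch of fitting such a line of best fit with NumPy least squares; the (hours studied, test score) pairs below are invented for illustration:

# Fit a line of best fit (simple linear regression) with least squares.
import numpy as np

hours = np.array([1, 2, 3, 4, 6, 7, 8])           # independent variable
scores = np.array([52, 58, 63, 70, 80, 85, 90])   # dependent variable

w, b = np.polyfit(hours, scores, deg=1)           # slope w and intercept b
print(f"score ~ {w:.2f} * hours + {b:.2f}")
print("Predicted score for 5 hours:", w * 5 + b)  # read the prediction off the line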
b) Unsupervised Learning:
Unsupervised learning is a machine learning technique where you do not need to supervise
the model. Instead, you allow the model to work on its own to discover information. It
mainly deals with unlabelled data. Unsupervised learning algorithms allow you to perform
more complex processing tasks compared to supervised learning, although unsupervised
learning can be more unpredictable than supervised methods.
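As a concrete sketch of discovering structure in unlabelled data, here is k-means clustering, assuming scikit-learn is available; the 2-D points are invented for illustration:

# Cluster unlabelled points with k-means; no labels are given to the model.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignments discovered on its own
print(kmeans.cluster_centers_)   # the two cluster centres it found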
c) Reinforcement Learning:
It is the ability of an agent to interact with the environment and find out what the best
outcome is. It follows the concept of the hit-and-trial (trial-and-error) method. The agent is
rewarded or penalized with a point for a correct or a wrong answer, and on the basis of the
positive reward points gained, the model trains itself. Once trained, it is ready to predict
new data presented to it.
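To make the reward-driven idea concrete, here is a minimal sketch of a tabular Q-learning update; the tiny two-state environment, its rewards, and the constants are invented purely for illustration, not a definitive implementation:

import random

# A tiny, hypothetical 2-state environment: action 1 is "correct" (+1 reward),
# action 0 is "wrong" (-1 penalty).
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]   # the agent's value table
alpha, gamma = 0.5, 0.9                            # learning rate, discount

def step(state, action):
    reward = 1.0 if action == 1 else -1.0
    next_state = (state + 1) % n_states
    return reward, next_state

state = 0
for _ in range(100):
    action = random.randrange(n_actions)           # hit-and-trial exploration
    reward, next_state = step(state, action)
    # the reward or penalty updates the agent's estimate of the action's value
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

print(Q)   # action 1 ends up with the higher value in both states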
As an analogy: supervised learning is like a student under the supervision of a teacher at
both home and school; unsupervised learning is like a student who has to figure out a
concept himself; and semi-supervised learning is like a teacher who teaches a few concepts
in class and gives homework questions based on similar concepts.
b) Inductive Bias:- The inductive bias (also known as learning bias) of a learning
algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it
has not encountered. In machine learning, one aims to construct algorithms that are able
to learn to predict a certain target output. To achieve this, the learning algorithm is presented
with some training examples that demonstrate the intended relation of input and output
values. The learner is then supposed to approximate the correct output, even for examples
that have not been shown during training. Without any additional assumptions, this problem
cannot be solved exactly, since unseen situations might have an arbitrary output value. The
necessary assumptions about the nature of the target function are subsumed under the
phrase inductive bias.
1.3.1 Find-S Algorithm:
In order to understand Find-S algorithm, you need to have a basic idea of the following
concepts as well:
1. Concept Learning
2. General Hypothesis
3. Specific Hypothesis
1. Concept Learning:
Most of human learning is based on past instances or experiences. For example, we are able
to identify any type of vehicle based on a certain set of features, like make and model, that
are defined over a large set of features.
These special features differentiate the set of cars, trucks, etc. from the larger set of vehicles.
The features that define a set such as cars or trucks are known as concepts.
Similar to this, machines can also learn from concepts to identify whether an object belongs
to a specific category or not. Any algorithm that supports concept learning requires the
following:
• Training Data
• Target Concept
• Actual Data Objects
2. General Hypothesis:
A hypothesis is, in general, an explanation for something. The general hypothesis states the
general relationship between the major variables; for the example used below, it would
simply be "I want a burger."
G = {'?', '?', '?', ……, '?'}
3. Specific Hypothesis:
The specific hypothesis fills in all the important details about the variables given in the
general hypothesis. Adding more specific details to the example given above would give:
"I want a cheeseburger with a chicken pepperoni filling with a lot of lettuce."
S = {'Φ', 'Φ', 'Φ', ……, 'Φ'}
The Find-S algorithm proceeds as follows:
1. The process starts by initializing 'h' with the most specific hypothesis; generally, it is
the first positive example in the data set.
2. We check each example. If the example is negative, we move on to the next example;
if it is a positive example, we consider it for the next step.
3. We check whether each attribute in the example is equal to the corresponding
hypothesis value.
4. If the value matches, then no change is made.
5. If the value does not match, the value is changed to '?'.
6. We do this until we reach the last positive example in the data set.
To understand the implementation, let us apply it to a small data set with a handful of
examples to decide whether a person wants to go for a walk.
The concept for this particular problem is: on which days does a person like to go for a
walk.
Fig. Dataset
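Since the original data table is not reproduced here, the sketch below implements Find-S on a small invented "go for a walk" data set; the attribute names (Weather, Temperature, Company) and the rows are assumptions made purely for illustration:

# A minimal sketch of Find-S; the last column is the target ("Yes" = positive).
data = [
    ["Sunny", "Warm", "Friend", "Yes"],
    ["Sunny", "Cold", "Friend", "Yes"],
    ["Rainy", "Cold", "Alone",  "No"],
    ["Sunny", "Warm", "Alone",  "Yes"],
]

def find_s(examples):
    h = None
    for *attrs, label in examples:
        if label != "Yes":          # Find-S ignores negative examples
            continue
        if h is None:               # initialize with the first positive example
            h = list(attrs)
        else:                       # generalize mismatching attributes to '?'
            h = [a if a == b else "?" for a, b in zip(h, attrs)]
    return h

print(find_s(data))   # ['Sunny', '?', '?']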
There are a few limitations of the Find-S algorithm, listed below:
1. There is no way to determine whether the hypothesis is consistent throughout the data.
2. Inconsistent training sets can actually mislead the Find-S algorithm, since it ignores
negative examples.
3. The Find-S algorithm does not provide a backtracking technique to determine the best
possible changes that could be made to improve the resulting hypothesis.
1.3.2 Candidate Elimination Algorithm:
The Candidate Elimination algorithm computes the version space containing all (and only
those) hypotheses from H that are consistent with an observed sequence of training
examples.
➢ Candidate Elimination Algorithm performs a bidirectional search in the hypothesis
space.
➢ It maintains a set, S, of the most specific hypotheses that are consistent with the
training data, and a set, G, of the most general hypotheses that are consistent with the
training data.
➢ These two sets form two boundaries on the version space. The algorithm considers
both positive and negative examples, and maintains both hypotheses, i.e. the most
general and the most specific.
➢ For a positive example: we tend to generalize the specific hypothesis S.
➢ For a negative example: we tend to specialize the general hypothesis G.
Fig. Dataset
Initialization:
S0: {<ɸ, ɸ, ɸ, ɸ, ɸ, ɸ>}
G0: {<?, ?, ?, ?, ?, ?>}
Example 1 (positive): x = <Sunny, Warm, Normal, Strong, Warm, Same>
S1: {<Sunny, Warm, Normal, Strong, Warm, Same>}
G1: {<?, ?, ?, ?, ?, ?>}
Example 2 (positive): x = <Sunny, Warm, High, Strong, Warm, Same>
S2: {<Sunny, Warm, ?, Strong, Warm, Same>}
G2: {<?, ?, ?, ?, ?, ?>}
Example 3 (negative): x = <Rainy, Cold, High, Strong, Warm, Change>
S3: {<Sunny, Warm, ?, Strong, Warm, Same>}
G3: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
Example 4 (positive): x = <Sunny, Warm, High, Strong, Cool, Change>
S4: {<Sunny, Warm, ?, Strong, ?, ?>}
G4: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
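The boundary updates above can be sketched in code. The following is a simplified version of the Candidate Elimination updates (it specializes G using the values stored in S, which is enough to reproduce the trace above but is not the fully general version-space algorithm):

def consistent(h, x):
    # a hypothesis covers x if every attribute is '?' or matches x
    return all(a == "?" or a == b for a, b in zip(h, x))

def candidate_elimination(examples):
    n = len(examples[0][0])
    S = ["0"] * n                 # most specific boundary (ɸ everywhere)
    G = [["?"] * n]               # most general boundary
    for x, positive in examples:
        if positive:
            # generalize S just enough to cover x; drop inconsistent g from G
            S = list(x) if S == ["0"] * n else [a if a == b else "?" for a, b in zip(S, x)]
            G = [g for g in G if consistent(g, x)]
        else:
            # specialize each g in G that wrongly covers x, using values from S
            new_G = []
            for g in G:
                if not consistent(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    if g[i] == "?" and S[i] != "?" and S[i] != x[i]:
                        sp = list(g)
                        sp[i] = S[i]
                        new_G.append(sp)
            G = new_G
    return S, G

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(examples)
print("S:", S)   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
print("G:", G)   # [['Sunny', ...], ['?', 'Warm', ...]]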
b) Cross-Validation: Cross-validation is a technique in which we train our model using the subset
of the data-set and then evaluate using the complementary subset of the data-set.
1. Validation
In this method, we perform training on 50% of the given data set and the remaining 50%
is used for testing. The major drawback of this method is that, since we perform training
on only 50% of the dataset, it may be that the remaining 50% of the data contains
important information which we leave out while training our model, i.e. higher bias.
2. LOOCV (Leave-One-Out Cross-Validation)
In this method, we perform training on the whole data set but leave out only one data point,
and we iterate this for each data point. It has some advantages as well as disadvantages. An
advantage of using this method is that we make use of all data points, and hence it has low
bias.
The major drawback of this method is that it leads to higher variation in testing the model,
as we are testing against a single data point. If the data point is an outlier, it can lead to
higher variation. Another drawback is that it takes a lot of execution time, as it iterates as
many times as there are data points.
3. K-Fold Cross-Validation
In this method, we split the data set into k subsets (known as folds), then we perform
training on k−1 of the subsets and leave one subset for the evaluation of the trained model.
We iterate k times, with a different subset reserved for testing each time.
Example:
The diagram below shows an example of the training subsets and evaluation subsets
generated in k-fold cross-validation. Here, we have a total of 25 instances and k = 5. In the
first iteration we use the first 20 percent of the data for evaluation and the remaining 80
percent for training ([1-5] testing and [6-25] training); in the second iteration we use the
second subset of 20 percent for evaluation and the remaining data for training ([6-10]
testing and [1-5, 11-25] training), and so on.
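A minimal sketch of this 5-fold split, assuming scikit-learn's KFold is available; each of the 25 instances is used for evaluation exactly once:

# Generate the 5 train/test splits for 25 instances, as in the example above.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25).reshape(25, 1)      # instances 1..25 (indices 0..24)
kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: test instances {test_idx + 1}, "
          f"{len(train_idx)} training instances")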
Linear Regression:
Linear regression is a linear model that assumes a linear relationship between the input
variables (x) and the single output variable (y). More specifically, that y can be calculated
from a linear combination of the input variables (x).
When there is a single input variable (x), the method is referred to as simple linear
regression.
When there are multiple input variables, the method is known as multiple linear regression.
In simple linear regression, we establish a relationship between target variable and input
variables by fitting a line, known as the regression line.
In general, a line can be represented by linear equation y = m * X + b. Where, y is the
dependent variable, X is the independent variable, m is the slope, b is the intercept.
In machine learning, we rewrite our equation as y(x) = w0 + w1 * x, where the w's are the
parameters of the model, x is the input, and y is the target variable.
The above equation can be used when we have one input variable (also called feature).
However, in general, we usually deal with datasets which have multiple input variables.
The case when we have more than one feature is known as multiple linear regression, or
simply, linear regression. We can generalize our previous equation for simple linear
regression to multiple linear regression:
y(x) = w0 + w1 * x1 + w2 * x2 + … + wn * xn
In the case of multiple linear regression, instead of our prediction being a line in
2-dimensional space, it is a hyperplane in n-dimensional space.
The following terminologies are important to be familiar with before moving on to the linear
regression algorithm.
a) Cost Function:
The best fit line is based on the linear equation given below:
Y = b0 + b1 * x + e
• b0 is the intercept of the line.
• b1 is the slope of the line, and x represents the independent variables that determine
the prediction of Y.
• The error in the resultant prediction is denoted by e.
The cost function provides the best possible values for b0 and b1 to make the best fit line for
the data points. We do it by converting this problem into a minimization problem to get the
best values for b0 and b1. The error is minimized in this problem between the actual value and
the predicted value.
We choose the following function to minimize the error: we square the error difference
between the predicted and the actual value, sum these squared errors over all data points,
and divide by the total number of data points. The produced value is the averaged squared
error over all data points:
MSE = (1/n) * Σ (predicted_i − actual_i)²
This is known as MSE (Mean Squared Error), and we change the values of b0 and b1 so that
the MSE settles at its minimum.
b) Gradient Descent:
To update b0 and b1, we take gradients from the cost function. To find these gradients, we
take partial derivatives of the cost function with respect to b0 and b1. These partial
derivatives are the gradients and are used to update the values of b0 and b1:
b0 := b0 − L * ∂J/∂b0 and b1 := b1 − L * ∂J/∂b1, where L is the learning rate.
Fig. Learning rate.
A smaller learning rate gets us closer to the minimum, but it takes more time; with a larger
learning rate, less time is taken, but there is a chance of overshooting the minimum value.
Now that we are through with the terminologies in linear regression, let us take a look
at a few advantages and disadvantages of linear regression for machine learning.
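Putting the cost function and the gradients together, here is a minimal sketch of gradient descent for simple linear regression; the data points and the learning rate are invented for illustration:

# Gradient descent minimising MSE = (1/n) * sum((b0 + b1*x - y)^2).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])   # roughly y = 2x + 1

b0, b1 = 0.0, 0.0
learning_rate = 0.01

for _ in range(5000):
    pred = b0 + b1 * x
    error = pred - y
    # partial derivatives of the MSE with respect to b0 and b1
    grad_b0 = (2.0 / len(x)) * error.sum()
    grad_b1 = (2.0 / len(x)) * (error * x).sum()
    b0 -= learning_rate * grad_b0           # step against the gradient
    b1 -= learning_rate * grad_b1

print(f"intercept b0 ~ {b0:.2f}, slope b1 ~ {b1:.2f}")  # close to 1.0 and 2.0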
Decision Trees:
The basic steps for building a decision tree are:
• Get the list of rows (dataset) which are taken into consideration for making the
decision tree (recursively at each node).
• Calculate the uncertainty of our dataset, i.e. the Gini impurity, or how mixed up our
data is (see the impurity sketch after this list).
• Generate a list of all questions which need to be asked at that node.
• Partition the rows into True rows and False rows based on each question asked.
• Calculate the information gain based on the Gini impurity and the partition of the data
from the previous step.
• Update the highest information gain based on each question asked.
• Update the best question based on the information gain (higher information gain).
• Divide the node on the best question. Repeat from step 1 until we get pure nodes
(leaf nodes).
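Here is a minimal sketch of the Gini impurity and information gain computations used to score each candidate question; the labels are invented for illustration:

# Gini impurity of a node and the information gain of one candidate split.
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def info_gain(parent, true_rows, false_rows):
    p = len(true_rows) / len(parent)          # weight of the True branch
    return gini(parent) - p * gini(true_rows) - (1 - p) * gini(false_rows)

parent = ["spam", "spam", "ham", "ham", "ham"]
true_rows, false_rows = ["spam", "spam"], ["ham", "ham", "ham"]  # one split
print(gini(parent))                               # 0.48 (mixed node)
print(info_gain(parent, true_rows, false_rows))   # 0.48 (a perfect split)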
1.7 Overfitting
a) Underfitting:
A statistical model is said to be underfitted when it cannot capture the underlying trend of
the data. It usually happens when we have too little data to build an accurate model, or
when we try to build a linear model with non-linear data. An underfitted model performs
poorly even on the training data.
b) Overfitting:
A statistical model is said to be overfitted when we train it with a lot of data. When a
model gets trained with so much data, it starts learning from the noise and inaccurate
data entries in our data set. The model then does not categorize the data correctly,
because of too many details and noise. Overfitting is often caused by non-parametric and
non-linear methods, because these types of machine learning algorithms have more
freedom in building the model based on the dataset and can therefore build unrealistic
models. A solution to avoid overfitting is to use a linear algorithm if we have linear data,
or to use parameters like the maximal depth if we are using decision trees.
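As a hedged illustration of the maximal-depth remedy mentioned above, the following sketch (assuming scikit-learn and a synthetic data set) compares an unconstrained decision tree with a depth-limited one:

# Compare an unconstrained tree with a depth-limited tree on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)           # unconstrained
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# the unconstrained tree typically scores ~1.0 on training data but lower on
# test data (overfitting); the depth-limited tree often generalises better
print("deep:    train", deep.score(X_tr, y_tr), "test", deep.score(X_te, y_te))
print("shallow: train", shallow.score(X_tr, y_tr), "test", shallow.score(X_te, y_te))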
c) Good Fit:
Ideally, a model that makes predictions with zero error is said to have a good fit on the
data. This situation is achievable at a spot between overfitting and underfitting. In order to
understand it, we have to look at the performance of our model over time, while it is
learning from the training dataset.
As time passes, our model will keep on learning, and the error of the model on the training
and testing data will keep on decreasing. If it learns for too long, the model will become
more prone to overfitting due to the presence of noise and less useful details, and the
performance of our model will decrease. In order to get a good fit, we stop at a point just
before the error on the testing data starts increasing. At this point the model is said to
perform well on the training dataset as well as on our unseen testing dataset.
Important Questions
Q.1] What is machine learning? Explain different types of machine learning.
Q.2] Explain types of supervised learning.
Q.3] Explain hypothesis space and inductive bias.
Q.4] Explain Find-S algorithm.
Q.5] Explain Candidate Elimination algorithm.
Q.6] Explain cross-validation.
Q.7] Explain Linear regression.
Q.8] Explain Decision trees.
Q.9] Explain underfitting & overfitting in machine learning.