Lecture 1
Contents:
1.1 Machine Learning, Types of Machine Learning, Issues in Machine Learning, Application of
Machine Learning, Steps in developing a Machine Learning Application.
1.2 Training Error, Generalization error, Overfitting, Underfitting, Bias Variance trade-off.
Machine learning is one of the most widely used techniques for predicting future outcomes and
classifying information to help people make decisions.
Artificial intelligence (AI) has an area called "machine learning" that focuses on creating models
and algorithms that let computers make decisions and predictions based on data without having
to be explicitly programmed. Creating computer systems that can automatically learn from
experience or examples is the main objective of machine learning.
Definition: Machine learning is a branch of artificial intelligence (AI) and computer science
which focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy.
Machine learning uses a collection of algorithms that operate on large amounts of data. These
algorithms are fed data to train them; after training, they produce a model that carries out a
certain task.
Machine Learning algorithms are trained over instances or examples through which they learn
from past experiences and also analyze the historical data.
Types of Machine Learning Algorithm
Supervised Learning – “Teach me what to learn”
Unsupervised Learning – “I will find what to learn”
Reinforcement Learning – “I’ll learn from my mistakes at every step (Hit & Trial)”
Supervised learning
Supervised learning is the machine learning task of learning a function that maps an
input to an output based on example input-output pairs. It infers a function from
labeled training data consisting of a set of training examples.
Unsupervised learning
Unsupervised learning is a type of machine learning algorithm used to draw inferences
from datasets consisting of input data without labeled responses. The most
common unsupervised learning method is cluster analysis, which is used for exploratory data
analysis to find hidden patterns or grouping in data.
Unsupervised learning algorithms allow users to perform more complex processing tasks than
supervised learning, but their results can be more unpredictable than those of other learning
methods.
Unsupervised learning algorithms include clustering, anomaly detection, neural networks, etc.
Reinforcement Learning
Reinforcement Learning is defined as a Machine Learning method that is concerned with how
software agents should take actions in an environment. It is one of the main branches of machine
learning, alongside supervised and unsupervised learning, and it helps the agent learn to
maximize some notion of cumulative reward.
Consider the example of teaching a cat a new trick.
As a cat doesn't understand English or any other human language, we can't tell her directly
what to do. Instead, we follow a different strategy. We set up a situation, and the cat
tries to respond in many different ways. If the cat's response is the desired one, we
give her fish.
Now whenever the cat is exposed to the same situation, she performs a similar action
even more enthusiastically, in expectation of getting more reward (food).
In this way the cat learns "what to do" from positive experiences.
At the same time, the cat also learns what not to do when faced with negative experiences.
The cat is an agent that is exposed to the environment, which in this case is your house. An
example of a state could be your cat sitting while you use a specific word to get the cat to
walk.
The agent reacts by performing an action that takes it from one "state" to another "state."
For example, your cat goes from sitting to walking.
The reaction of an agent is an action, and the policy is a method of selecting an action
given a state in expectation of better outcomes.
After the transition, the agent may get a reward or a penalty in return.
Machine learning and data processing
It helps you to create training systems that provide custom instruction and materials
according to the requirement of students.
Aircraft control and robot motion control
Machine learning algorithms are trained over instances or examples, through which they learn
from past experience and analyze historical data.
As an algorithm trains over the examples again and again, it is able to identify patterns and
use them to make predictions about the future.
Fig. 2 Steps in developing a Machine Learning Application: Data Gathering, Model Building /
Selecting suitable ML Algorithm, Training Model, Testing Model, Evaluating Model and Deployment
All these steps in developing a Machine Learning Application as shown in Fig 2 are explained in
detail.
1. Data Gathering
The first stage of the machine learning life cycle is data gathering. This step's objective is to
identify and collect all the data relevant to the problem.
As data can be gathered from a variety of sources, including files, databases, the internet, and
mobile devices, we must first identify the various data sources in this stage. It is one of the
most crucial phases of the life cycle: the effectiveness of the output depends on the quantity and
quality of the data gathered. In general, the more good-quality data there is, the more accurate
the predictions will be.
2. Data Preprocessing
We have to prepare the data for further use after gathering it. Data preparation is the process of
organizing and preparing our data for use in machine learning training.
In this stage, we first put all the data together and then randomize its order.
This stage can be separated into two steps:
Data exploration:
Data exploration is done to determine the type of data we are working with. We must recognize
the characteristics, formats, and quality of the data.
The better we understand the data, the more accurate our results can be. In this step we discover
correlations, broad trends, and outliers.
Pre-processing of data:
The next step is preparing the data for analysis, also called data cleaning or data
wrangling. In this process, the raw data is worked over iteratively until we get structured data
for analysis.
Data wrangling means cleaning unusable raw data and transforming it into a usable format:
formatting it properly, choosing the variables to use, and cleaning the data so that it is ready
for analysis in the following phase. It is among the most crucial steps in the entire procedure.
Data cleaning is necessary to address quality issues.
Not all of the data we have gathered will be useful to us. The issues that acquired data may
have in real-world applications include:
i. Missing Values
ii. Duplicate data
iii. Invalid data
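The three issues listed above can be illustrated with a short sketch in plain Python. The record layout (`name`, `age`) and the validity rule (age between 0 and 120) are assumptions made only for this example; real projects often impute missing values instead of dropping records.

```python
def clean_records(records):
    """Drop records with missing values, invalid ages, or duplicates."""
    cleaned = []
    seen = set()
    for rec in records:
        name, age = rec.get("name"), rec.get("age")
        # i. Missing values: skip records where a field is absent or None
        if name is None or age is None:
            continue
        # iii. Invalid data: assumed rule, age must be a number in [0, 120]
        if not isinstance(age, (int, float)) or not 0 <= age <= 120:
            continue
        # ii. Duplicate data: keep only the first occurrence
        key = (name, age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


# Made-up raw data containing one example of each problem
raw = [
    {"name": "Asha", "age": 34},
    {"name": "Ravi", "age": None},   # missing value
    {"name": "Asha", "age": 34},     # duplicate
    {"name": "Meera", "age": -5},    # invalid value
    {"name": "John", "age": 51},
]
print(clean_records(raw))
```

Only the first "Asha" record and the "John" record survive the cleaning pass.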
3. Data Modeling
The data has now been cleaned and prepared for the analysis step.
The goal of this step is to build a machine learning model that analyzes the data using a
variety of analytical methods, and then to review the results.
4. Training a Model
We train our machine learning model to improve its performance and obtain more accurate
results for the problem. Various machine learning algorithms are used to train the model on
datasets.
5. Testing a model
We test the machine learning model once it has been trained on a specific dataset. In this phase,
we give our model a test dataset to check how accurate it is. Testing the model establishes its
accuracy as a percentage, relative to the requirements of the project.
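As a rough illustration of the testing step, the sketch below computes accuracy as a percentage on a held-out test set. The `threshold_model` function stands in for an already trained model and, like the test values, is a hypothetical placeholder rather than anything from the lecture.

```python
def accuracy_percent(y_true, y_pred):
    """Share of test examples predicted correctly, as a percentage."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


def threshold_model(x, threshold=5.0):
    """Hypothetical 'trained' model: predicts class 1 when x exceeds the threshold."""
    return 1 if x > threshold else 0


# Held-out test set (made-up values): inputs and their true labels
test_inputs = [1.0, 4.5, 5.5, 7.0, 9.0]
test_labels = [0, 1, 1, 1, 1]

predictions = [threshold_model(x) for x in test_inputs]
print(accuracy_percent(test_labels, predictions))  # 4 of 5 correct -> 80.0
```

The key point is that the test examples are kept separate from the data used for training, so the percentage reflects performance on unseen inputs.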
6. Deployment
We deploy the model in the actual system if it delivers accurate predictions that meet our
requirements with acceptable speed. However, before deploying it we first check whether the
model's performance improves with the available data. The project's final report is made during
the deployment phase.
1. Poor Data Quality
Poor data quality is a major issue for ML applications. Noisy, incomplete, erroneous, and
unclean data produce low-quality outcomes and less accurate classification.
We must make sure that data preprocessing, which involves eliminating outliers, filtering out
missing values, and removing undesired features, is carried out as carefully as possible.
2. The Process of Machine Learning is Complex
The machine learning field is new and evolving rapidly, and much of the work proceeds by rapid
trial-and-error experiments. Because the process keeps changing, there are many opportunities
for error, which complicates learning. It entails a variety of tasks, such as data analysis, bias
removal, model training, and sophisticated mathematical computations. As a result, it is a
highly complex process and a significant challenge for machine learning experts.
4. Slow Implementation
Slow programs, data overload, and demanding requirements usually mean that machine learning
takes a lot of time to provide accurate results. It also requires constant monitoring and
maintenance to deliver the best outcomes.
5. Lack of skilled resources: As machine learning and artificial intelligence algorithms are
comparatively new and still under development, they keep changing as they mature. Machine
learning methods need mathematical, technical, and analytical support and are complex to
implement, so skilled manpower is essential to work with them. As the area is still growing,
there is a lack of such skilled resources.
Clustering automatically splits the dataset into groups based on their similarities.
Anomaly detection can discover unusual data points in your dataset. It is useful for
finding fraudulent transactions.
Association mining identifies sets of items which often occur together in your dataset.
Latent variable models are widely used for data preprocessing, such as reducing the number
of features in a dataset or decomposing the dataset into multiple components.
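To make the idea of clustering concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data in plain Python. The function name `kmeans_1d`, the sample points, and the choice of initial centroids are assumptions for illustration; practical implementations choose the starting centroids more carefully.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Cluster 1-D points around the given initial centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its previous centroid
            new_centroids.append(sum(members) / len(members) if members else centroids[k])
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, labels = kmeans_1d(points, centroids=[1.0, 12.0])
print(centroids, labels)  # two groups, centred on 2.0 and 11.0
```

No labels are supplied: the grouping comes only from the similarity (here, distance) between data points.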
As machine learning is becoming popular nowadays, many fields are applying machine
learning methods. The following are some prominent areas of application of machine learning.
1. Image Processing: Image recognition and image matching are common applications
of machine learning in image processing. They are used to identify, distinguish, and match
images of persons, objects, or places. Face recognition and object detection and tracking are
other important areas. E.g. automatic friend-tagging suggestions.
2. Speech recognition: This is one of the popular applications of machine learning. We can
operate a device or computer based on speech commands. E.g. "Search by voice" in
Google.
3. Prediction: Prediction tasks such as traffic prediction, population growth prediction, and
stock market prediction are prominent applications of machine learning.
E.g. Google Maps can show the shortest path and predict traffic conditions, and stock-trading
platforms such as Sharekhan can estimate future share prices.
4. Product recommendations: E-commerce sites such as Amazon and entertainment
platforms such as Netflix use ML to recommend products to their users based on
previous views or search results.
5. Self-driving vehicles: Self-driving automobiles are one of the most promising uses of
machine learning, and they depend significantly on it. E.g. the automaker Tesla is
developing self-driving vehicles. Unsupervised learning was used to train the car models
to recognize people and objects while driving.
6. Virtual personal assistant: Devices for personal assistance are based on voice and speech
recognition. E.g. Alexa, Cortana, Siri, and Google Assistant are a few common
applications.
7. Email Spam and Malware Filtering: Every new email that we get is immediately
classified as important, normal, or spam. Machine learning is the technology that
lets us consistently receive important emails in our inbox, marked with the important
symbol, and spam emails in our spam box.
8. Fraud Detection: Online banking transactions are exposed to various kinds of fraud.
Machine learning makes online transactions safer by detecting fraudulent
transactions.
9. Medical Diagnosis: Machine learning plays a vital role in medical applications like
automatic disease detection, patient monitoring, stroke detection, cancer detection, etc.
E.g. 3-D machine learning models are built to help predict the position, growth,
and stage of diseases such as cancer.
10. Natural Language Processing: Machine learning algorithms, together with natural
language processing (NLP), can now process language-based inputs from people, such as
text messages sent through a company's website. These algorithms can identify the topic
and tone of a communication to learn more about what customers want. E.g. many
businesses use chatbots to generate automatic responses and answer customer questions
submitted through their websites, and Google's GNMT (Google Neural Machine
Translation) is a machine learning system that translates text into a language the user
knows.
1.2.1 Training Error and generalization error
Be aware that the gap between predictions and observed data is caused by model inaccuracy,
sampling error, and noise. Some of these errors can be reduced, others cannot. Although
choosing the right approach and fine-tuning model parameters can increase the model's accuracy,
we will never be able to make predictions that are 100 percent accurate.
Training error is defined as the average loss that occurred during the training process. It is given
by:
$$E_{train} = \frac{1}{m_t} \sum_{i=1}^{m_t} L(y_i, \hat{y}_i)$$
Here, m_t is the size of the training set, y_i is the actual output, \hat{y}_i is the predicted
output, and the loss function L is the square of the difference between the actual output and
the predicted output. The above equation can be written as:
$$E_{train} = \frac{1}{m_t} \sum_{i=1}^{m_t} (y_i - \hat{y}_i)^2$$
We can take the square root of the above equation to calculate the Root Mean Square Error
(RMSE). It should be noted that the training error is usually lower than the test error.
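The training error with squared loss, and its square root (RMSE), can be computed directly from their definitions. The sketch below uses plain Python; the function names and the sample values are illustrative assumptions.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted outputs."""
    m = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / m


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error, in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Made-up actual and predicted outputs for a training set of size 4
y_train = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 11.0]

print(mean_squared_error(y_train, y_hat))       # (1 + 0 + 1 + 4) / 4 = 1.5
print(root_mean_squared_error(y_train, y_hat))  # sqrt(1.5)
```

Applying the same two functions to a held-out test set gives an estimate of the generalization error.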
As the complexity of the model is increased, the training error keeps decreasing, as
shown in Fig 3. Here, complexity refers to the number of input features used in the model or to
their degree (the power of x).
The generalization error is measured on an independent test set. The generalization error vs.
model complexity is plotted in Fig. 4.
1.2.2 Bias and Variance in Machine Learning
Bias: Bias is the prediction error introduced into the model by oversimplifying the
machine learning algorithm, that is, by the assumptions a model makes to make the target
function easier to learn. It shows up as the difference between the predicted values and the
actual values, and in practice as the error rate on the training data. When the error rate has a
high value, we call it high bias, and when the error rate has a low value, we call it low bias.
Variance: If the machine learning model performs well on the training dataset but does not
perform well on the test dataset, the model has variance. The difference between the error
rates on training data and testing data is called variance. If the difference is high, it is called
high variance, and when the difference in errors is low, it is called low variance. Usually, we
want low variance so that the model generalizes well.
Fig. 5 Example of bias and variance
Low bias results when a data engineer modifies the ML algorithm to fit a particular data
set more closely, but variance then increases. The model suits that data set, but the likelihood
of incorrect predictions on new data goes up.
The reverse holds when building a high bias, low variance model. Although the risk of erratic
predictions is reduced, the model will still not accurately match the data set.
As shown in Fig. 6, variance and bias are inversely related: reducing one generally increases
the other, so in practice an ML model cannot push both bias and variance arbitrarily low at the
same time.
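For squared-error loss, this trade-off is often summarized by the standard decomposition of the expected prediction error at a point x. The notation below is an assumption introduced for this illustration and does not come from the lecture figures: the data follow y = f(x) + noise with noise variance \sigma^2, and \hat{f} is the model learned from a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

The noise term cannot be reduced by any model, which is why the effort goes into balancing the first two terms.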
Increasing the complexity of the model- Making the model more complex reduces the overall
bias while raising the variance to an acceptable level. In this way the model is matched to the
training dataset without introducing large variance errors.
Increasing the training data set- This trade-off can also be balanced to some extent by expanding
the training data set. This approach is preferred when dealing with overfitting models.
With a larger data set, users can also increase model complexity without introducing as much
variance error into the model.
Underfitting and Overfitting
1. Underfitting
When a statistical model or machine learning algorithm is unable to capture the underlying
pattern in the data, so that it performs poorly on the training data as well as on the testing
data, this is referred to as underfitting.
Underfitting occurs when the model is unable to map the input data to the target data.
The model gives poor performance on the training dataset and is not complex
enough to fit the available data.
2. Overfitting
When a model fits noise or erroneous data points instead of the underlying pattern, it is said
to be overfit. This can happen with extremely complex models, since such a model will match
nearly every provided data point and perform well on the training dataset, but it will not be
able to generalize to the data points in the test data set and predict their outcome accurately.
A statistical model is said to be overfitted when it fits the training data closely but does not
make accurate predictions on testing data. When a model is trained too closely on the training
data, it starts learning from the noise and inaccurate data entries in the data set, and when it
is evaluated on test data the result is high variance. The model then does not categorize the
data correctly, because it has absorbed too many details and too much noise. Overfitting is
common with non-parametric and non-linear methods, because these types of machine learning
algorithms have more freedom in building the model based on the dataset and can therefore
build unrealistic models. One way to avoid overfitting is to use a linear algorithm if the data
is linear, or to limit parameters such as the maximal depth if we are using decision trees.
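A small experiment can make the symptom of overfitting visible. The sketch below uses a k-nearest-neighbour regressor (a non-parametric method) on synthetic data in plain Python; the data-generating rule y = 2x plus Gaussian noise, the seed, and all function names are assumptions for the demonstration. With k = 1 the model memorizes the training set, so its training error is exactly zero while its error on unseen points is not.

```python
import random


def knn_predict(x, train, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(data, train, k):
    """Mean squared error of the k-NN model (fitted on `train`) over `data`."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in data) / len(data)


def make_data(xs):
    """Synthetic data (an assumption for the demo): y = 2x plus Gaussian noise."""
    return [(x, 2 * x + random.gauss(0, 1)) for x in xs]


random.seed(0)
train = make_data([i * 0.5 for i in range(20)])         # x = 0.0, 0.5, ..., 9.5
test = make_data([i * 0.5 + 0.25 for i in range(20)])   # points between the training x values

for k in (1, 5):
    print(k, round(mse(train, train, k), 3), round(mse(test, train, k), 3))
```

Comparing the two printed rows shows the gap between training and test error for the memorizing model (k = 1) and for a smoother one (k = 5).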
Techniques to reduce overfitting:
i. Increase the amount of training data.
ii. Reduce model complexity by selecting a model with fewer parameters.
iii. Ridge regularization and Lasso regularization.
iv. Early stopping during the training phase.
v. Reduce the noise in the data.
vi. Reduce the number of attributes in the training data.
vii. Constrain the model.
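As one concrete instance of constraining a model, the sketch below shows ridge regularization for the simplest possible case: a single feature and no intercept. Minimizing the sum of squared errors plus lam * w^2 gives the closed form w = sum(x*y) / (sum(x*x) + lam). The function name, the penalty symbol `lam`, and the data are assumptions chosen for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope for 1-D ridge regression without an intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty lam is added to the denominator, shrinking the slope toward 0
    return sxy / (sxx + lam)


# Made-up data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_slope(xs, ys, lam))
```

With lam = 0 the ordinary least-squares slope 2.0 is recovered; larger penalties shrink the coefficient, which trades a little bias for lower variance.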
Decision Trees:
The decision tree is one of the most effective supervised learning methods for both
classification and regression applications. Decision trees are good solutions for many machine
learning problems because they are simple to understand and interpret. A decision tree breaks
complex decision-making into a sequence of simple decisions based on features, which
ultimately result in a choice or a numerical value.
It creates a tree structure resembling a flowchart, where each internal node represents a test on
an attribute, each branch a test result, and each leaf node (terminal node) a class label. The tree
is built by repeatedly splitting the training data into subsets based on the values of the
attributes until a stopping criterion is met, such as the maximum depth of the tree or the
minimum number of samples needed to split a node.
The process of building a decision tree involves recursively splitting the data based on different
features to maximize the information gain or minimize the impurity at each step. Each internal
node in a decision tree for classification stands in for a test on a particular feature, each branch
for the result of the test, and each leaf node for a class label. In a decision tree for
regression, the leaf nodes contain the predicted continuous values.
The level of impurity or unpredictability in the subsets is measured using metrics like entropy or
Gini impurity, and during training the decision tree algorithm uses these metrics to choose the
best attribute on which to split the data. The objective is to find the attribute that maximizes
the information gain, or impurity reduction, after the split.
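The impurity measures mentioned above are short to write down. The following sketch computes entropy, Gini impurity, and the information gain of a candidate split in plain Python; the function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # each subset is pure
useless_split = [["yes", "no"], ["yes", "no"]]   # subsets are as mixed as the parent

print(entropy(parent), gini(parent))             # 1.0 0.5
print(information_gain(parent, perfect_split))   # 1.0
print(information_gain(parent, useless_split))   # 0.0
```

A tree-building algorithm would evaluate such gains for every candidate attribute at a node and split on the one with the highest value.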
Tree Terminology
Root node: The root node of a tree represents the entire dataset and is located at the top of the
tree. It is where the decision-making process begins.
Internal/Decision Node: A node that represents a decision regarding an input feature. Internal
nodes connect to leaf nodes or other internal nodes by branching off of them.
Leaf/Terminal node: A node that has no children and represents a class label or a numerical
value.
Splitting: The process of splitting a node into two or more sub-nodes based on a split criterion
and a chosen attribute.
Branch/Sub-Tree: A subsection of the decision tree that begins at an internal node and ends at
the leaf nodes.
Parent node: A node that divides into one or more child nodes.
Child node: A node that emerges when a parent node is split.
Impurity: A measure of how mixed (non-homogeneous) the target variable is in a sample of data.
It reflects the level of uncertainty or unpredictability in a collection of examples. In decision
trees used for classification tasks, the Gini index and entropy are two impurity metrics that are
frequently used.
Variance: The degree to which the predicted and the target values differ across dataset
samples. It is used for decision tree regression problems, where split quality is measured using
mean squared error, mean absolute error, friedman_mse, or half Poisson deviance.
Information Gain: The amount of impurity reduced by segmenting a dataset based on a specific
feature in a decision tree is known as information gain. The feature that provides the maximum
information gain serves as the splitting criterion, which is used to choose the most informative
feature to split on at each node of the tree in order to produce pure subsets.
Pruning: The procedure of removing from a tree any branches that don't provide additional
information or that cause overfitting.
Figure: Decision tree structure with root node, decision nodes, sub-tree, and terminal nodes (TN)
o Dependent Variable: The main factor in regression analysis that we want to predict or
understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors which affect the dependent variable, or which are
used to predict its values, are called independent variables, also
called predictors.
o Outliers: An outlier is an observation with a very low or very high
value in comparison to other observed values. An outlier may distort the result, so it
should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other,
the condition is called multicollinearity. It should not be present in
the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with test dataset, then such problem is called Overfitting. And if our algorithm
does not perform well even with training dataset, then such problem is
called underfitting.
Regression Analysis:
Regression analysis is a predictive modelling method that analyses the relationship between
the target (dependent) variable and the independent variables in a dataset. When the target and
independent variables show a linear or non-linear relationship and the target variable has
continuous values, one of the various types of regression analysis techniques is applied.
Regression analysis is frequently used to identify cause-and-effect relationships, forecast
trends and time series, and determine predictor strength.
Machine learning can be used to tackle the regression problem using two different types of
regression analysis techniques:
1. Logistic regression and 2. Linear regression.
They are the most renowned regression approaches. Regression analysis approaches in machine
learning come in a variety of forms, and their use depends on the type of data being used.
1. Linear regression: One of the most fundamental kinds of regression in machine learning
is linear regression. The linear regression model consists of a predictor variable and a
dependent variable that are linearly related to one another. If the data involves more than
one independent variable, the model is called multiple linear regression.
Fig.8 Linear Regression
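Linear regression has two major types - simple linear regression and multiple linear
regression. The formula for simple linear regression, in a common notation, is-
$$y = b_0 + b_1 x + \varepsilon$$
where y is the dependent variable, x is the independent variable, b_0 is the intercept, b_1 is
the slope, and \varepsilon is the error term.
Simple linear regression can be used:
To determine how strongly two variables are correlated, such as the rate of global
warming and carbon emissions.
To determine the dependent variable's value for a given value of the independent
variable. For instance, finding the amount of atmospheric temperature increase associated
with a specific carbon dioxide emission.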
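A minimal sketch of fitting a simple linear regression by ordinary least squares is given below, in plain Python. It assumes the usual model with an intercept b0 and a slope b1 (names chosen here for illustration), and the data points are made up so that they lie exactly on a line.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # made-up data lying exactly on y = 1 + 2x

b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1)          # intercept 1.0, slope 2.0
print(b0 + b1 * 5.0)   # predicted value at x = 5
```

With noisy real data the points would not sit on the line, and the same formulas would return the line that minimizes the sum of squared residuals.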
2. Logistic regression
Logistic regression is the form of regression analysis used when the dependent variable is
discrete, for instance true or false, 0 or 1, etc. The target variable can then take only
two values, and the relationship between the target variable and the independent variable
is represented by a sigmoid curve, as shown in Fig 7.
In logistic regression, the logit function is used to quantify the connection between the
dependent and independent variables. The logistic regression equation is shown below.
$$\log\left(\frac{y}{1-y}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
Here y is the predicted probability of the positive class, x_1, ..., x_n are the independent
variables, and b_0, ..., b_n are the coefficients.
It should be noted that when choosing logistic regression as the regression analysis approach,
the data set should be large and the values of the target variable should occur with almost
equal frequency. Additionally, there shouldn't be any multicollinearity, which means that the
dataset's independent variables shouldn't be highly correlated with one another.
The probability of a binary event occurring is determined using logistic regression, which is also
used to solve categorization problems. For instance, detecting if an incoming email is spam or
not, or predicting whether a credit card transaction is fraudulent or not.
o Logistic regression is one of the most popular supervised learning techniques in machine
learning. It is used for predicting a categorical dependent variable from a given set of
independent variables.
o Logistic regression predicts the output of a categorical dependent variable, so the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic
values which lie between 0 and 1.
o The main difference between linear regression and logistic regression is how they are
used. Logistic regression is used to solve classification problems, while linear regression
is used to solve regression problems.
o In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, whose output is bounded by the two extreme values 0 and 1.
o The curve from the logistic function indicates the likelihood of something, such as
whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
o Logistic regression is a key machine learning algorithm because it can classify new data
using both continuous and discrete datasets.
o Logistic regression can be used to classify observations using a wide range of data
types and can readily identify the variables that are most effective for the classification.
The logistic function is depicted in the picture below.
Figure: Logistic function (S-shaped curve) with threshold value
Assumptions:
There shouldn't be any multicollinearity among the independent variables.
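The equation for logistic regression can be obtained from the linear regression
equation. The mathematical steps to get the logistic regression equation are given below:
o We start from the equation of a straight line:
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
o In logistic regression y can be between 0 and 1 only, so for this let's divide the above
equation by (1-y):
$$\frac{y}{1-y}, \quad \text{which is } 0 \text{ for } y = 0 \text{ and tends to infinity as } y \to 1$$
o But we need a range between -infinity and +infinity, so we take the logarithm of the
equation, and it becomes:
$$\log\left(\frac{y}{1-y}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$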
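The steps above can be checked numerically. The sketch below, in plain Python, defines the sigmoid (logistic) function and its inverse, the logit, and then classifies an input with a 0.5 threshold. The coefficient values `b0` and `b1` are hypothetical; in practice they would be learned from training data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(y):
    """Log-odds log(y / (1 - y)); the inverse of the sigmoid for 0 < y < 1."""
    return math.log(y / (1.0 - y))


def predict_class(x, b0, b1, threshold=0.5):
    """Predict class 1 when the modelled probability reaches the threshold."""
    probability = sigmoid(b0 + b1 * x)
    return 1 if probability >= threshold else 0


print(sigmoid(0.0))                         # 0.5, the midpoint of the S-curve
print(logit(sigmoid(2.0)))                  # approximately 2.0: logit undoes sigmoid
print(predict_class(3.0, b0=-4.0, b1=2.0))  # z = 2, probability above 0.5 -> class 1
print(predict_class(1.0, b0=-4.0, b1=2.0))  # z = -2, probability below 0.5 -> class 0
```

Because the log-odds are linear in x, solving the last equation for y gives exactly the sigmoid of the linear expression, which is what `predict_class` evaluates.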
o Binomial: In binomial Logistic regression, there are only two possible types of the
dependent variables, E.g. 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there are 3 or more possible unordered
types of the dependent variable, such as "apple", "sweet lemon", or "orange".
o Ordinal: In ordinal Logistic regression, there are 3 or more possible ordered types of
dependent variables, such as "High", "Medium", or "Low".
University Questions
Q. 1 What is Machine learning? How is it different from Data mining? (5M)
Q.2 Define Machine learning. Explain with example importance of Machine learning. (5M)
Q.3 What are the key tasks of Machine learning? (5M)
Q.4 Explain how supervised learning is different from unsupervised learning. (5M)
Q.5 Explain steps in developing Machine learning application. (5M)
Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy. In short machine learning is –
i. The ability for a machine to automatically learn from data,
ii. Enhance performance based on prior experiences, and
iii. Make predictions
Types of Machine Learning Algorithm
o Supervised Learning – “Teach me what to learn”
o Unsupervised Learning – “I will find what to learn”
o Reinforcement Learning – “I’ll learn from my mistakes at every step (Hit &
Trial)”
Variance is the difference between the error rates of training data and testing data. If the
difference is high, it is called high variance, and when the difference in errors is low, it is
called low variance. Usually, we want low variance so that the model generalizes well.
In supervised learning applications in machine learning and statistical learning theory,
generalization error (also known as the out-of-sample error) is a measure of how
accurately an algorithm is able to predict outcome values for previously unseen data.
[Wikipedia]
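Training error is defined as the average loss that occurred during the training process. It
is given by:
$$E_{train} = \frac{1}{m_t} \sum_{i=1}^{m_t} L(y_i, \hat{y}_i)$$
where m_t is the size of the training set and L is the loss between the actual output y_i and
the predicted output \hat{y}_i.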