Machine Learning and Web Scraping Lesson02
Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:
•If the given shape has four sides, and all the sides are equal, it is labelled a Square.
•If the given shape has three sides, it is labelled a Triangle.
•If the given shape has six equal sides, it is labelled a Hexagon.
After training, we test the model using the test set; the task of the model is to identify the shape. Since the machine has already been trained on all types of shapes, when it encounters a new shape it classifies it on the basis of its number of sides and predicts the output.
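The labelling rules above can be sketched as a tiny rule-based classifier (a hand-written sketch; the function name and side-count inputs are illustrative, not from the lesson):

```python
def classify_shape(num_sides, all_sides_equal=False):
    """Label a shape by its number of sides, mirroring the training rules above."""
    if num_sides == 3:
        return "triangle"
    if num_sides == 4 and all_sides_equal:
        return "square"
    if num_sides == 6 and all_sides_equal:
        return "hexagon"
    return "unknown"

print(classify_shape(4, all_sides_equal=True))  # square
print(classify_shape(3))                        # triangle
```

A real supervised model would learn these rules from labelled examples rather than having them hard-coded.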
Steps Involved in Supervised Learning:
•Determine the type of training dataset.
•Collect/gather the labelled training data.
•Split the data into a training set, a test set, and a validation set.
•Determine the input features of the training dataset; they should carry enough information for the model to accurately predict the output.
•Choose a suitable algorithm for the model, such as a support vector machine or a decision tree.
•Run the algorithm on the training set. Sometimes validation sets are needed to tune control parameters; these are subsets of the training data.
•Evaluate the accuracy of the model using the test set. If the model predicts the correct outputs, the model is accurate.
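The splitting step above can be sketched in plain Python (the 70/15/15 fractions and the toy dataset are illustrative assumptions, not from the lesson):

```python
import random

def split_dataset(data, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle labelled data, then split it into train/validation/test subsets."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

data = [(i, i % 2) for i in range(100)]  # 100 (feature, label) pairs
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))   # 70 15 15
```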
Types of Supervised Machine Learning Algorithms:
Supervised learning can be further divided into two types of problems:
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output variable. It is
used for the prediction of continuous variables, such as Weather forecasting, Market Trends, etc. Below are some
popular Regression algorithms which come under supervised learning:
•Linear Regression
•Regression Trees
•Non-Linear Regression
•Bayesian Linear Regression
•Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, meaning it takes one of a discrete set of classes such as Yes-No, Male-Female, or True-False (spam filtering is a classic example). Popular classification algorithms include:
•Random Forest
•Decision Trees
•Logistic Regression
•Support vector Machines
Advantages of Supervised learning:
•With the help of supervised learning, the model can predict the output on the basis of prior experiences.
•In supervised learning, we can have an exact idea about the classes of objects.
•Supervised learning models help us solve various real-world problems such as fraud detection and spam
filtering.
Regression is a supervised learning technique that helps find correlations between variables and enables us to predict a continuous output variable from one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining cause-and-effect relationships between variables.
In regression, we fit a line or curve to the given datapoints; using this fit, the machine learning model can make predictions about the data. In simple words, regression finds the line or curve on the target-predictor graph that best fits the datapoints, i.e. the one for which the vertical distance between the datapoints and the regression line is minimal. The size of these distances tells us whether the model has captured a strong relationship or not.
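The "vertical distance" idea can be made concrete by computing the sum of squared residuals for a candidate line; the line that minimises this quantity is the regression line (the datapoints and candidate lines here are illustrative):

```python
def sum_squared_residuals(points, slope, intercept):
    """Total squared vertical distance between datapoints and the line y = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

points = [(1, 2.1), (2, 3.9), (3, 6.2)]          # roughly y = 2x
good = sum_squared_residuals(points, slope=2.0, intercept=0.0)
bad = sum_squared_residuals(points, slope=0.5, intercept=1.0)
print(good < bad)  # True: the closer line has a smaller total distance
```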
Some examples of regression are:
•Prediction of rain using temperature and other factors
•Determining Market trends
•Prediction of road accidents due to rash driving.
Terminologies Related to the Regression Analysis:
•Dependent Variable: The main factor in regression analysis, which we want to predict or understand, is called
the dependent variable. It is also called the target variable.
•Independent Variable: The factors which affect the dependent variable, or which are used to predict its values,
are called independent variables, also known as predictors.
•Outliers: An outlier is an observation with a very low or very high value in comparison to the other observed
values. An outlier may distort the result, so it should be handled or removed.
•Multicollinearity: If the independent variables are highly correlated with each other, the condition is called
multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most
influential variables.
•Underfitting and Overfitting: If our algorithm works well on the training dataset but not on the test dataset, the
problem is called overfitting. If our algorithm does not perform well even on the training dataset, the problem is
called underfitting.
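The multicollinearity check described above can be sketched by computing the Pearson correlation between two feature columns (the height data is an illustrative example of two nearly identical features):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two feature columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]  # the same quantity in inches
print(round(pearson(height_cm, height_in), 3))  # ≈ 1.0: strong multicollinearity
```

A correlation close to ±1 between two independent variables signals that one of them is redundant.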
Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various
real-world scenarios where we need future predictions, such as weather conditions, sales figures, and market
trends, and for these we need a technique that can make predictions accurately. Regression analysis is such a
technique: a statistical method used in machine learning and data science. Some other reasons for using
regression analysis:
•Regression estimates the relationship between the target and the independent variables.
•It is used to find trends in the data.
•It helps to predict real/continuous values.
•By performing regression, we can determine the most important factors, the least important factors, and how
the factors affect each other.
Types of Regression
There are various types of regression used in data science and machine learning. Each type has its own
importance in different scenarios, but at the core, all regression methods analyze the effect of the independent
variables on the dependent variable. Some important types of regression are given below:
•Linear Regression
•Logistic Regression
•Polynomial Regression
•Support Vector Regression
•Decision Tree Regression
•Random Forest Regression
•Ridge Regression
•Lasso Regression
Linear Regression:
•Linear regression is a statistical regression method which is used for predictive analysis.
•It is one of the simplest and easiest regression algorithms, and it shows the relationship between continuous
variables.
•It is used for solving the regression problem in machine learning.
•Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent
variable (Y-axis), hence called linear regression.
•If there is only one input variable (x), then such linear regression is called simple linear regression. And if there
is more than one input variable, then such linear regression is called multiple linear regression.
•The relationship between variables in a linear regression model can be illustrated with a plot of, for example,
an employee's salary against years of experience.
•Below is the mathematical equation for linear regression:
Y = aX + b
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients (slope and intercept).
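The coefficients a and b in Y = aX + b can be estimated with the standard least-squares formulas; a minimal sketch in plain Python (the salary-versus-experience data is a made-up, perfectly linear example):

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

experience = [1, 2, 3, 4, 5]       # years of experience
salary = [32, 34, 36, 38, 40]      # salary in thousands: exactly 30 + 2*x
a, b = fit_line(experience, salary)
print(a, b)  # 2.0 30.0
```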
Some popular applications of linear regression are:
•Analyzing trends and sales estimates
•Salary forecasting
•Real estate prediction
•Arriving at ETAs in traffic.
Logistic Regression:
•Logistic regression is another supervised learning algorithm which is used to solve the classification problems.
In classification problems, we have dependent variables in a binary or discrete format such as 0 or 1.
•Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or False, Spam
or not spam, etc.
•It is a predictive analysis algorithm which works on the concept of probability.
•Logistic regression is a type of regression, but it differs from the linear regression algorithm in how it is used.
•Logistic regression uses the sigmoid function (logistic function), which maps any real-valued input to a value
between 0 and 1. This sigmoid function is used to model the data in logistic regression. The function can be
represented as:
f(x) = 1 / (1 + e^(-x))
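A minimal implementation of the sigmoid function f(x) = 1 / (1 + e^(-x)) in Python:

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))              # 0.5
print(sigmoid(10) > 0.99)      # True: large positive inputs approach 1
print(sigmoid(-10) < 0.01)     # True: large negative inputs approach 0
```

Because its output lies between 0 and 1, it can be read as the probability of the positive class, which is why it suits binary classification.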