
Predictive Analysis: Linear Methods for Regression and Classification

Data Analysis is a process of studying, cleaning, modeling, and transforming data with the
purpose of finding useful information, suggesting conclusions, and supporting decision-
making.

Data Analysis Process


Data analysis was pioneered by the statistician John Tukey in the 1970s. He described it as procedures for analyzing data, techniques for interpreting the results of such procedures, and ways of planning the gathering of data to make its analysis easier, more precise, and more accurate.
Therefore, data analysis is a process of taking large, often unstructured data from different sources and converting it into useful information through the following steps:
• Data Requirements Specification
• Data Collection
• Data Processing
• Data Cleaning
• Data Analysis
• Communication

Need for Data Analysis


Data analytics is significant for optimizing business performance. An organization can use data analytics to make better business decisions and to analyze customer trends and satisfaction, which can lead to new and better products and services. Embedding analytics into the business model means businesses can reduce costs by identifying more efficient ways of doing business.
Tools Used in Data Analysis
• Microsoft Excel
• Python
• R
• Jupyter Notebook
• Apache Spark
• SAS
• Microsoft Power BI
• Tableau
• KNIME
Predictive Analytics

Predictive analytics is a branch of advanced analytics that utilizes statistical algorithms and
machine learning techniques to analyze historical data and make predictions about future events
or trends. It involves the use of data, statistical algorithms, and machine learning models to
identify the likelihood of future outcomes based on patterns and trends observed in historical
data.

Key components and steps in predictive analytics include:

Data Collection: Gathering relevant data from various sources, including historical records,
databases, and other data repositories.
Data Cleaning and Preprocessing: Ensuring data quality by handling missing values, outliers,
and formatting issues to prepare the data for analysis.
Feature Selection: Identifying and selecting the most relevant variables or features that
contribute to the predictive model.
Model Building: Using statistical algorithms or machine learning models to analyze historical
data and create a predictive model. Common techniques include linear regression, decision trees,
random forests, support vector machines, and neural networks.
Training the Model: Using historical data to train the predictive model, allowing it to learn
patterns and relationships within the data.
Validation and Testing: Assessing the accuracy and performance of the predictive model using a
separate set of data that it has not seen before. This helps ensure the model's generalization to
new, unseen data.
Deployment: Implementing the predictive model in real-world scenarios to make predictions on
new data.
Monitoring and Updating: Continuously monitoring the model's performance and updating it as
needed to maintain accuracy, especially as new data becomes available.
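
To make these steps concrete, here is a minimal sketch in Python using scikit-learn. It is an illustration, not the only way to implement the pipeline: the synthetic dataset, column names, and choice of a random forest are hypothetical stand-ins for a real project.

# Minimal predictive-analytics pipeline sketch (assumes scikit-learn, pandas,
# and NumPy are installed; data and column names are synthetic placeholders).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data Collection: synthetic "historical" records standing in for a real source.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "feature_1": rng.normal(size=500),
    "feature_2": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
data["target"] = (data["feature_1"] + data["feature_2"] > 0).astype(int)

# Data Cleaning and Preprocessing: drop rows with missing values.
data = data.dropna()

# Feature Selection: keep only the columns believed to be predictive.
X = data[["feature_1", "feature_2"]]
y = data["target"]

# Hold out unseen data before training, for Validation and Testing later.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model Building and Training: fit a random forest on the training split.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Validation and Testing: assess performance on data the model has not seen.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Deployment (sketch): the fitted model can now score new incoming records,
# e.g. predictions = model.predict(new_records).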

Predictive analytics is applied in various fields, including finance, healthcare, marketing, manufacturing, and more. Examples of its use include predicting customer behavior, stock prices, equipment failures, disease outbreaks, and many other future events. It enables organizations to make more informed decisions, optimize processes, and gain a competitive advantage by anticipating outcomes and taking proactive measures.

Supervised learning

Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that some input data is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y). In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.

How Supervised Learning Works?

In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on held-out test data (data kept separate from the training set), and then it predicts the output.

The working of supervised learning can be easily understood by the following example:

Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:

o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output, as in the sketch below.
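
The shape example can be turned into a tiny code sketch. The hand-built features (number of sides, whether all sides are equal) and labels below are purely illustrative, and a decision tree is just one reasonable choice of classifier here.

# Toy version of the shape example (assumes scikit-learn is installed).
from sklearn.tree import DecisionTreeClassifier

# Features: [number_of_sides, all_sides_equal (1 = yes, 0 = no)]
X_train = [[4, 1], [4, 0], [3, 0], [3, 1], [6, 1]]
y_train = ["square", "rectangle", "triangle", "triangle", "hexagon"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# A new, unseen shape with four equal sides is classified from its side count.
print(model.predict([[4, 1]]))  # expected: ['square']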

Steps Involved in Supervised Learning:


o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training set, a test set, and a validation set (see the split sketch after this list).
o Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need the validation set to tune control parameters; it is a subset of the training data.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the correct outputs, the model is accurate.
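
The split step above is commonly realized with two calls to scikit-learn's train_test_split, as in this sketch; the 60/20/20 ratio and the synthetic data are arbitrary choices for illustration.

# Splitting data into training, validation, and test sets (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic labelled data standing in for a real dataset.
X, y = make_classification(n_samples=100, n_features=4, random_state=42)

# First carve out the test set (20%), then split the rest into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

# 0.25 of the remaining 80% equals 20% of the original data.
print(len(X_train), len(X_val), len(X_test))  # 60 20 20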

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when there is a relationship between the input variable and a continuous output variable. They are used for the prediction of continuous values, as in weather forecasting, market trend analysis, etc. Below are some popular regression algorithms which come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is categorical, meaning there are two or more discrete classes such as Yes-No, Male-Female, True-False, etc. A classic example is spam filtering. Below are some popular classification algorithms which come under supervised learning:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Advantages of Supervised learning:

o With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us solve various real-world problems such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling very complex tasks.
o Supervised learning cannot predict the correct output if the test data differs substantially from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Advantages of Unsupervised Learning

o Unsupervised learning is used for more complex tasks as compared to supervised learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is often preferable because it is easier to obtain unlabeled data than labeled data.
Regression Analysis in Machine learning

Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes in response to one independent variable while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A that runs various advertisements every year and earns sales from them. The list below shows the advertisements made by the company in the last 5 years and the corresponding sales:

Now, the company wants to spend $200 on advertising in the year 2019 and wants to predict the sales for that year. To solve such prediction problems in machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining cause-and-effect relationships between variables.

In regression, we plot a graph between the variables that best fits the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimized." The distance between the data points and the line tells whether the model has captured a strong relationship or not.

Some examples of regression are:

o Prediction of rain using temperature and other factors
o Determining market trends
o Prediction of road accidents due to rash driving

Types of Regression

There are various types of regression used in data science and machine learning. Each type has its own importance in different scenarios, but at their core, all regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:

o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest and easiest algorithms; it works on regression and shows the relationship between continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between the variables in a linear regression model can be illustrated with an example in which we predict the salary of an employee on the basis of years of experience.
o Below is the mathematical equation for Linear regression:

Y = aX + b

Here, Y is the dependent variable (target variable),
X is the independent variable (predictor variable), and
a and b are the linear coefficients (the slope and the intercept of the line).
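
As a small sketch of fitting Y = aX + b, the code below uses NumPy's polyfit on made-up experience/salary numbers; any least-squares routine would do.

# Fitting the line Y = aX + b (assumes NumPy; the data points are invented).
import numpy as np

experience = np.array([1, 2, 3, 4, 5], dtype=float)                  # X, in years
salary = np.array([30000, 35000, 41000, 46000, 52000], dtype=float)  # Y

# polyfit with degree 1 returns the least-squares estimates of a and b.
a, b = np.polyfit(experience, salary, 1)
print(f"Y = {a:.1f} * X + {b:.1f}")

# Predict the salary for 6 years of experience.
print("Predicted salary:", a * 6 + b)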

Some popular applications of linear regression are:

o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic

Linear Regression Line

A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis while the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis while the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.

Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to do this, we use a cost function.

Cost Function:

o Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
o The cost function optimizes the regression coefficients or weights and measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. For the above linear equation, MSE can be calculated as:

MSE = (1/N) * Σ (Yi - (a1*xi + a0))²

Where,

N = total number of observations
Yi = actual value
(a1*xi + a0) = predicted value

Residuals: The distance between an actual value and the predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small.
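
The MSE cost can be computed by hand for any candidate line (a1, a0), as in this short sketch; the data values and candidate coefficients are illustrative only.

# Computing the MSE cost for a candidate line (assumes NumPy).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # actual values Yi

a1, a0 = 2.0, 0.1               # candidate slope and intercept
predicted = a1 * x + a0         # (a1*xi + a0)
residuals = y - predicted       # distance between actual and predicted values
mse = np.mean(residuals ** 2)   # average of the squared errors
print("MSE:", mse)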

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o This is done by randomly selecting initial coefficient values and then iteratively updating them to reach the minimum of the cost function, as in the sketch below.
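
A bare-bones gradient descent loop for the MSE cost looks like the sketch below. The starting coefficients, learning rate, and iteration count are arbitrary choices; in practice these are tuned.

# Gradient descent for simple linear regression (assumes NumPy; data invented).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

a1, a0 = 0.0, 0.0   # initial coefficient values
lr = 0.02           # learning rate

for _ in range(5000):
    error = (a1 * x + a0) - y  # prediction error for the current line
    # Gradients of the MSE cost with respect to a1 and a0.
    grad_a1 = 2 * np.mean(error * x)
    grad_a0 = 2 * np.mean(error)
    # Update the coefficients in the direction that reduces the cost.
    a1 -= lr * grad_a1
    a0 -= lr * grad_a0

print(f"a1 = {a1:.3f}, a0 = {a0:.3f}")  # approaches the least-squares fit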

Model Performance:

The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the method below:

1. R-squared method:

o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and the actual values, and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
o It can be calculated from the formula below:

R-squared = Explained variation / Total variation = 1 - (SSres / SStot)

where SSres is the sum of squared residuals and SStot is the total sum of squares.
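
R-squared can be computed directly from its definition, as in this small sketch; the actual and predicted arrays are invented for illustration.

# Computing R-squared from residual and total sums of squares (assumes NumPy).
import numpy as np

y_actual = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
y_predicted = np.array([2.1, 4.1, 6.1, 8.1, 10.1])

ss_res = np.sum((y_actual - y_predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.4f}")  # close to 1 indicates a good fit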

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are formal checks to perform while building a linear regression model, which help ensure you get the best possible results from the given dataset.

o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable; in other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation in which the variance of the error term is the same for all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of the data in a scatter plot of the residuals.
o Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal distribution. If the error terms are not normally distributed, confidence intervals will become either too wide or too narrow, which may cause difficulties in estimating coefficients. This can be checked using a q-q plot: if the plot shows a straight line without large deviations, the errors are normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs when there is a dependency between residual errors. A sketch of two of these checks follows.
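
Two of these assumptions can be checked with a few lines of Python; the residuals and feature matrix below are simulated purely to show the mechanics (assumes NumPy, SciPy, and Matplotlib).

# Checking normality of errors (q-q plot) and multicollinearity (correlations).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

residuals = np.random.normal(loc=0.0, scale=1.0, size=200)  # stand-in residuals

# Normality of error terms: points lying on a straight line suggest normality.
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()

# Multicollinearity: inspect the correlation matrix of the features.
features = np.random.rand(200, 3)  # stand-in feature matrix
print(np.corrcoef(features, rowvar=False))  # values near +/-1 signal trouble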

What Is the Least Squares Method?

The least squares method is a form of mathematical regression analysis used to determine
the line of best fit for a set of data, providing a visual demonstration of the relationship between
the data points. Each point of data represents the relationship between a known independent
variable and an unknown dependent variable. This method is commonly used by statisticians
and traders who want to identify trading opportunities and trends.

The least squares method is a form of regression analysis that provides the overall rationale for
the placement of the line of best fit among the data points being studied. It begins with a set of
data points using two variables, which are plotted on a graph along the x- and y-axis. Traders
and analysts can use this as a tool to pinpoint bullish and bearish trends in the market along with
potential trading opportunities.

The most common application of this method is referred to as linear or ordinary least squares. It aims to create a straight line that minimizes the sum of the squares of the errors generated by the results of the associated equations, i.e., the squared residuals resulting from the differences between the observed values and the values anticipated by the model.
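
The ordinary least squares line has a closed-form solution: the slope is the sum of (x - mean(x)) * (y - mean(y)) divided by the sum of (x - mean(x))², and the intercept is mean(y) - slope * mean(x). The sketch below applies it to invented data points.

# Closed-form ordinary least squares fit (assumes NumPy; data invented).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 5, 7, 8, 11], dtype=float)

x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean
print(f"best fit: y = {slope:.3f}x + {intercept:.3f}")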

Advantages and Disadvantages of the Least Squares Method


The best way to find the line of best fit is by using the least squares method. But traders and
analysts may come across some issues, as this isn't always a fool-proof way to do so. Some of
the pros and cons of using this method are listed below.

Advantages

One of the main benefits of using this method is that it is easy to apply and understand. That's
because it only uses two variables (one that is shown along the x-axis and the other on the y-
axis) while highlighting the best relationship between them.

Investors and analysts can use the least square method by analyzing past performance and
making predictions about future trends in the economy and stock markets. As such, it can be
used as a decision-making tool.

Disadvantages

The primary disadvantage of the least squares method lies in the data used. In its simple form, it can only highlight the relationship between two variables, so it doesn't take any others into account. And if there are any outliers, the results become skewed.

Another problem with this method is that it assumes the data are evenly distributed around the fitted line. If this isn't the case, the results may not be reliable.

Multiple Linear Regression

In the previous topic, we learned about Simple Linear Regression, where a single independent/predictor variable (X) is used to model the response variable (Y). But there may be various cases in which the response variable is affected by more than one predictor variable; for such cases, the Multiple Linear Regression (MLR) algorithm is used.

Multiple Linear Regression is an extension of Simple Linear Regression in that it takes more than one predictor variable to predict the response variable.

Some key points about MLR:

o For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or independent variables may be continuous or categorical.
o Each feature variable must model a linear relationship with the dependent variable.
o MLR tries to fit a regression hyperplane through a multidimensional space of data points, as in the sketch below.
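
Here is a minimal MLR sketch with two predictor variables using scikit-learn; the feature matrix (e.g., years of experience and education level) and targets are made-up numbers.

# Multiple linear regression with two predictors (assumes scikit-learn).
from sklearn.linear_model import LinearRegression

# Two predictors per observation: [years_experience, education_level].
X = [[1, 2], [2, 2], [3, 3], [4, 3], [5, 4]]
y = [30000, 35000, 42000, 47000, 55000]

model = LinearRegression()
model.fit(X, y)

print("coefficients:", model.coef_)   # one coefficient per predictor
print("intercept:", model.intercept_)
print("prediction:", model.predict([[6, 4]]))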

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc.; but instead of giving exact values of 0 and 1, it gives probabilistic values which lie between 0 and 1.
o Logistic regression is much like linear regression except in how it is used. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification. The logistic function is described below.

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between the classes 0 and 1: values above the threshold tend toward 1, and values below the threshold tend toward 0. A small sketch follows.
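
A small numerical sketch of the sigmoid and thresholding behaviour follows; the 0.5 threshold is the conventional default, but the right choice is application-specific.

# The sigmoid maps any real value into (0, 1); a threshold turns the
# probability into a class label (assumes NumPy).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)  # above the threshold -> class 1
print(probs)   # roughly [0.007 0.269 0.5 0.731 0.993]
print(labels)  # [0 0 1 1 1]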

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not have multicollinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation.
The mathematical steps to get Logistic Regression equations are given below:

o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 - y):

y / (1 - y); 0 for y = 0, and infinity for y = 1

o But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:

log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.
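
The derivation can be checked numerically: applying the sigmoid to a linear score and then taking log(y / (1 - y)) recovers the original linear expression. The coefficients below are arbitrary example values.

# Numerical check of the logistic regression equation (assumes NumPy).
import numpy as np

b0, b1 = -2.0, 0.8                  # example coefficients
x = np.array([0.0, 1.0, 2.5, 5.0])

linear = b0 + b1 * x                # the straight-line equation
y = 1.0 / (1.0 + np.exp(-linear))   # sigmoid squashes it into (0, 1)

logit = np.log(y / (1 - y))         # the log-odds
print(np.allclose(logit, linear))   # True: both sides of the final equation match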


Types of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
