Data Analytics Unit 1
Regression and Classification
Data Analysis is a process of studying, cleaning, modeling, and transforming data with the
purpose of finding useful information, suggesting conclusions, and supporting decision-
making.
Predictive analytics is a branch of advanced analytics that utilizes statistical algorithms and
machine learning techniques to analyze historical data and make predictions about future events
or trends. It identifies the likelihood of future outcomes based on patterns and trends observed
in historical data. A typical predictive analytics workflow includes the following steps:
Data Collection: Gathering relevant data from various sources, including historical records,
databases, and other data repositories.
Data Cleaning and Preprocessing: Ensuring data quality by handling missing values, outliers,
and formatting issues to prepare the data for analysis.
Feature Selection: Identifying and selecting the most relevant variables or features that
contribute to the predictive model.
Model Building: Using statistical algorithms or machine learning models to analyze historical
data and create a predictive model. Common techniques include linear regression, decision trees,
random forests, support vector machines, and neural networks.
Training the Model: Using historical data to train the predictive model, allowing it to learn
patterns and relationships within the data.
Validation and Testing: Assessing the accuracy and performance of the predictive model using a
separate set of data that it has not seen before. This helps ensure the model's generalization to
new, unseen data.
Deployment: Implementing the predictive model in real-world scenarios to make predictions on
new data.
Monitoring and Updating: Continuously monitoring the model's performance and updating it as
needed to maintain accuracy, especially as new data becomes available.
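As a rough illustration of this workflow end to end, the sketch below fits and evaluates a model on a small synthetic dataset using scikit-learn; the data, the choice of linear regression, and the 80/20 split are assumptions for illustration, not prescribed by the workflow itself.

```python
# A minimal sketch of the predictive analytics workflow (illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Data collection: here, synthetic "historical records"
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))        # one predictor feature
y = 3.0 * X.ravel() + rng.normal(0, 1, 100)  # target with noise

# Validation/testing: hold out data the model has never seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model building and training
model = LinearRegression()
model.fit(X_train, y_train)

# Assess performance on unseen data before deployment
predictions = model.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, predictions))
```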
Supervised learning
Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means input data that is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as the supervisor that
teaches the machine to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on test data (a
held-out subset of the dataset that was not used for training), and then it predicts the output.
The working of supervised learning can be easily understood by the following example:
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle,
and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a Triangle.
o If the given shape has six equal sides, then it will be labelled as a Hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies
the shape on the basis of its number of sides and predicts the output.
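A toy sketch of this shape example is shown below; the feature encoding (number of sides plus a flag for equal sides) is an assumption made for illustration.

```python
# Toy shape classifier: features are [number_of_sides, all_sides_equal (1/0)],
# labels are the shape names from the example above.
from sklearn.tree import DecisionTreeClassifier

X_train = [[4, 1], [4, 0], [3, 0], [3, 1], [6, 1]]
y_train = ["square", "rectangle", "triangle", "triangle", "hexagon"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# A new, unseen shape with four equal sides is classified by its sides
print(model.predict([[4, 1]]))  # expected: ['square']
```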
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather forecasting,
Market Trends, etc. Below are some popular Regression algorithms which come under
supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are
two classes such as Yes-No, Male-Female, True-false, etc.
Spam Filtering,
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
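As a small illustration of classification with one of these algorithms, the sketch below trains a logistic regression model on a made-up, spam-filtering-style dataset; the feature (a count of suspicious words) and all values are hypothetical.

```python
# Minimal binary classification sketch (hypothetical spam-filter data).
from sklearn.linear_model import LogisticRegression

# Feature: [number of suspicious words in a message]; label: 1 = spam, 0 = not spam
X = [[0], [1], [2], [6], [8], [10]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[7]]))  # expected: [1] (spam)
```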
Advantages of Supervised learning:
o With the help of supervised learning, the model can predict the output on the basis of
prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
Disadvantages of Supervised learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
Regression Analysis
Regression analysis is a statistical method to model the relationship between a dependent (target)
variable and one or more independent (predictor) variables. More specifically, regression
analysis helps us to understand how the value of the dependent variable changes corresponding
to an independent variable when the other independent variables are held fixed. It predicts
continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which runs various advertisements every
year and gets sales accordingly. The company has records of the advertisement spend and the
corresponding sales for the last 5 years.
Now, the company wants to spend $200 on advertisement in the year 2019 and wants to predict
the sales for that year. To solve such prediction problems in machine learning, we need
regression analysis.
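The original table of figures is not reproduced in these notes, so the sketch below uses made-up advertisement/sales numbers purely to show how such a prediction would be made with simple linear regression.

```python
# Hypothetical advertisement/sales data (stand-ins for the company's records).
import numpy as np

ad_spend = np.array([90, 120, 150, 100, 130])     # yearly ad spend ($), hypothetical
sales = np.array([1000, 1300, 1800, 1200, 1380])  # yearly sales ($), hypothetical

# Fit a line sales = a1 * ad_spend + a0 by least squares
a1, a0 = np.polyfit(ad_spend, sales, deg=1)

# Predict sales for a $200 advertisement budget
print("Predicted sales:", a1 * 200 + a0)
```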
Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining cause-and-effect relationships between variables.
In regression, we plot a graph between the variables that best fits the given datapoints; using
this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through the datapoints on the target-
predictor graph in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between the datapoints and the line tells whether a
model has captured a strong relationship or not.
Types of Regression
There are various types of regressions which are used in data science and machine learning. Each
type has its own importance on different scenarios, but at the core, all the regression methods
analyze the effect of the independent variable on dependent variables. Here we are discussing
some important types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest and easiest regression algorithms, and it shows the
relationship between continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained with
a simple example: predicting the salary of an employee on the basis of years of
experience.
o Below is the mathematical equation for linear regression:
Y = a1*X + a0
where Y is the dependent (target) variable, X is the independent (predictor) variable, a1 is the
slope of the line, and a0 is the intercept.
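As a sketch of how a1 and a0 are obtained, the snippet below uses the standard closed-form least squares solution (a1 = cov(X, Y) / var(X), a0 = mean(Y) - a1 * mean(X)); the experience/salary values are illustrative.

```python
# Closed-form simple linear regression fit for Y = a1*X + a0.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # years of experience (illustrative)
Y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])  # salary in $1000s (illustrative)

a1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a0 = Y.mean() - a1 * X.mean()
print(f"regression line: Y = {a1:.2f}*X + {a0:.2f}")
```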
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on
the X-axis, then such a relationship is termed a Positive linear relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on
the X-axis, then such a relationship is termed a Negative linear relationship.
When working with linear regression, our main goal is to find the best fit line, which means the
error between the predicted values and actual values should be minimized. The best fit line will
have the least error.
The different values of the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best fit line. To calculate
this, we use a cost function.
Cost function:
o The different values for weights or coefficient of lines (a0, a1) gives the different line of
regression, and the cost function is used to estimate the values of the coefficient for the
best fit line.
o The cost function is used to optimize the regression coefficients or weights; it measures
how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps
the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. It can be
written as:
MSE = (1/N) * sum_{i=1..N} (yi - (a1*xi + a0))^2
Where:
N = total number of observations, yi = actual value of the i-th observation, and (a1*xi + a0) =
predicted value.
Residuals: The distance between an actual value and the predicted value is called a residual. If
the observed points are far from the regression line, the residuals will be high, and so the cost
function will be high. If the scatter points are close to the regression line, the residuals will be
small, and hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o This is done by randomly selecting initial coefficient values and then iteratively
updating them to reach the minimum of the cost function.
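A minimal sketch of gradient descent for the MSE cost of Y = a1*X + a0 is given below; the learning rate, iteration count, and data values are arbitrary illustrative choices.

```python
# Gradient descent for simple linear regression with MSE cost.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])

a0, a1 = 0.0, 0.0     # initial coefficient values
lr, n = 0.01, len(X)  # learning rate, number of points

for _ in range(10_000):
    pred = a1 * X + a0
    error = pred - Y
    # Gradients of MSE = (1/n) * sum((pred - Y)^2) w.r.t. a0 and a1
    grad_a0 = (2.0 / n) * np.sum(error)
    grad_a1 = (2.0 / n) * np.sum(error * X)
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(f"a1 = {a1:.2f}, a0 = {a0:.2f}")  # approaches the least squares fit
```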
Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be achieved
by the below method:
1. R-squared method:
R-squared is a statistical method that determines the goodness of fit. It measures the strength of
the relationship between the dependent and independent variables on a scale of 0 to 100%. A
high R-squared value means less difference between the predicted and actual values, and hence
a better model. It is also known as the coefficient of determination.
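A quick sketch of the R-squared calculation (R² = 1 - SS_res / SS_tot) is shown below; the predicted values are made up for illustration.

```python
# R-squared: 1 - (sum of squared residuals) / (total sum of squares).
import numpy as np

Y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])       # actual values
Y_pred = np.array([30.2, 35.1, 40.0, 45.0, 49.7])  # model predictions (illustrative)

ss_res = np.sum((Y - Y_pred) ** 2)      # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)    # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")    # closer to 1 means a better fit
```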
Below are some important assumptions of Linear Regression. These are formal checks while
building a Linear Regression model, which ensure we get the best possible result from the
given dataset:
o A linear relationship should exist between the target and predictor variables.
o There should be little or no multicollinearity between the independent variables.
o The error terms should have constant variance (homoscedasticity).
o The error terms should be normally distributed.
o There should be no autocorrelation in the error terms.
Least Squares Method:
The least squares method is a form of mathematical regression analysis used to determine
the line of best fit for a set of data, providing a visual demonstration of the relationship between
the data points. Each point of data represents the relationship between a known independent
variable and an unknown dependent variable. This method is commonly used by statisticians
and traders who want to identify trading opportunities and trends.
The least squares method is a form of regression analysis that provides the overall rationale for
the placement of the line of best fit among the data points being studied. It begins with a set of
data points using two variables, which are plotted on a graph along the x- and y-axis. Traders
and analysts can use this as a tool to pinpoint bullish and bearish trends in the market along with
potential trading opportunities.
The most common application of this method is sometimes referred to as linear or ordinary
least squares. It aims to create a straight line that minimizes the sum of the squares of the errors
generated by the results of the associated equations, such as the squared residuals resulting from
differences between the observed values and the values anticipated by the model.
Advantages
One of the main benefits of using this method is that it is easy to apply and understand. That's
because it only uses two variables (one that is shown along the x-axis and the other on the y-
axis) while highlighting the best relationship between them.
Investors and analysts can use the least squares method by analyzing past performance and
making predictions about future trends in the economy and stock markets. As such, it can be
used as a decision-making tool.
Disadvantages
The primary disadvantage of the least squares method lies in the data used. It can only highlight
the relationship between two variables. As such, it doesn't take any others into account. And if
there are any outliers, the results become skewed.
Another problem with this method is that the data must be evenly distributed. If this isn't the
case, the results may not be reliable.
Assumptions for Multiple Linear Regression (MLR):
o For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be of continuous or categorical form.
o Each feature variable must model a linear relationship with the dependent variable.
o MLR tries to fit a regression line through a multidimensional space of data points.
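As a rough sketch, the snippet below fits a multiple linear regression model on two illustrative feature columns; the data values and feature meanings are assumptions made for the example.

```python
# Multiple linear regression: a continuous target predicted from two features.
import numpy as np
from sklearn.linear_model import LinearRegression

# Two predictor variables per row, e.g. [years_experience, education_level]
X = np.array([[1, 2], [2, 2], [3, 3], [4, 3], [5, 4]])
y = np.array([30.0, 34.0, 42.0, 46.0, 55.0])  # continuous target (illustrative)

mlr = LinearRegression().fit(X, y)
print("coefficients:", mlr.coef_, "intercept:", mlr.intercept_)
print("prediction for [6, 4]:", mlr.predict([[6, 4]]))
```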
Logistic Regression:
Logistic regression starts from the linear equation y = b0 + b1*x and transforms it so that the
output is a valid probability.
o In Logistic Regression, y can be between 0 and 1 only, so let's form the odds by dividing
y by (1 - y):
y / (1 - y)   (0 for y = 0, and infinity for y = 1)
o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the
equation, it becomes:
log[y / (1 - y)] = b0 + b1*x
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic Regression, there can be only two possible types of the
dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of
the dependent variable, such as "Low", "Medium", or "High".
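A small sketch of the two functions behind the derivation above is given below: the sigmoid maps any real number into (0, 1), and the log-odds (logit) is its inverse; the test value is arbitrary.

```python
# Sigmoid and logit: the building blocks of logistic regression.
import numpy as np

def sigmoid(z):
    # Maps any real number z to the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def logit(y):
    # Inverse of the sigmoid: maps (0, 1) back to (-infinity, +infinity)
    return np.log(y / (1.0 - y))

z = 0.75
y = sigmoid(z)
print(y)         # a probability between 0 and 1
print(logit(y))  # recovers 0.75
```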