Unit II Data Science Notes
FREQUENCY DISTRIBUTIONS:
Frequency Distribution is a tool in statistics that helps us organize the data and also
helps us reach meaningful conclusions. It tells us how often any specific values occur in
the dataset.
A frequency distribution represents the pattern of how frequently each value of a variable
appears in a dataset. It shows the number of occurrences for each possible value within the
dataset.
Let’s learn about Frequency Distribution including its definition, graphs, solved
examples, and frequency distribution table in detail.
Frequency Polygon: Connects the midpoints of class frequencies using lines, similar to a histogram but without bars. It is useful for comparing various datasets.
Class Interval Frequency
0-20 6
20-40 12
40-60 22
60-80 15
80-100 5
Number of Cattle Number of Families
10 – 20 5
20 – 30 12
30 – 40 8
40 – 50 15
50 – 60 20
In the above table, there are two columns: the first column represents the number of cattle, and the second column represents the number of families who own the associated number of cattle. As the first column is grouped into intervals of a fixed length, this table is an example of a Grouped Frequency Distribution.
Frequency Distribution Table for Ungrouped Data
An ungrouped frequency distribution table is a statistical table that organizes individual data values along with their corresponding frequencies, rather than grouping them into class intervals; it is used for ungrouped data.
For example, consider the number of vowels in any given paragraph.
Vowel Frequency
a 7
e 10
i 7
o 6
u 3
In the above table, the two columns represent the list of vowels and their frequencies in the given paragraph. As the first column lists individual values rather than intervals, this table is an example of an Ungrouped Frequency Distribution.
Types of Frequency Distribution
There are four types of frequency distribution:
Grouped Frequency Distribution
Ungrouped Frequency Distribution
Relative Frequency Distribution
Cumulative Frequency Distribution
Grouped Frequency Distribution
In a Grouped Frequency Distribution, observations are divided between different intervals known as class intervals, and the frequencies are counted for each class interval. This type of frequency distribution is mostly used when the data set is very large.
Example: Make the Frequency Distribution Table for the ungrouped data given as
follows:
23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29, 10, 39, 42, 27, 17, 45, 52,
31, 36, 39, 38, 43, 46, 32, 37, 25
Solution:
As the observations lie between 10 and 57, we can choose the class intervals 10-20, 20-30, 30-40, 40-50, and 50-60. These class intervals cover all the observations, and we can count the frequency of observations falling in each interval.
Thus, the Frequency Distribution Table for the given data is as follows:
Class Interval Frequency
10 – 20 5
20 – 30 8
30 – 40 11
40 – 50 6
50 – 60 3
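As a quick illustration, the grouped frequency table above can be reproduced in Python. This is a minimal sketch, assuming pandas is available:

import pandas as pd

# Observations from the example above
data = [23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29,
        10, 39, 42, 27, 17, 45, 52, 31, 36, 39, 38, 43, 46, 32, 37, 25]

# Class intervals 10-20, 20-30, 30-40, 40-50, 50-60 (upper bound excluded)
bins = [10, 20, 30, 40, 50, 60]
intervals = pd.cut(data, bins=bins, right=False)

# Count how many observations fall in each class interval
# -> [10,20): 5, [20,30): 8, [30,40): 11, [40,50): 6, [50,60): 3
freq_table = pd.Series(intervals).value_counts().sort_index()
print(freq_table)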
Value Frequency
10 4
15 3
20 2
25 3
30 2
Example: Make the Relative Frequency Distribution Table for data whose class frequencies are 5, 10, 20, 10, and 5.
Solution:
To create the Relative Frequency Distribution table, we divide each class frequency by the total number of observations, which is 5 + 10 + 20 + 10 + 5 = 50. Thus the Relative Frequency Distribution table is given as follows:
Score Range Frequency Relative Frequency
… 5 0.10
… 10 0.20
… 20 0.40
… 10 0.20
… 5 0.10
Total 50 1.00
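A minimal Python sketch of the relative-frequency calculation (each class frequency divided by the total of 50):

# Class frequencies from the example above
frequencies = [5, 10, 20, 10, 5]
total = sum(frequencies)  # 50

# Relative frequency = class frequency / total number of observations
relative_frequencies = [f / total for f in frequencies]
print(relative_frequencies)        # [0.1, 0.2, 0.4, 0.2, 0.1]
print(sum(relative_frequencies))   # 1.0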
Example: The runs scored by Virat Kohli in a series of matches are given below. Construct the cumulative frequency distributions for this data:
56 63 70 49 33
0 8 14 39 86
92 88 70 56 50
57 45 42 12 39
Solution:
Since there are a lot of distinct values, we'll express this in the form of a grouped frequency distribution with intervals like 0-10, 10-20, and so on. First, let's represent the data in the form of a grouped frequency distribution.
Runs Frequency
0-10 2
10-20 2
20-30 1
30-40 4
40-50 4
50-60 5
60-70 1
70-80 3
80-90 2
90-100 1
Runs scored by Virat Kohli Cumulative Frequency
Less than 10 2
Less than 20 4
Less than 30 5
Less than 40 9
Less than 50 13
Less than 60 18
Less than 70 19
Less than 80 22
Less than 90 24
Less than 100 25
This table represents the cumulative frequency distribution of the less-than type.
Runs scored by Virat Kohli Cumulative Frequency
More than 0 25
More than 10 23
More than 20 21
More than 30 20
More than 40 16
More than 50 12
More than 60 7
More than 70 6
More than 80 3
More than 90 1
This table represents the cumulative frequency distribution of the more-than type.
We can plot both types of cumulative frequency distributions to obtain the Cumulative Frequency Curve.
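The less-than and more-than cumulative frequencies above can be reproduced from the grouped frequencies with a short Python sketch using only the standard library:

from itertools import accumulate

# Grouped frequencies for the runs data (intervals 0-10, 10-20, ..., 90-100)
frequencies = [2, 2, 1, 4, 4, 5, 1, 3, 2, 1]

# "Less than" type: running total from the first class onward
less_than = list(accumulate(frequencies))
print(less_than)   # [2, 4, 5, 9, 13, 18, 19, 22, 24, 25]

# "More than" type: running total from the last class backwards
more_than = list(accumulate(frequencies[::-1]))[::-1]
print(more_than)   # [25, 23, 21, 20, 16, 12, 7, 6, 3, 1]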
Frequency Distribution Curve
A frequency distribution curve, also known as a frequency curve, is a graphical
representation of a data set’s frequency distribution. It is used to visualize the
distribution and frequency of values or observations within a dataset.
Value Frequency
1 2
2 6
3 2
4 4
Total 14
Example 4: The table below gives the values of temperature recorded in Hyderabad
for 25 days in summer. Represent the data in the form of less-than-type cumulative
frequency distribution:
37 34 36 27 22
25 25 24 26 28
30 31 29 28 30
32 31 28 27 30
30 32 35 34 29
Solution:
Since there are many distinct values here, we will use a grouped frequency distribution. Let's take the intervals 20-25, 25-30, and 30-40. The frequency distribution table can be made by counting the number of values lying in each of these intervals.
Temperature Number of Days
20-25 2
25-30 10
30-40 13
This is the grouped frequency distribution table. It can be converted into cumulative
frequency distribution by adding the previous values.
Temperature Number of Days
Less than 25 2
Less than 30 12
Less than 40 25
Example 5: Make a Frequency Distribution Table as well as the curve for the data:
{45, 22, 37, 18, 56, 33, 42, 29, 51, 27, 39, 14, 61, 19, 44, 25, 58, 36, 48, 30, 53, 41, 28, 35,
47, 21, 32, 49, 16, 52, 26, 38, 57, 31, 59, 20, 43, 24, 55, 17, 50, 23, 34, 60, 46, 13, 40, 54,
15, 62}
Solution:
To create the frequency distribution table for given data, let’s arrange the data in
ascending order as follows:
{13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62}
Now, we can count the observations for intervals: 10-20, 20-30, 30-40, 40-50, 50-60 and
60-70.
Interval Frequency
10 – 20 7
20 – 30 10
30 – 40 10
40 – 50 10
50 – 60 10
60 – 70 3
From this data, we can plot the Frequency Distribution Curve as follows:
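A minimal matplotlib sketch of the frequency distribution curve for this table, joining the class midpoints (assumed midpoints 15, 25, 35, 45, 55, 65):

import matplotlib.pyplot as plt

# Class midpoints and frequencies from the table above
midpoints = [15, 25, 35, 45, 55, 65]
frequencies = [7, 10, 10, 10, 10, 3]

# Join the midpoints with straight lines to approximate the frequency curve
plt.plot(midpoints, frequencies, marker='o')
plt.xlabel('Class interval midpoint')
plt.ylabel('Frequency')
plt.title('Frequency Distribution Curve')
plt.show()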
OUTLIERS
Outliers are extreme values that differ from most other data points in a dataset. They can
have a big impact on your statistical analyses and skew the results of any hypothesis tests.
It’s important to carefully identify potential outliers in your dataset and deal with them in an
appropriate manner for accurate results.
There are several ways to identify outliers:
1. Sorting method
2. Using visualizations
3. Interquartile range method
Sorting method
You can sort quantitative variables from low to high and scan for extremely low or extremely
high values. Flag any extreme values that you find.
This is a simple way to check whether you need to investigate certain data points before using
more sophisticated methods.
Example: Sorting method. Your dataset for a pilot experiment consists of 8 values.
180 156 9 176 163 1827 166 171
You sort the values from low to high and scan for extreme values.
9 156 163 166 171 176 180 1827
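A one-line Python illustration of the sorting method on the dataset above:

data = [180, 156, 9, 176, 163, 1827, 166, 171]
print(sorted(data))   # [9, 156, 163, 166, 171, 176, 180, 1827]
# the smallest (9) and largest (1827) values stand out as potential outliers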
Using visualizations
You can use software to visualize your data with a box plot, or a box-and-whisker plot, so
you can see the data distribution at a glance. This type of chart highlights minimum and
maximum values (the range), the median, and the interquartile range for your data.
Many computer programs highlight outliers on a box plot with an asterisk; these values lie outside the whiskers of the plot.
This method is helpful if you have a few values on the extreme ends of your dataset, but you
aren’t sure whether any of them might count as outliers.
Interquartile range method
The interquartile range (IQR) method flags values that lie unusually far from the middle 50% of the data. Here we'll use the exclusive method for identifying Q1 and Q3, which means we exclude the median from our calculations.
Q1 is the value in the middle of the first half of your dataset, excluding the median. The first quartile value is 26.
22 24 26 28 29
Your Q3 value is in the middle of the second half of your dataset, excluding the median. The
third quartile value is 41.
35 37 41 53 64
IQR = Q3 – Q1
Q1 = 26
Q3 = 41
IQR = 41 – 26 = 15
Values more than 1.5 × IQR below Q1 or above Q3 are usually flagged as outliers. Here the fences are 26 – 22.5 = 3.5 and 41 + 22.5 = 63.5, so any value below 3.5 or above 63.5 (such as 64 in the data above) counts as an outlier.
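A Python sketch of the IQR method. The full dataset here is hypothetical (its two halves are chosen to match the worked example, with an assumed median value of 31), and the 1.5 × IQR fences are the usual rule of thumb:

import statistics

# Hypothetical full data set: the two halves match the example above
# (first half: 22 24 26 28 29, second half: 35 37 41 53 64); 31 is an assumed median
data = [22, 24, 26, 28, 29, 31, 35, 37, 41, 53, 64]

data.sort()
n = len(data)
lower_half = data[: n // 2]          # exclude the median for odd n
upper_half = data[(n + 1) // 2 :]

q1 = statistics.median(lower_half)   # 26
q3 = statistics.median(upper_half)   # 41
iqr = q3 - q1                        # 15

# Rule of thumb: values beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr         # 3.5
upper_fence = q3 + 1.5 * iqr         # 63.5
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)                      # [64]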
Once you have identified a potential outlier, consider the following questions before deciding what to do with it:
Does the outlier line up with other measurements taken from the same participant?
Is this data point completely impossible, or can it reasonably come from your population?
What's the most likely source of the outlier? Is it a natural variation or an error?
In general, you should try to accept outliers as much as possible unless it’s clear that they
represent errors or bad data.
Retain outliers
Just like with missing values, the most conservative option is to keep outliers in your dataset.
Keeping outliers is usually the better option when you’re not sure if they are errors.
With a large sample, outliers are expected and more likely to occur. But each outlier has less
of an effect on your results when your sample is large enough. The central
tendency and variability of your data won’t be as affected by a couple of extreme values
when you have a large number of values.
If you have a small dataset, you may also want to retain as much data as possible to make
sure you have enough statistical power. If your dataset ends up containing many outliers, you
may need to use a statistical test that’s more robust to them. Non-parametric statistical tests
perform better for these data.
Remove outliers
Outlier removal means deleting extreme values from your dataset before you
perform statistical analyses. You aim to delete any dirty data while retaining true extreme
values.
It’s a tricky procedure because it’s often impossible to tell the two types apart for sure.
Deleting true outliers may lead to a biased dataset and an inaccurate conclusion.
For this reason, you should only remove outliers if you have legitimate reasons for doing so.
It’s important to document each outlier you remove and your reasons so that other researchers
can follow your procedures.
VARIABILITY FOR QUALITATIVE AND RANKED DATA
A measure of variability is a value that indicates how varied, or spread out, a data set is. The
simplest measure of variability is the range. It measures the variation of the data by finding
the difference in the maximum data value and the minimum data value. The obvious flaw
with this type of measurement is that it only takes the most extreme data values into account
and is therefore very sensitive to outliers. As we will see, other measures like the interquartile
range, standard deviation and variance use the entire data set, so they are not as sensitive.
Formulas for Measures of Variability
Variance: σ² = Σ(xᵢ − μ)² / n, where μ is the population mean, n is the number of data values, and the sum runs over all n values.
Example: Consider the data set {70, 74, 62, 68, 65, 70, 69, 63, 67, 66}, whose mean is μ = 67.4. The squared deviations are tabulated below:
x x − 67.4 (x − 67.4)²
70 2.6 6.76
74 6.6 43.56
62 -5.4 29.16
68 0.6 0.36
65 -2.4 5.76
70 2.6 6.76
69 1.6 2.56
63 -4.4 19.36
67 -0.4 0.16
66 -1.4 1.96
If we sum the rightmost column, we find Σ(xᵢ − μ)² = 116.4.
Thus, the variance is σ² = 116.4 / 10 = 11.64.
Notice that the standard deviation is simply the square root of the variance.
Thus, the standard deviation is σ = √11.64 ≈ 3.412.
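The variance and standard deviation above can be checked with a short Python sketch using the standard-library statistics module (the data values are taken from the x column of the table):

import statistics

# Data from the table above (mean = 67.4)
data = [70, 74, 62, 68, 65, 70, 69, 63, 67, 66]

mu = statistics.mean(data)                               # 67.4
variance = sum((x - mu) ** 2 for x in data) / len(data)  # population variance: 11.64
std_dev = variance ** 0.5                                # ~3.412
print(mu, variance, round(std_dev, 3))

# statistics.pvariance / pstdev give the same population values
print(statistics.pvariance(data), statistics.pstdev(data))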
NORMAL DISTRIBUTIONS
In a normal distribution, data is symmetrically distributed with no skew. When plotted on a
graph, the data follows a bell shape, with most values clustering around a central region and
tapering off as they go further away from the center.
Normal distributions are also called Gaussian distributions or bell curves because of their shape.
Normal distributions have key characteristics that are easy to spot in graphs:
The mean, median and mode are exactly the same.
The distribution is symmetric about the mean—half the values fall below the mean
and half above the mean.
The distribution can be described by two values: the mean and the standard deviation.
The mean is the location parameter while the standard deviation is the scale parameter.
The mean determines where the peak of the curve is centered. Increasing the mean moves the
curve right, while decreasing it moves the curve left.
The standard deviation stretches or squeezes the curve. A small standard deviation results in a
narrow curve, while a large standard deviation leads to a wide curve.
Empirical rule
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values lie in a
normal distribution:
Around 68% of values are within 1 standard deviation from the mean.
Around 95% of values are within 2 standard deviations from the mean.
Around 99.7% of values are within 3 standard deviations from the mean.
Example: Using the empirical rule in a normal distribution. You collect SAT scores from students in a new test preparation course. The data follows a normal distribution with a mean score (M) of 1150 and a standard deviation (SD) of 150.
Following the empirical rule:
Around 68% of scores are between 1,000 and 1,300, 1 standard deviation above and
below the mean.
Around 95% of scores are between 850 and 1,450, 2 standard deviations above and
below the mean.
Around 99.7% of scores are between 700 and 1,600, 3 standard deviations above and
below the mean.
The empirical rule is a quick way to get an overview of your data and check for any outliers
or extreme values that don’t follow this pattern.
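A minimal Python check of the empirical rule for the SAT example (mean 1150, SD 150), using the standard-library NormalDist class:

from statistics import NormalDist

# SAT example from above: mean 1150, standard deviation 150
dist = NormalDist(mu=1150, sigma=150)

# Probability mass within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = dist.cdf(1150 + k * 150) - dist.cdf(1150 - k * 150)
    print(f"within {k} SD: {p:.3f}")
# within 1 SD: 0.683, within 2 SD: 0.954, within 3 SD: 0.997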
If data from small samples do not closely follow this pattern, then other distributions like
the t-distribution may be more appropriate. Once you identify the distribution of your
variable, you can apply appropriate statistical tests.
Once you have the mean and standard deviation of a normal distribution, you can fit a normal
curve to your data using a probability density function.
In a probability density function, the area under the curve tells you probability. The normal
distribution is a probability distribution, so the total area under the curve is always 1 or
100%.
The formula for the normal probability density function looks fairly complicated. But to use
it, you only need to know the population mean and standard deviation.
For any value of x, you can plug in the mean and standard deviation into the formula to find
the probability density of the variable taking on that value of x.
Normal probability density formula:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
where:
f(x) = probability density
x = value of the variable
μ = mean
σ = standard deviation
σ² = variance
While individual observations from normal distributions are referred to as x, they are referred
to as z in the z-distribution. Every normal distribution can be converted to the standard
normal distribution by turning the individual values into z-scores.
Z-scores tell you how many standard deviations away from the mean each value lies.
You only need to know the mean and standard deviation of your distribution to find the z-
score of a value.
Z-score formula:
z = (x − μ) / σ
where:
x = individual value
μ = mean
σ = standard deviation
We convert normal distributions into the standard normal distribution for several reasons, such as finding the probability of an observation falling above or below a given value, and comparing scores that come from distributions with different means and standard deviations.
Example: Suppose you want to know the probability that an SAT score in your sample is above 1380. With M = 1150 and SD = 150, the z-score for 1380 is (1380 − 1150) / 150 ≈ 1.53.
For a z-score of 1.53, the p-value is 0.937. This is the probability of SAT scores being 1380 or less (93.7%), and it is the area under the curve to the left of the shaded area.
To find the shaded area, you subtract 0.937 from 1, which is the total area under the curve.
Probability of x > 1380 = 1 – 0.937 = 0.063
That means it is likely that only 6.3% of SAT scores in your sample exceed 1380.
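The same calculation can be done in Python, again using NormalDist; this sketch computes the z-score and tail probability for a score of 1380:

from statistics import NormalDist

mean, sd = 1150, 150
x = 1380

z = (x - mean) / sd                  # ~1.53
p_below = NormalDist().cdf(z)        # ~0.937
p_above = 1 - p_below                # ~0.063
print(round(z, 2), round(p_below, 3), round(p_above, 3))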
CORRELATION
Correlation analysis is a statistical technique for determining the strength of a link between
two variables. It is used to detect patterns and trends in data and to forecast future
occurrences.
Consider a problem in which several different factors must be taken into account before drawing optimal conclusions. Correlation explains how these variables depend on each other.
Correlation quantifies how strong the relationship between two variables
is. A higher value of the correlation coefficient implies a stronger association.
The sign of the correlation coefficient indicates the direction of the
relationship between variables. It can be either positive, negative, or zero.
The Pearson correlation coefficient is the most often used metric of correlation. It expresses
the linear relationship between two variables in numerical terms. The Pearson correlation
coefficient, written as “r,” is as follows:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
where,
r: correlation coefficient
xᵢ: i-th value of the first dataset X
x̄: mean of the first dataset X
yᵢ: i-th value of the second dataset Y
ȳ: mean of the second dataset Y
The correlation coefficient , denoted by “r”, ranges between -1 and 1.
r = -1 indicates a perfect negative correlation.
r = 0 indicates no linear correlation between the variables.
r = 1 indicates a perfect positive correlation.
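A minimal Python sketch of the Pearson formula on hypothetical paired data, compared with NumPy's built-in np.corrcoef:

import numpy as np

# Hypothetical paired observations
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Pearson correlation coefficient from the formula above
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(round(r, 3))                 # manual computation (~0.775)
print(np.corrcoef(x, y)[0, 1])     # same value from NumPy's built-in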
Types of Correlation
There are three types of correlation: positive correlation, negative correlation, and zero (no) correlation.
SCATTER PLOTS
A scatter plot, also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter
diagram, is a type of plot or mathematical diagram using Cartesian coordinates to display
values for typically two variables for a set of data.
A scatter plot is a diagram where each value in the data set is represented by a dot.
The Matplotlib module has a method for drawing scatter plots; it needs two arrays of the same length, one for the values of the x-axis and one for the values of the y-axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Example
import sys
import matplotlib
matplotlib.use('Agg')   # non-interactive backend, so the plot can be rendered without a display
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)        # draw one dot per (x, y) pair
plt.show()

# Two lines to make our compiler able to draw:
plt.savefig(sys.stdout.buffer)
sys.stdout.flush()
Result:
REGRESSION
Regression is a statistical approach used to analyze the relationship between a dependent
variable (target variable) and one or more independent variables (predictor variables). The
objective is to determine the most suitable function that characterizes the connection
between these variables.
It seeks to find the best-fitting model, which can be utilized to make predictions or draw
conclusions.
Regression in Machine Learning
It is a supervised machine learning technique, used to predict the value of the dependent
variable for new, unseen data. It models the relationship between the input features and the
target variable, allowing for the estimation or prediction of numerical values.
Regression analysis is used when the output variable is a real or continuous value, such as "salary" or "weight". Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane passing through the points.
Terminologies Related to Regression Analysis:
Response Variable: The primary factor to predict or understand in
regression, also known as the dependent variable or target variable.
Predictor Variable: Factors influencing the response variable, used to
predict its values; also called independent variables.
Outliers: Observations with significantly low or high values compared to
others, potentially impacting results and best avoided.
Multicollinearity: High correlation among independent variables, which
can complicate the ranking of influential variables.
Underfitting and Overfitting: Overfitting occurs when an algorithm performs well on training data but poorly on testing data, while underfitting indicates poor performance on both datasets.
Regression Types
The main types of regression are:
Simple Regression
o Used to predict a continuous dependent variable based
on a single independent variable.
o Simple linear regression should be used when there is
only a single independent variable.
Multiple Regression
o Used to predict a continuous dependent variable based
on multiple independent variables.
o Multiple linear regression should be used when there
are multiple independent variables.
Non-Linear Regression
o Relationship between the dependent variable and
independent variable(s) follows a nonlinear pattern.
o Provides flexibility in modeling a wide range of
functional forms.
Regression Algorithms
There are many different types of regression algorithms, but some of the most common
include:
Linear Regression
Linear regression is one of the simplest and most widely used statistical models. This
assumes that there is a linear relationship between the independent and dependent variables.
This means that the change in the dependent variable is proportional to the change in the
independent variables.
Polynomial Regression
Polynomial regression is used to model nonlinear relationships between the dependent
variable and the independent variables. It adds polynomial terms to the linear regression
model to capture more complex relationships.
Support Vector Regression (SVR)
Support vector regression (SVR) is a type of regression algorithm that is based on the
support vector machine (SVM) algorithm. SVM is a type of algorithm that is used for
classification tasks, but it can also be used for regression tasks. SVR works by finding a
hyperplane that minimizes the sum of the squared residuals between the predicted and
actual values.
Decision Tree Regression
Decision tree regression is a type of regression algorithm that builds a decision tree to
predict the target value. A decision tree is a tree-like structure that consists of nodes and
branches. Each node represents a decision, and each branch represents the outcome of that
decision. The goal of decision tree regression is to build a tree that can accurately predict
the target value for new data points.
Random Forest Regression
Random forest regression is an ensemble method that combines multiple decision trees to
predict the target value. Ensemble methods are a type of machine learning algorithm that
combines multiple models to improve the performance of the overall model. Random forest
regression works by building a large number of decision trees, each of which is trained on a
different subset of the training data. The final prediction is made by averaging the
predictions of all of the trees.
The following sketch fits and plots a simple linear regression. It assumes df is a pandas DataFrame of housing data with 'price' and 'lotsize' columns; the train/test split and model fit shown here are filled in for completeness.

# Assumes `df` is a pandas DataFrame of housing data with 'price' and 'lotsize' columns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Y = df['price']
X = df['lotsize']
X = X.values.reshape(len(X), 1)
Y = Y.values.reshape(len(Y), 1)

# Split into training and test sets and fit a simple linear regression
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
regr = LinearRegression()
regr.fit(X_train, Y_train)

# Plot outputs
plt.scatter(X_test, Y_test, color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.xticks(())
plt.yticks(())

# Plot the fitted regression line over the test data
plt.plot(X_test, regr.predict(X_test), color='red', linewidth=3)
plt.show()
OUTPUT
REGRESSION LINE
Regression Line is defined as a statistical concept that facilitates and predicts the
relationship between two or more variables. A regression line is a straight line that reflects
the best-fit connection in a dataset between independent and dependent variables. The
independent variable is generally shown on the X-axis and the dependent variable is shown
on the Y-axis. The main purpose of developing a regression line is to predict or estimate the
value of the dependent variable based on the values of one or more independent variables.
Equation of Regression Line
The equation of a simple linear regression line is given by:
Y = a + bX + ε
Y is the dependent variable
X is the independent variable
a is the y-intercept, which represents the value of Y when X is 0.
b is the slope, which represents the change in Y for a unit change in X
ε is residual error.
Examples of Regression Line
Example 1:
A function facilitates the calculation of marks scored by the students when the number of
hours studied by them is given. The slope and y-intercept of the given function are 5 and 50
respectively. Using this information, form a regression line equation.
Solution:
In the case of calculating the marks scored by students when the number of hours each of them studied is given, marks will be the dependent variable (i.e. marks will be represented by Y) and the number of hours studied will be the independent variable (i.e. the number of hours studied by the students will be represented by X). Now, the general linear regression equation is Y = a + bX.
We have been given that the y-intercept is 50, (i.e., a = 50) and the respective slope is 5,
(i.e. b = 5).
Therefore, the required equation of regression line will be,
Y = 50 + 5X + ε
Example 2:
In continuation with the above example, the figures of some students are given as follows:
Student 1: Studied for 2 hours and scored 60 marks.
Student 2: Studied for 3 hours and scored 65 marks.
What will be the marks scored by the 4th student in case he/she studies for 5 hours?
Solution:
The required equation of regression line as calculated in previous example is,
Y = 50 + 5X
In the case of the 4th student, who studies for 5 hours (X = 5), the marks scored will be calculated as:
Y = 50 + 5X.
Y = 50 + 5(5)
Y = 75 Marks
Types of Regression Lines
1. Linear Regression Line: A linear regression line is used when there is a linear relationship between the dependent variable and at least one independent variable. The equation of a simple linear regression line is typically Y = a + bX + ε, where Y is the dependent variable, X is the independent variable, a is the y-intercept, b is the slope, and ε is the error term.
2. Logistic Regression Line: Logistic regression is used when the dependent variable is
discrete. It models the probability of a binary outcome using a logistic function. The
equation is typically expressed as the log-odds of the probability.
3. Polynomial Regression Line: Polynomial regression is used when the relationship
between the dependent and independent variables is best represented by a polynomial
equation. The equation is Y = aX² + bX + c, or even higher-order polynomial equations.
4. Ridge and Lasso Regression: These are used for regularisation in linear regression.
Ridge and Lasso add penalty terms to the linear regression equation to prevent overfitting
and perform feature selection.
5. Non-Linear Regression Line: For situations where the relationship between variables is not linear, a non-linear regression line must be used to define the relationship.
6. Multiple Regression Line: This involves multiple independent variables to predict a dependent variable. It is an extension of linear regression.
7. Exponential Regression Line: Exponential Regression Line is formed when the data
follows an exponential growth or decay pattern. It is often seen in fields like biology,
finance, and physics.
8. Piecewise Regression Line: In this approach, the data is divided into segments, and a different linear or non-linear model is applied to each segment.
9. Time Series Regression Line: This approach is used to deal with time-series data, and
models how the dependent variable changes over time.
10. Power Regression Line: This type of regression line is used when one variable
increases at a power of another. It can be applied to situations where exponential growth
does not fit.
Applications of Regression Line
Regression lines have numerous uses in a variety of domains, including:
1. Economics: Regression analysis is used in economics to anticipate economic trends,
evaluate consumer behaviour, and identify factors influencing economic variables such as
GDP, inflation, and unemployment.
2. Finance: Regression analysis is used in portfolio management to estimate risk and return
of investments. It aids in the prediction of stock prices, bond yields, and other financial
measures.
3. Medicine: Regression analysis is used in the medical field to investigate the link
between variables such as dosage and patient response, as well as to predict patient
outcomes based on a variety of criteria.
4. Marketing: Regression analysis is used by marketers to understand the impact of
advertising, pricing, and other marketing initiatives on sales and customer behavior.
5. Environmental Science: Regression analysis is used by researchers to model the link
between environmental parameters (such as temperature and pollution levels) and their
impact on ecosystems.
Importance of Regression Line
The regression line holds immense importance for several reasons:
1. Error Analysis: Regression analysis provides a way to assess the goodness of fit of a
model. By examining residuals (the differences between observed and predicted values),
one can identify patterns and trends in the errors, which further helps in the improvement of
models.
2. Variable Selection: Regression analysis helps in the selection of relevant variables.
While having a large dataset with many potential predictors, regression analysis can
provide guidance in identifying which variables have a significant impact on the outcome,
enabling more efficient and parsimonious models.
3. Quality Control: In manufacturing and quality control processes, regression analysis
can be used to monitor and control product quality. By understanding the relationship
between input variables and product quality, manufacturers can make adjustments to
maintain or improve quality standards.
4. Forecasting: Regression models can be used for time series analysis and forecasting.
This is valuable in industries like retail, where understanding historical sales data can help
in predicting future sales, optimising inventory levels, and planning for seasonal demand.
5. Risk Assessment: In finance and insurance, regression analysis is crucial for assessing
and managing risk. It can help identify factors affecting investment returns, loan defaults,
or insurance claims, aiding in risk assessment and pricing.
6. Policy Evaluation: In social sciences and public policy, regression analysis is employed
to evaluate the impact of policy changes or interventions. By examining the relationship
between policy variables and relevant outcomes, researchers can assess the effectiveness of
different policies and inform decision-makers.
Statistical Significance of Regression Line
In statistical analysis, it is crucial to determine whether the relationship between the
independent and dependent variables is statistically significant. This is usually done using
hypothesis tests and confidence intervals. A small p-value associated with the slope ‘b’
suggests that the relationship is statistically significant.
Applications of Regression Line
1. Predictive Analysis: Used to predict future values based on past data.
2. Trend Analysis: Helps in identifying and analyzing trends over time.
3. Correlation Analysis: Determines the strength and direction of the relationship between variables.
4. Risk Management: Assists in assessing and managing risks in various
domains like finance and healthcare.
LEAST SQUARE REGRESSION LINE
Given a set of coordinates in the form of (X, Y), the task is to find the least squares regression line.
In statistics, Linear Regression is a linear approach to model the relationship between a
scalar response (or dependent variable), say Y, and one or more explanatory variables (or
independent variables), say X.
Regression Line: If our data shows a linear relationship between X and Y, then the straight line which best describes the relationship is the regression line. It is the straight line that lies as close as possible to all the points in the graph, i.e. the line that minimizes the sum of squared vertical distances from the data points.
EXAMPLE
Find the least squares regression line for the five-point data set
and verify that it fits the data better than the
line
Solution
In actual practice, computation of the regression line is done using a statistical computation package. In order to clarify the meaning of the formulas, we display the computations in tabular form.
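Since the original five-point data set is not reproduced above, here is a Python sketch of the least squares computation on hypothetical data, using the usual slope and intercept formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄:

# Hypothetical five-point data set (the original values are not shown above)
x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 10, 11]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least squares slope and intercept
b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
a = y_mean - b * x_mean
print(f"y-hat = {a:.2f} + {b:.2f}x")

# Sum of squared errors of the fitted line (smaller means a better fit)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
print(round(sse, 3))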
STANDARD ERROR OF ESTIMATE
Learning Objectives
1. Make judgments about the size of the standard error of the estimate from a scatter plot
2. Compute the standard error of the estimate based on errors of prediction
3. Compute the standard error using Pearson's correlation
4. Estimate the standard error of the estimate based on a sample
Figure 1 shows two regression examples. You can see that in Graph A, the points are closer
to the line than they are in Graph B. Therefore, the predictions in Graph A are more accurate
than in Graph B.
Figure 1. Regressions differing in accuracy of prediction.
The standard error of the estimate is a measure of the accuracy of predictions. Recall that the regression line is the line that minimizes the sum of squared deviations of prediction (also called the sum of squares error). The standard error of the estimate is closely related to this quantity and is defined below:
σest = √[ Σ(Y − Y′)² / N ]
where σest is the standard error of the estimate, Y is an actual score, Y′ is a predicted score, and N is the number of pairs of scores. The numerator is the sum of squared differences between the actual scores and the predicted scores.
Note the similarity of the formula for σest to the formula for σ.  It turns out that σest is
the standard deviation of the errors of prediction (each Y - Y' is an error of prediction).
Assume the data in Table 1 are the data from a population of five X, Y pairs.
Table 1. Example data.
X Y Y' Y-Y' (Y-Y')2
1.00 1.00 1.210 -0.210 0.044
2.00 2.00 1.635 0.365 0.133
3.00 1.30 2.060 -0.760 0.578
4.00 3.75 2.485 1.265 1.600
5.00 2.25 2.910 -0.660 0.436
Sum 15.00 10.30 10.30 0.000 2.791
The last column shows that the sum of the squared errors of prediction is 2.791. Therefore, the standard error of the estimate is σest = √(2.791 / 5) ≈ 0.747.
There is a version of the formula for the standard error in terms of Pearson's correlation:
σest = √[ (1 − ρ²) · SSY / N ]
where ρ is the population value of Pearson's correlation and SSY is the sum of squared deviations of Y from its mean, SSY = Σ(Y − μY)².
For the data in Table 1, μY = 2.06, SSY = 4.597 and ρ = 0.6268. Therefore, σest = √[ (1 − 0.6268²) × 4.597 / 5 ] ≈ 0.747, which agrees with the value computed above.
MULTIPLE REGRESSION
Multiple regression analysis is a statistical technique that analyzes the relationship between two or more variables and uses the information to estimate the value of the dependent variable. In multiple regression, the objective is to develop a model that relates a dependent variable y to more than one independent variable.
In linear regression, there is only one independent and dependent variable involved. But, in
the case of multiple regression, there will be a set of independent variables that helps us to
explain better or predict the dependent variable y.
The multiple regression model takes the form y = b0 + b1x1 + b2x2 + … + bkxk + ε, where x1, x2, …, xk are the k independent variables and y is the dependent variable.
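A minimal scikit-learn sketch of multiple regression on hypothetical data with two independent variables:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: two independent variables (x1, x2) and a response y
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([6, 7, 14, 15, 20], dtype=float)

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)          # b0
print(model.coef_)               # b1, b2
print(model.predict([[6, 4]]))   # estimate y for new values of x1 and x2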