Unit II Data Science Notes

UNIT II DESCRIPTIVE ANALYTICS

Frequency distributions – Outliers – Interpreting distributions – Graphs – Averages – Describing variability – Interquartile range – Variability for qualitative and ranked data – Normal distributions – z scores – Correlation – Scatter plots – Regression – Regression line – Least squares regression line – Standard error of estimate – Interpretation of r² – Multiple regression equations – Regression toward the mean

FREQUENCY DISTRIBUTIONS:
Frequency distribution is a tool in statistics that helps us organize data and reach meaningful conclusions. It tells us how often each specific value occurs in a dataset.
A frequency distribution represents the pattern of how frequently each value of a variable appears in a dataset: it shows the number of occurrences of each possible value within the dataset. The sections below cover its definition, graphs, tables, and solved examples.

Frequency Distribution Graphs

A frequency distribution can be represented with various graphs, such as a histogram, bar graph, frequency polygon, or pie chart. A brief description of each:

 Histogram: Represents the frequency of each interval of continuous data using bars of equal width. Use case: analysing the distribution of continuous data.

 Bar Graph: Represents the frequency of each category using bars of equal width; can also represent discrete data. Use case: comparing discrete data categories.

 Frequency Polygon: Connects the midpoints of class frequencies using lines; similar to a histogram but without bars. Use case: comparing several datasets.

 Pie Chart: A circular graph showing data as slices of a circle, indicating the proportional size of each slice relative to the whole dataset. Use case: showing the relative sizes of data portions.

Frequency Distribution Table


A frequency distribution table is a way to organize and present data in tabular form, summarizing a large dataset into a concise table. The table has two columns: one lists the data, either as ranges (class intervals) or as individual values, and the other shows the frequency of each interval or value.
For example, let’s say we have a dataset of students’ test scores in a class.
Test Score Frequency

0-20 6

20-40 12

40-60 22

60-80 15

80-100 5

Types of Frequency Distribution Table


Based on the analysis and categorization of the data, there are two types of frequency distribution tables:
 Grouped Frequency Distribution Table
 Ungrouped Frequency Distribution Table
Frequency Distribution Table for Grouped Data
A grouped frequency distribution table organizes the given data into intervals or groups, known as class intervals, and displays the frequency, i.e., the number of observations that fall within each interval.
For example, we can consider the table of the number of cattle owned by families in a
town.
Number of Cattle Number of Families

10 – 20 5

20 – 30 12

30 – 40 8

40 – 50 15

50 – 60 20

In the above table, there are two columns. The first column represents the number of cattle, and the second column represents the number of families who own the associated number of cattle. Since the first column groups the data into intervals of a fixed length, this table is an example of a grouped frequency distribution.
Frequency Distribution Table for Ungrouped Data
An ungrouped frequency distribution table is a statistical table that organizes individual data values along with their corresponding frequencies, instead of groups or class intervals; it is used for ungrouped data.
For example, consider the number of vowels in any given paragraph.
Vowel Frequency

a 7

e 10

i 7

o 6

u 3

In the above table, the two columns represent a list of vowels and their frequencies in a given paragraph. Since the first column lists individual values, this table is an example of an ungrouped frequency distribution.
Types of Frequency Distribution
There are four types of frequency distribution :
 Grouped Frequency Distribution
 Ungrouped Frequency Distribution
 Relative Frequency Distribution
 Cumulative Frequency Distribution
Grouped Frequency Distribution
In Grouped Frequency Distribution observations are divided between different
intervals known as class intervals and then their frequencies are counted for each class
interval. This Frequency Distribution is used mostly when the data set is very large.
Example: Make the Frequency Distribution Table for the ungrouped data given as
follows:
23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29, 10, 39, 42, 27, 17, 45, 52,
31, 36, 39, 38, 43, 46, 32, 37, 25
Solution:
Since the observations lie between 10 and 57, we can choose the class intervals 10-20, 20-30, 30-40, 40-50, and 50-60. These intervals cover all the observations, and we can count the frequency of observations in each interval.
Thus, the Frequency Distribution Table for the given data is as follows:
Class Interval Frequency

10 – 20 5

20 – 30 8

30 – 40 11

40 – 50 6

50 – 60 3
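The same grouped table can be produced programmatically. Below is a minimal sketch using NumPy's histogram function; the data and interval edges are taken from the example above:

import numpy as np

# Observations from the example above
data = [23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57,
        35, 29, 10, 39, 42, 27, 17, 45, 52, 31, 36, 39, 38, 43, 46,
        32, 37, 25]

# Class interval edges for 10-20, 20-30, ..., 50-60
edges = [10, 20, 30, 40, 50, 60]

# np.histogram counts how many observations fall into each interval
counts, _ = np.histogram(data, bins=edges)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo} - {hi}: {c}")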

Ungrouped Frequency Distribution


In Ungrouped Frequency Distribution, all distinct observations are mentioned and counted
individually. This Frequency Distribution is often used when the given dataset is small.
Example: Make the Frequency Distribution Table for the ungrouped data given as
follows:
10, 20, 15, 25, 30, 10, 15, 10, 25, 20, 15, 10, 30, 25
Solution:
The only distinct observations in the given data are 10, 15, 20, 25, and 30, each with its own frequency.
Thus the Frequency Distribution Table of the given data is as follows:
Value Frequency

10 4

15 3

20 2

25 3

30 2
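For ungrouped data, a tally of the distinct values is all that is needed. A small sketch using Python's standard-library Counter on the data above:

from collections import Counter

data = [10, 20, 15, 25, 30, 10, 15, 10, 25, 20, 15, 10, 30, 25]

# Counter tallies how many times each distinct value occurs
freq = Counter(data)

for value in sorted(freq):
    print(value, freq[value])   # 10 4, 15 3, 20 2, 25 3, 30 2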

Relative Frequency Distribution


This distribution displays the proportion or percentage of observations in each interval or
class. It is useful for comparing different data sets or for analyzing the distribution of data
within a set.
Relative Frequency is given by:
Relative Frequency = (Frequency of Event)/(Total Number of Events)
Example: Make the Relative Frequency Distribution Table for the following data:

Score Range 0-20 21-40 41-60 61-80 81-100

Frequency 5 10 20 10 5

Solution:
To create the relative frequency distribution table, we calculate the relative frequency for each class interval. The total frequency is 5 + 10 + 20 + 10 + 5 = 50, so the relative frequency distribution table is as follows:
Score Range Frequency Relative Frequency

0-20 5 5/50 = 0.10

21-40 10 10/50 = 0.20

41-60 20 20/50 = 0.40

61-80 10 10/50 = 0.20

81-100 5 5/50 = 0.10

Total 50 1.00
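Relative frequencies are simply each class frequency divided by the total number of events. A quick Python sketch using the frequencies from the table above:

frequencies = {"0-20": 5, "21-40": 10, "41-60": 20, "61-80": 10, "81-100": 5}

total = sum(frequencies.values())   # 50 events in total

for interval, f in frequencies.items():
    # Relative frequency = (frequency of event) / (total number of events)
    print(f"{interval}: {f}/{total} = {f / total:.2f}")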

Cumulative Frequency Distribution


Cumulative frequency is defined as the sum of all the frequencies in the previous
values or intervals up to the current one. The frequency distributions which represent the
frequency distributions using cumulative frequencies are called cumulative frequency
distributions. There are two types of cumulative frequency distributions:
 Less than Type: We sum the frequencies of all intervals up to and including the current one.
 More than Type: We sum the frequencies of the current interval and all intervals after it.
Example: The table below gives the runs scored by Virat Kohli in his last 25 T20 matches. Represent the data in the form of a less-than-type cumulative frequency distribution:
45 34 50 75 22

56 63 70 49 33

0 8 14 39 86

92 88 70 56 50

57 45 42 12 39

Solution:
Since there are many distinct values, we'll group the data into intervals such as 0-10, 10-20, and so on. First, let's represent the data as a grouped frequency distribution.
Runs Frequency

0-10 2

10-20 2

20-30 1

30-40 4

40-50 4

50-60 5

60-70 1

70-80 3

80-90 2

90-100 1

Now we will convert this frequency distribution into a cumulative frequency distribution by summing the frequency of the current interval and of all the previous intervals.
Runs scored by Virat Kohli Cumulative Frequency

Less than 10 2

Less than 20 4

Less than 30 5

Less than 40 9

Less than 50 13

Less than 60 18

Less than 70 19

Less than 80 22

Less than 90 24

Less than 100 25

This table represents the less-than-type cumulative frequency distribution. The same data can also be represented as a more-than-type distribution:
Runs scored by Virat Kohli Cumulative Frequency

More than 0 25

More than 10 23

More than 20 21

More than 30 20

More than 40 16

More than 50 12

More than 60 7

More than 70 6

More than 80 3

More than 90 1

This table represents the more-than-type cumulative frequency distribution. We can plot both types of cumulative frequency distribution to make the cumulative frequency curve.
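Both cumulative types can be computed mechanically from the grouped frequencies. A small NumPy sketch using the frequencies from the table above:

import numpy as np

counts = np.array([2, 2, 1, 4, 4, 5, 1, 3, 2, 1])   # frequencies for 0-10, ..., 90-100
edges = np.arange(0, 101, 10)

# Less-than type: running total from the lowest class upward
less_than = np.cumsum(counts)

# More-than type: total minus everything strictly below the current class
more_than = counts.sum() - np.concatenate(([0], less_than[:-1]))

for e, lt in zip(edges[1:], less_than):
    print(f"Less than {e}: {lt}")
for e, mt in zip(edges[:-1], more_than):
    print(f"More than {e}: {mt}")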
Frequency Distribution Curve

A frequency distribution curve, also known as a frequency curve, is a graphical representation of a dataset's frequency distribution. It is used to visualize the distribution and frequency of values or observations within a dataset.

Frequency Distribution Curve Types

 Normal Distribution: Symmetric and bell-shaped; data concentrated around the mean.

 Skewed Distribution: Not symmetric; can be positively skewed (right-tailed) or negatively skewed (left-tailed).

 Bimodal Distribution: Two distinct peaks or modes in the frequency distribution, suggesting data from different populations.

 Multimodal Distribution: More than two distinct peaks or modes in the frequency distribution.

 Uniform Distribution: All values or intervals have roughly the same frequency, resulting in a flat, constant distribution.

 Exponential Distribution: Rapid drop-off in frequency as values increase, resembling an exponential function.

 Log-Normal Distribution: The logarithm of the data follows a normal distribution; often used for multiplicative, positively skewed data.

Frequency Distribution Examples


Example 1: Suppose we have a series with a mean of 20 and a variance of 100. Find the coefficient of variation.
Solution:
We know the formula for the coefficient of variation:
C.V. = (σ / x̄) × 100
Given mean x̄ = 20 and variance σ² = 100, so σ = √100 = 10.
Substituting the values in the formula:
C.V. = (10 / 20) × 100 = 50
Example 2: Given two series with Coefficients of Variation 70 and 80. The means are
20 and 30. Find the values of standard deviation for both series.
Solution:
In this question we need to apply the formula for C.V. and substitute the given values.
Standard deviation of the first series:
C.V. = (σ / x̄) × 100, so 70 = (σ / 20) × 100, which gives σ = 1400 / 100 = 14
Thus, the standard deviation of the first series is 14.
Standard deviation of the second series:
80 = (σ / 30) × 100, which gives σ = 2400 / 100 = 24
Thus, the standard deviation of the second series is 24.
Example 3: Draw the frequency distribution table for the following data:
2, 3, 1, 4, 2, 2, 3, 1, 4, 4, 4, 2, 2, 2
Solution:
Since there are only a few distinct values in the series, we will use an ungrouped frequency distribution.

Value Frequency

1 2

2 6

3 2

4 4

Total 14

Example 4: The table below gives the values of temperature recorded in Hyderabad
for 25 days in summer. Represent the data in the form of less-than-type cumulative
frequency distribution:

37 34 36 27 22

25 25 24 26 28

30 31 29 28 30

32 31 28 27 30

30 32 35 34 29

Solution:
Since there are many distinct values here, we will use a grouped frequency distribution with the intervals 20-25, 25-30, 30-35, and 35-40. The frequency distribution table is made by counting the number of values lying in each interval.
Temperature Number of Days

20-25 2

25-30 10

30-35 10

35-40 3

This is the grouped frequency distribution table. It can be converted into a cumulative frequency distribution by adding the previous values.
Temperature Number of Days

Less than 25 2

Less than 30 12

Less than 35 22

Less than 40 25

Example 5: Make a Frequency Distribution Table as well as the curve for the data:
{45, 22, 37, 18, 56, 33, 42, 29, 51, 27, 39, 14, 61, 19, 44, 25, 58, 36, 48, 30, 53, 41, 28, 35,
47, 21, 32, 49, 16, 52, 26, 38, 57, 31, 59, 20, 43, 24, 55, 17, 50, 23, 34, 60, 46, 13, 40, 54,
15, 62}
Solution:
To create the frequency distribution table for given data, let’s arrange the data in
ascending order as follows:
{13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62}
Now, we can count the observations for intervals: 10-20, 20-30, 30-40, 40-50, 50-60 and
60-70.
Interval Frequency

10 – 20 7

20 – 30 10

30 – 40 10

40 – 50 10

50 – 60 10

60 – 70 3

From this data, we can plot the Frequency Distribution Curve as follows:
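One way to draw the curve is to plot each class frequency against the midpoint of its interval. A minimal matplotlib sketch using the table above:

import matplotlib.pyplot as plt

intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [7, 10, 10, 10, 10, 3]

# Plot frequencies against class midpoints to get a frequency polygon/curve
midpoints = [(lo + hi) / 2 for lo, hi in intervals]
plt.plot(midpoints, frequencies, marker="o")
plt.xlabel("Class interval midpoint")
plt.ylabel("Frequency")
plt.title("Frequency Distribution Curve")
plt.show()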

OUTLIERS
Outliers are extreme values that differ from most other data points in a dataset. They can
have a big impact on your statistical analyses and skew the results of any hypothesis tests.
It’s important to carefully identify potential outliers in your dataset and deal with them in an
appropriate manner for accurate results.
There are four ways to identify outliers:
1. Sorting method

2. Data visualization method

3. Statistical tests (z scores)


4. Interquartile range method

Outliers are values at the extreme ends of a dataset.


Some outliers represent true values from natural variation in the population. Other
outliers may result from incorrect data entry, equipment malfunctions, or
other measurement errors.
An outlier isn’t always a form of dirty or incorrect data, so you have to be careful with outliers in data cleansing. What you should do with an outlier depends on its most likely cause.

Four ways of calculating outliers


You can choose from several methods to detect outliers depending on your time and
resources.

Sorting method
You can sort quantitative variables from low to high and scan for extremely low or extremely
high values. Flag any extreme values that you find.
This is a simple way to check whether you need to investigate certain data points before using
more sophisticated methods.
Example: Sorting method
Your dataset for a pilot experiment consists of 8 values:
180 156 9 176 163 1827 166 171

You sort the values from low to high and scan for extreme values:
9 156 163 166 171 176 180 1827

Using visualizations
You can use software to visualize your data with a box plot, or a box-and-whisker plot, so
you can see the data distribution at a glance. This type of chart highlights minimum and
maximum values (the range), the median, and the interquartile range for your data.
Many computer programs highlight an outlier on a chart with an asterisk, and these will lie
outside the bounds of the graph.

Statistical outlier detection


Statistical outlier detection involves applying statistical tests or procedures to identify
extreme values.
You can convert extreme data points into z scores that tell you how many standard deviations
away they are from the mean.
If a value has a high enough or low enough z score, it can be considered an outlier. As a rule
of thumb, values with a z score greater than 3 or less than –3 are often determined to be
outliers.
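A small sketch of z-score screening in NumPy; the data here is hypothetical, chosen so that one value clearly exceeds the |z| > 3 rule of thumb (with a sample as small as the 8-value sorting example, no value can reach a z score of 3, so a larger sample is used):

import numpy as np

# Hypothetical sample: 19 typical measurements plus one suspicious value
data = np.array([98, 102, 99, 101, 100, 97, 103, 100, 99, 101,
                 102, 98, 100, 101, 99, 100, 97, 103, 98, 250])

# z score: how many standard deviations each value lies from the mean
z = (data - data.mean()) / data.std()

# Rule of thumb: |z| > 3 flags a potential outlier
print(data[np.abs(z) > 3])   # -> [250]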
Using the interquartile range
The interquartile range (IQR) tells you the range of the middle half of your dataset. You
can use the IQR to create “fences” around your data and then define outliers as any values
that fall outside those fences.

This method is helpful if you have a few values on the extreme ends of your dataset, but you
aren’t sure whether any of them might count as outliers.
Interquartile range method

1. Sort your data from low to high


2. Identify the first quartile (Q1), the median, and the third quartile (Q3).

3. Calculate your IQR = Q3 – Q1


4. Calculate your upper fence = Q3 + (1.5 * IQR)
5. Calculate your lower fence = Q1 – (1.5 * IQR)
6. Use your fences to highlight any outliers: your outliers are any values greater than your upper fence or less than your lower fence.

Example: Using the interquartile range to find outliers


We’ll walk you through the popular IQR method for identifying outliers using a step-by-step
example.
Your dataset has 11 values. You have a couple of extreme values in your dataset, so you’ll
use the IQR method to check whether they are outliers.
26 37 24 28 35 22 31 53 41 64 29

Step 1: Sort your data from low to high


First, you’ll simply sort your data in ascending order.
22 24 26 28 29 31 35 37 41 53 64
Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)
The median is the value exactly in the middle of your dataset when all values are ordered
from low to high.
Since you have 11 values, the median is the 6th value. The median value is 31.
22 24 26 28 29 31 35 37 41 53 64

Next, we’ll use the exclusive method for identifying Q1 and Q3. This means we remove the
median from our calculations.
The Q1 is the value in the middle of the first half of your dataset, excluding the median. The first quartile value is 26.
22 24 26 28 29

Your Q3 value is in the middle of the second half of your dataset, excluding the median. The
third quartile value is 41.
35 37 41 53 64

Step 3: Calculate your IQR


The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to calculate the
IQR.
Formula: IQR = Q3 – Q1
Calculation: Q1 = 26, Q3 = 41, so IQR = 41 – 26 = 15

Step 4: Calculate your upper fence


The upper fence is the boundary around the third quartile. It tells you that any values
exceeding the upper fence are outliers.
Formula: Upper fence = Q3 + (1.5 * IQR)
Calculation: Upper fence = 41 + (1.5 * 15) = 41 + 22.5 = 63.5

Step 5: Calculate your lower fence


The lower fence is the boundary around the first quartile. Any values less than the lower
fence are outliers.
Formula: Lower fence = Q1 – (1.5 * IQR)
Calculation: Lower fence = 26 – (1.5 * 15) = 26 – 22.5 = 3.5

Step 6: Use your fences to highlight any outliers


Go back to your sorted dataset from Step 1 and highlight any values that are greater than the
upper fence or less than your lower fence. These are your outliers.

 Upper fence = 63.5


 Lower fence = 3.5
22 24 26 28 29 31 35 37 41 53 64

You find one outlier, 64, in your dataset.
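The whole procedure can be scripted. Below is a NumPy sketch that reproduces this example; it computes Q1 and Q3 with the same exclusive, median-of-halves method used above, since np.percentile's default interpolation would give slightly different quartiles:

import numpy as np

data = np.array([26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29])
data.sort()

# Exclusive method: split the sorted data around the median,
# then take the median of each half as Q1 and Q3
n = len(data)
lower, upper = data[: n // 2], data[(n + 1) // 2 :]
q1, q3 = np.median(lower), np.median(upper)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, q3, iqr)               # 26.0 41.0 15.0
print(lower_fence, upper_fence)  # 3.5 63.5
print(data[(data < lower_fence) | (data > upper_fence)])  # [64]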

Dealing with outliers


Once you’ve identified outliers, you’ll decide what to do with them. Your main options are
retaining or removing them from your dataset. This is similar to the choice you’re faced with
when dealing with missing data.
For each outlier, think about whether it’s a true value or an error before deciding.

 Does the outlier line up with other measurements taken from the same participant?
 Is this data point completely impossible or can it reasonably come from
your population?
 What’s the most likely source of the outlier? Is it a natural variation or an error?
In general, you should try to accept outliers as much as possible unless it’s clear that they
represent errors or bad data.

Retain outliers
Just like with missing values, the most conservative option is to keep outliers in your dataset.
Keeping outliers is usually the better option when you’re not sure if they are errors.
With a large sample, outliers are expected and more likely to occur. But each outlier has less
of an effect on your results when your sample is large enough. The central
tendency and variability of your data won’t be as affected by a couple of extreme values
when you have a large number of values.
If you have a small dataset, you may also want to retain as much data as possible to make
sure you have enough statistical power. If your dataset ends up containing many outliers, you
may need to use a statistical test that’s more robust to them. Non-parametric statistical tests
perform better for these data.
Remove outliers
Outlier removal means deleting extreme values from your dataset before you
perform statistical analyses. You aim to delete any dirty data while retaining true extreme
values.
It’s a tricky procedure because it’s often impossible to tell the two types apart for sure.
Deleting true outliers may lead to a biased dataset and an inaccurate conclusion.
For this reason, you should only remove outliers if you have legitimate reasons for doing so.
It’s important to document each outlier you remove and your reasons so that other researchers
can follow your procedures.
VARIABILITY FOR QUALITATIVE AND RANKED DATA
A measure of variability is a value that indicates how varied, or spread out, a data set is. The
simplest measure of variability is the range. It measures the variation of the data by finding
the difference in the maximum data value and the minimum data value. The obvious flaw
with this type of measurement is that it only takes the most extreme data values into account
and is therefore very sensitive to outliers. As we will see, other measures like the interquartile
range, standard deviation and variance use the entire data set, so they are not as sensitive.
Formulas for Measures of Variability

 Range = maximum data value − minimum data value

 Interquartile Range = Quartile 3 − Quartile 1

 Variance: σ² = Σ(xᵢ − μ)² / n, where the sum runs over i = 1 to n

 Standard Deviation: σ = √( Σ(xᵢ − μ)² / n )

where n is the number of values in the data set x₁, x₂, ..., xₙ and μ is the mean of that data set.
Finding the Quartiles of a Data Set
To find the quartiles of a data set, first find the median. The median is also quartile 2 (Q2). Then split the ranked data into two halves: the values to the left of Q2 form the lower half and the values to the right of Q2 form the upper half. Note that Q2 is not part of either half, and each half should have the same number of values. You will see an example of this below.
Example of Calculating Range and Interquartile Range (IQR)
The height, in inches, of 10 students is measured and the results are shown below. Calculate
the range and interquartile range for the height of these students.
70, 74, 62, 68, 65, 70, 69, 63, 67, 66
The maximum data value is 74 and the minimum value is 62. Therefore the range is 74 - 62 =
12.
To find the IQR, we will first have to rank the data. This is done below.
62, 63, 65, 66, 67, 68, 69, 70, 70, 74
The median is between 67 and 68. That is (67+68)/2 = 67.5.
The lower half of the data is 62, 63, 65, 66, 67.
The median of the lower half is 65 and this is Quartile 1.
The upper half of the data is 68, 69, 70, 70, 74
The median of the upper half is 70 and this is Quartile 3.
Therefore the Interquartile Range is 70 - 65 = 5
Example of Calculating Standard Deviation and Variance
The height, in inches, of 10 students is measured and the results are shown below. Calculate
the standard deviation and variance for the height of these students.
70, 74, 62, 68, 65, 70, 69, 63, 67, 66
We will start by calculating the variance. To keep our calculations organized, we will use a
table to find the numerator from the formula.
First we must calculate the mean:
μ = (70 + 74 + 62 + 68 + 65 + 70 + 69 + 63 + 67 + 66) / 10 = 67.4

x x-67.4 (x-67.4)^2
70 2.6 6.76
74 6.6 43.56
62 -5.4 29.16
68 0.6 0.36
65 -2.4 5.76
70 2.6 6.76
69 1.6 2.56
63 -4.4 19.36
67 -0.4 0.16
66 -1.4 1.96
If we sum the rightmost column, we find Σ(xᵢ − μ)² = 116.4.
Thus, the variance is σ² = 116.4 / 10 = 11.64.
Notice that the standard deviation is simply the square root of the variance.
Thus, the standard deviation is σ = √11.64 ≈ 3.412.
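These results are easy to check in NumPy. Note that .var() and .std() divide by n by default, matching the population formulas used above:

import numpy as np

heights = np.array([70, 74, 62, 68, 65, 70, 69, 63, 67, 66])

print(heights.mean())   # 67.4
print(heights.var())    # 11.64  (population variance, divides by n)
print(heights.std())    # ~3.412 (square root of the variance)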

NORMAL DISTRIBUTIONS
In a normal distribution, data is symmetrically distributed with no skew. When plotted on a graph, the data follows a bell shape, with most values clustering around a central region and tapering off as they go further away from the center.
Normal distributions are also called Gaussian distributions or bell curves because of their shape.

Normal distributions have key characteristics that are easy to spot in graphs:
 The mean, median and mode are exactly the same.
 The distribution is symmetric about the mean—half the values fall below the mean
and half above the mean.
 The distribution can be described by two values: the mean and the standard deviation.

The mean is the location parameter while the standard deviation is the scale parameter.
The mean determines where the peak of the curve is centered. Increasing the mean moves the
curve right, while decreasing it moves the curve left.
The standard deviation stretches or squeezes the curve. A small standard deviation results in a
narrow curve, while a large standard deviation leads to a wide curve.

Empirical rule
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values lie in a
normal distribution:

 Around 68% of values are within 1 standard deviation from the mean.
 Around 95% of values are within 2 standard deviations from the mean.
 Around 99.7% of values are within 3 standard deviations from the mean.

Example: Using the empirical rule in a normal distribution
You collect SAT scores from students in a new test preparation course. The data follows a normal distribution with a mean score (M) of 1150 and a standard deviation (SD) of 150.
Following the empirical rule:
 Around 68% of scores are between 1,000 and 1,300, 1 standard deviation above and
below the mean.
 Around 95% of scores are between 850 and 1,450, 2 standard deviations above and
below the mean.
 Around 99.7% of scores are between 700 and 1,600, 3 standard deviations above and
below the mean.

The empirical rule is a quick way to get an overview of your data and check for any outliers
or extreme values that don’t follow this pattern.
If data from small samples do not closely follow this pattern, then other distributions like
the t-distribution may be more appropriate. Once you identify the distribution of your
variable, you can apply appropriate statistical tests.
Once you have the mean and standard deviation of a normal distribution, you can fit a normal
curve to your data using a probability density function.
In a probability density function, the area under the curve tells you probability. The normal
distribution is a probability distribution, so the total area under the curve is always 1 or
100%.
The formula for the normal probability density function looks fairly complicated. But to use
it, you only need to know the population mean and standard deviation.
For any value of x, you can plug in the mean and standard deviation into the formula to find
the probability density of the variable taking on that value of x.
The normal probability density formula is:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

where:
 f(x) = probability density
 x = value of the variable
 μ = mean
 σ = standard deviation
 σ² = variance
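The density can be evaluated directly from this formula. A minimal sketch using only the Python standard library, evaluated at the peak of the SAT example distribution (mean 1150, SD 150):

import math

def normal_pdf(x, mu, sigma):
    # f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(normal_pdf(1150, 1150, 150))   # density at the mean, ≈ 0.00266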

Standard normal distribution


The standard normal distribution, also called the z-distribution, is a special normal
distribution where the mean is 0 and the standard deviation is 1.
Every normal distribution is a version of the standard normal distribution that’s been
stretched or squeezed and moved horizontally right or left.

While individual observations from normal distributions are referred to as x, they are referred
to as z in the z-distribution. Every normal distribution can be converted to the standard
normal distribution by turning the individual values into z-scores.
Z-scores tell you how many standard deviations away from the mean each value lies.
You only need to know the mean and standard deviation of your distribution to find the z-
score of a value.
Z-score formula:

z = (x − μ) / σ

where:
 x = individual value
 μ = mean
 σ = standard deviation
We convert normal distributions into the standard normal distribution for several reasons:

 To find the probability of observations in a distribution falling above or below a given


value.
 To find the probability that a sample mean significantly differs from a known
population mean.
 To compare scores on different distributions with different means and standard
deviations.

Finding probability using the z-distribution


Each z-score is associated with a probability, or p-value, that tells you the likelihood of
values below that z-score occurring. If you convert an individual value into a z-score, you can
then find the probability of all values up to that value occurring in a normal distribution.
Example: Finding probability using the z-distribution
To find the probability of SAT scores in your sample exceeding 1380, you first find the z-score.
The mean of our distribution is 1150, and the standard deviation is 150. The z-score tells you how many standard deviations away 1380 is from the mean:

z = (1380 − 1150) / 150 = 230 / 150 ≈ 1.53

For a z-score of 1.53, the p-value is 0.937. This is the probability of SAT scores being 1380
or less (93.7%), and it’s the area under the curve left of the shaded area.

To find the shaded area, you subtract 0.937 from 1, the total area under the curve.
Probability of x > 1380 = 1 – 0.937 = 0.063
That means only about 6.3% of SAT scores in your sample exceed 1380.
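The same lookup can be done in code rather than with a z-table. A sketch using scipy.stats, assuming SciPy is available:

from scipy.stats import norm

z = (1380 - 1150) / 150   # ≈ 1.53
p_below = norm.cdf(z)     # ≈ 0.937, area to the left of z
p_above = 1 - p_below     # ≈ 0.063, probability of a score above 1380

print(round(z, 2), round(p_below, 3), round(p_above, 3))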

CORRELATION
Correlation analysis is a statistical technique for determining the strength of a link between
two variables. It is used to detect patterns and trends in data and to forecast future
occurrences.
 Consider a problem where several factors must be taken into account to reach an optimal conclusion.
 Correlation explains how these variables depend on each other.
 Correlation quantifies how strong the relationship between two variables
is. A higher value of the correlation coefficient implies a stronger association.
 The sign of the correlation coefficient indicates the direction of the
relationship between variables. It can be either positive, negative, or zero.
The Pearson correlation coefficient is the most often used metric of correlation. It expresses
the linear relationship between two variables in numerical terms. The Pearson correlation
coefficient, written as “r,” is as follows:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
where,
 r: correlation coefficient
 xᵢ: i-th value of the first dataset X
 x̄: mean of the first dataset X
 yᵢ: i-th value of the second dataset Y
 ȳ: mean of the second dataset Y
The correlation coefficient, denoted by “r”, ranges between -1 and 1.
r = -1 indicates a perfect negative correlation.
r = 0 indicates no linear correlation between the variables.
r = 1 indicates a perfect positive correlation.
Types of Correlation
There are three types of correlation:

1. Positive Correlation: Positive correlation indicates that two variables


have a direct relationship. As one variable increases, the other variable also
increases. For example, there is a positive correlation between height and
weight. As people get taller, they also tend to weigh more.
2. Negative Correlation: Negative correlation indicates that two variables
have an inverse relationship. As one variable increases, the other variable
decreases. For example, there is a negative correlation between price and
demand. As the price of a product increases, the demand for that product
decreases.
3. Zero Correlation: Zero correlation indicates that there is no relationship
between two variables. The changes in one variable do not affect the other
variable. For example, there is zero correlation between shoe size and
intelligence.
A positive correlation indicates that the two variables move in the same direction, while a
negative correlation indicates that the two variables move in opposite directions.
The strength of the correlation is measured by a correlation coefficient, which can range
from -1 to 1. A correlation coefficient of 0 indicates no correlation, while a correlation
coefficient of 1 or -1 indicates a perfect correlation.
Implementations
Python provides libraries such as “NumPy” and “Pandas” which have various methods to
ease various calculations, including correlation analysis.
Using NumPy
import numpy as np

# Create sample data


x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 3, 9, 1])

# Calculate correlation coefficient


correlation_coefficient = np.corrcoef(x, y)

print("Correlation Coefficient:", correlation_coefficient)


Output:
Correlation Coefficient: [[ 1. -0.3]
[-0.3 1. ]]
Using pandas
import pandas as pd

# Create a DataFrame with sample data


data = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [5, 7, 3, 9, 1]})

# Calculate correlation coefficient


correlation_coefficient = data['X'].corr(data['Y'])

print("Correlation Coefficient:", correlation_coefficient)


Output:
Correlation Coefficient: -0.3

SCATTER PLOTS
A scatter plot, also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter
diagram, is a type of plot or mathematical diagram using Cartesian coordinates to display
values for typically two variables for a set of data.

A scatter plot is a diagram where each value in the data set is represented by a dot.

The Matplotlib module has a method for drawing scatter plots, it needs two arrays of the
same length, one for the values of the x-axis, and one for the values of the y-axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]

y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

The x array represents the age of each car.

The y array represents the speed of each car.

Example

Use the scatter() method to draw a scatter plot diagram:

import sys
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
#Two lines to make our compiler able to draw:
plt.savefig(sys.stdout.buffer)
sys.stdout.flush()
Result:

REGRESSION
Regression is a statistical approach used to analyze the relationship between a dependent
variable (target variable) and one or more independent variables (predictor variables). The
objective is to determine the most suitable function that characterizes the connection
between these variables.
It seeks to find the best-fitting model, which can be utilized to make predictions or draw
conclusions.
Regression in Machine Learning
It is a supervised machine learning technique, used to predict the value of the dependent
variable for new, unseen data. It models the relationship between the input features and the
target variable, allowing for the estimation or prediction of numerical values.
Regression analysis is used when the output variable is a real or continuous value, such as “salary” or “weight”. Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane (a straight line when there is a single input feature) passing through the points.
Terminologies Related to Regression Analysis:
 Response Variable: The primary factor to predict or understand in
regression, also known as the dependent variable or target variable.
 Predictor Variable: Factors influencing the response variable, used to
predict its values; also called independent variables.
 Outliers: Observations with significantly low or high values compared to
others, potentially impacting results and best avoided.
 Multicollinearity: High correlation among independent variables, which
can complicate the ranking of influential variables.
 Underfitting and Overfitting: Overfitting occurs when an algorithm performs well on training data but poorly on testing data, while underfitting indicates poor performance on both datasets.
Regression Types
The main types of regression are:
 Simple Regression
o Used to predict a continuous dependent variable based
on a single independent variable.
o Simple linear regression should be used when there is
only a single independent variable.
 Multiple Regression
o Used to predict a continuous dependent variable based
on multiple independent variables.
o Multiple linear regression should be used when there
are multiple independent variables.
 NonLinear Regression
o Relationship between the dependent variable and
independent variable(s) follows a nonlinear pattern.
o Provides flexibility in modeling a wide range of
functional forms.
Regression Algorithms
There are many different types of regression algorithms, but some of the most common
include:
 Linear Regression
Linear regression is one of the simplest and most widely used statistical models. This
assumes that there is a linear relationship between the independent and dependent variables.
This means that the change in the dependent variable is proportional to the change in the
independent variables.
 Polynomial Regression
Polynomial regression is used to model nonlinear relationships between the dependent
variable and the independent variables. It adds polynomial terms to the linear regression
model to capture more complex relationships.
 Support Vector Regression (SVR)
Support vector regression (SVR) is a type of regression algorithm that is based on the
support vector machine (SVM) algorithm. SVM is a type of algorithm that is used for
classification tasks, but it can also be used for regression tasks. SVR works by finding a
hyperplane that minimizes the sum of the squared residuals between the predicted and
actual values.
 Decision Tree Regression
Decision tree regression is a type of regression algorithm that builds a decision tree to
predict the target value. A decision tree is a tree-like structure that consists of nodes and
branches. Each node represents a decision, and each branch represents the outcome of that
decision. The goal of decision tree regression is to build a tree that can accurately predict
the target value for new data points.
 Random Forest Regression
Random forest regression is an ensemble method that combines multiple decision trees to
predict the target value. Ensemble methods are a type of machine learning algorithm that
combines multiple models to improve the performance of the overall model. Random forest
regression works by building a large number of decision trees, each of which is trained on a
different subset of the training data. The final prediction is made by averaging the
predictions of all of the trees.

Regression Model Machine Learning


Let’s take an example of linear regression. We have a Housing data set and we want to
predict the price of the house. Following is the python code for it.
# Python code to illustrate
# regression using data set
import matplotlib.pyplot as plt


import numpy as np
from sklearn import datasets, linear_model
import pandas as pd

# Load CSV and columns


df = pd.read_csv("Housing.csv")

Y = df['price']
X = df['lotsize']

X=X.values.reshape(len(X),1)
Y=Y.values.reshape(len(Y),1)

# Split the data into training/testing sets


X_train = X[:-250]
X_test = X[-250:]

# Split the targets into training/testing sets


Y_train = Y[:-250]
Y_test = Y[-250:]

# Plot outputs
plt.scatter(X_test, Y_test, color='black')
plt.title('Test Data')
plt.xlabel('Size')
plt.ylabel('Price')
plt.xticks(())
plt.yticks(())

# Create linear regression object


regr = linear_model.LinearRegression()

# Train the model using the training sets


regr.fit(X_train, Y_train)

# Plot outputs
plt.plot(X_test, regr.predict(X_test), color='red',linewidth=3)
plt.show()

Output: a scatter plot of the test data with the fitted regression line drawn in red.

REGRESSION LINE
Regression Line is defined as a statistical concept that facilitates and predicts the
relationship between two or more variables. A regression line is a straight line that reflects
the best-fit connection in a dataset between independent and dependent variables. The
independent variable is generally shown on the X-axis and the dependent variable is shown
on the Y-axis. The main purpose of developing a regression line is to predict or estimate the
value of the dependent variable based on the values of one or more independent variables.
Equation of Regression Line
The equation of a simple linear regression line is given by:
Y = a + bX + ε
 Y is the dependent variable
 X is the independent variable
 a is the y-intercept, which represents the value of Y when X is 0.
 b is the slope, which represents the change in Y for a unit change in X
 ε is residual error.
Examples of Regression Line
Example 1:
A function facilitates the calculation of marks scored by the students when the number of
hours studied by them is given. The slope and y-intercept of the given function are 5 and 50
respectively. Using this information, form a regression line equation.
Solution:
When calculating the marks scored by students from the number of hours each of them studied, marks will be the dependent variable (i.e. marks will be represented by Y) and the number of hours studied will be the independent variable (i.e. the number of hours studied by the students will be represented by X). Now, the general linear regression equation is Y = a + bX.
We have been given that the y-intercept is 50, (i.e., a = 50) and the respective slope is 5,
(i.e. b = 5).
Therefore, the required equation of regression line will be,
Y = 50 + 5X + ε
Example 2:
In continuation with the above example, the figures for two students are given as follows:
Student 1: Studied for 2 hours and scored 60 marks.
Student 2: Studied for 3 hours and scored 65 marks.
What marks will another student score if he/she studies for 5 hours?
Solution:
The required equation of regression line as calculated in previous example is,
Y = 50 + 5X
For a student who studies for 5 hours (X = 5), the marks scored will be calculated as,
Y = 50 + 5X.
Y = 50 + 5(5)
Y = 75 Marks
Types of Regression Lines
1. Linear Regression Line: A linear regression line is used when there is a linear relationship between the dependent variable and at least one independent variable. The equation of a simple linear regression line is typically Y = a + bX + ε, where Y is the dependent variable, X is the independent variable, a is the y-intercept, b is the slope, and ε is the error.
2. Logistic Regression Line: Logistic regression is used when the dependent variable is
discrete. It models the probability of a binary outcome using a logistic function. The
equation is typically expressed as the log-odds of the probability.
3. Polynomial Regression Line: Polynomial regression is used when the relationship
between the dependent and independent variables is best represented by a polynomial
equation. The equation is Y = aX² + bX + c for a quadratic, or a higher-order polynomial equation.
4. Ridge and Lasso Regression: These are used for regularisation in linear regression.
Ridge and Lasso add penalty terms to the linear regression equation to prevent overfitting
and perform feature selection.
5. Non-Linear Regression Line: For situations where the relationship between variables is not linear, non-linear regression lines must be used to define the relationship.
6. Multiple Regression Line: This involves multiple independent variables to predict a dependent variable. It is an extension of linear regression.
7. Exponential Regression Line: Exponential Regression Line is formed when the data
follows an exponential growth or decay pattern. It is often seen in fields like biology,
finance, and physics.
8. Piecewise Regression Line: In this approach, the data is divided into segments, and a different linear or nonlinear model is applied to each segment.
9. Time Series Regression Line: This approach is used to deal with time-series data, and
models how the dependent variable changes over time.
10. Power Regression Line: This type of regression line is used when one variable
increases at a power of another. It can be applied to situations where exponential growth
does not fit.
Applications of Regression Line
Regression lines have numerous uses in a variety of domains, including:
1. Economics: Regression analysis is used in economics to anticipate economic trends,
evaluate consumer behaviour, and identify factors influencing economic variables such as
GDP, inflation, and unemployment.
2. Finance: Regression analysis is used in portfolio management to estimate risk and return
of investments. It aids in the prediction of stock prices, bond yields, and other financial
measures.
3. Medicine: Regression analysis is used in the medical field to investigate the link
between variables such as dosage and patient response, as well as to predict patient
outcomes based on a variety of criteria.
4. Marketing: Regression analysis is used by marketers to understand the impact of
advertising, pricing, and other marketing initiatives on sales and customer behavior.
5. Environmental Science: Regression analysis is used by researchers to model the link
between environmental parameters (such as temperature and pollution levels) and their
impact on ecosystems.
Importance of Regression Line
The regression line holds immense importance for several reasons:
1. Error Analysis: Regression analysis provides a way to assess the goodness of fit of a
model. By examining residuals (the differences between observed and predicted values),
one can identify patterns and trends in the errors, which further helps in the improvement of
models.
2. Variable Selection: Regression analysis helps in the selection of relevant variables.
While having a large dataset with many potential predictors, regression analysis can
provide guidance in identifying which variables have a significant impact on the outcome,
enabling more efficient and parsimonious models.
3. Quality Control: In manufacturing and quality control processes, regression analysis
can be used to monitor and control product quality. By understanding the relationship
between input variables and product quality, manufacturers can make adjustments to
maintain or improve quality standards.
4. Forecasting: Regression models can be used for time series analysis and forecasting.
This is valuable in industries like retail, where understanding historical sales data can help
in predicting future sales, optimising inventory levels, and planning for seasonal demand.
5. Risk Assessment: In finance and insurance, regression analysis is crucial for assessing
and managing risk. It can help identify factors affecting investment returns, loan defaults,
or insurance claims, aiding in risk assessment and pricing.
6. Policy Evaluation: In social sciences and public policy, regression analysis is employed
to evaluate the impact of policy changes or interventions. By examining the relationship
between policy variables and relevant outcomes, researchers can assess the effectiveness of
different policies and inform decision-makers.
Statistical Significance of Regression Line
In statistical analysis, it is crucial to determine whether the relationship between the
independent and dependent variables is statistically significant. This is usually done using
hypothesis tests and confidence intervals. A small p-value associated with the slope ‘b’
suggests that the relationship is statistically significant.
Applications of Regression Line
1. Predictive Analysis: Used to predict future values based on past data.
2. Trend Analysis: Helps in identifying and analyzing trends over time.
3. Correlation Analysis: Determines the strength and direction of the relationship between variables.
4. Risk Management: Assists in assessing and managing risks in various domains like finance and healthcare.
LEAST SQUARE REGRESSION LINE
Given a set of coordinates in the form of (X, Y), the task is to find the least squares regression line.
In statistics, Linear Regression is a linear approach to model the relationship between a
scalar response (or dependent variable), say Y, and one or more explanatory variables (or
independent variables), say X.
Regression Line: If our data shows a linear relationship between X and Y, then the straight line which best describes the relationship is the regression line. It is the straight line that passes as close as possible to the points in the graph, minimizing the total squared vertical distance from the points to the line.

EXAMPLE
Find the least squares regression line for the five-point data set and verify that it fits the data better than the given line.

Solution

In actual practice, computation of the regression line is done using a statistical computation package. In order to clarify the meaning of the formulas, we display the computations in tabular form.
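For reference, the least squares slope and intercept can be computed as b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. A small NumPy sketch on an illustrative five-point data set (this data is hypothetical, not the data set from the example above):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Least squares slope and intercept:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  a = y_bar - b * x_bar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)                 # 2.2 0.6, so the fitted line is y = 2.2 + 0.6x

# The same fit with NumPy's built-in polynomial fitting
print(np.polyfit(x, y, 1))  # [slope, intercept] = [0.6, 2.2]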
STANDARD ERROR OF ESTIMATE
Learning Objectives
1. Make judgments about the size of the standard error of the estimate from a scatter plot
2. Compute the standard error of the estimate based on errors of prediction
3. Compute the standard error using Pearson's correlation
4. Estimate the standard error of the estimate based on a sample
Figure 1 shows two regression examples. You can see that in Graph A, the points are closer
to the line than they are in Graph B. Therefore, the predictions in Graph A are more accurate
than in Graph B.
Figure 1. Regressions differing in accuracy of prediction.
The standard error of the estimate is a measure of the accuracy of predictions. Recall that the
regression line is the line that minimizes the sum of squared deviations of prediction (also
called the sum of squares error). The standard error of the estimate is closely related to this
quantity and is defined below:

σest = √( Σ(Y − Y')² / N )

where σest is the standard error of the estimate, Y is an actual score, Y' is a predicted score, and N is the number of pairs of scores. The numerator is the sum of squared differences between the actual scores and the predicted scores.
Note the similarity of the formula for σest to the formula for σ.  It turns out that σest is
the standard deviation of the errors of prediction (each Y - Y' is an error of prediction).
Assume the data in Table 1 are the data from a population of five X, Y pairs.
Table 1. Example data.
X Y Y' Y-Y' (Y-Y')2
1.00 1.00 1.210 -0.210 0.044
2.00 2.00 1.635 0.365 0.133
3.00 1.30 2.060 -0.760 0.578
4.00 3.75 2.485 1.265 1.600
5.00 2.25 2.910 -0.660 0.436
Sum 15.00 10.30 10.30 0.000 2.791
The last column shows that the sum of the squared errors of prediction is 2.791. Therefore, the standard error of the estimate is

σest = √(2.791 / 5) ≈ 0.747
There is a version of the formula for the standard error in terms of Pearson's correlation:

σest = √( (1 − ρ²) · SSY / N ),  where SSY = Σ(Y − μY)²

and ρ is the population value of Pearson's correlation. For the data in Table 1, μY = 2.06, SSY = 4.597 and ρ = 0.6268. Therefore,

σest = √( (1 − 0.6268²) × 4.597 / 5 ) ≈ 0.747

which is the same value computed previously.


Similar formulas are used when the standard error of the estimate is computed from a sample rather than a population. The only difference is that the denominator is N − 2 rather than N. The reason N − 2 is used rather than N − 1 is that two parameters (the slope and the intercept) were estimated in order to estimate the sum of squares. The formulas for a sample, comparable to the ones for a population, are:

sest = √( Σ(Y − Y')² / (N − 2) )  and  sest = √( (1 − r²) · SSY / (N − 2) )
MULTIPLE REGRESSION DEFINITION

Multiple regression analysis is a statistical technique that analyzes the relationship between two or more variables and uses the information to estimate the value of the dependent variable. In multiple regression, the objective is to develop a model that relates a dependent variable y to more than one independent variable.

MULTIPLE REGRESSION FORMULA

In linear regression, there is only one independent and dependent variable involved. But, in
the case of multiple regression, there will be a set of independent variables that helps us to
explain better or predict the dependent variable y.

The multiple regression equation is given by

y = a + b1x1 + b2x2 + … + bkxk

where x1, x2, ..., xk are the k independent variables, b1, ..., bk are their coefficients, a is the intercept, and y is the dependent variable.
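In practice the coefficients a, b1, ..., bk are estimated from data. A minimal sketch with scikit-learn on hypothetical data with two predictors (hours studied and hours slept):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: columns are x1 = hours studied, x2 = hours slept
X = np.array([[2, 7], [3, 6], [4, 8], [5, 5], [6, 7]])
y = np.array([60, 65, 75, 70, 85])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # a and [b1, b2]

# Predict y for a new observation with x1 = 4, x2 = 6
print(model.predict([[4, 6]]))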
