FDSA unit 2

UNIT- II

PROCESS MANAGEMENT
Normal distributions – z scores – normal curve problems – finding proportions – finding scores –
more about z scores – correlation – scatter plots – correlation coefficient for quantitative data –
computational formula for correlation coefficient – regression – regression line – least squares
regression line – standard error of estimate – interpretation of r2 – multiple regression equations
– regression toward the mean.

2.1 NORMAL DISTRIBUTIONS


The Normal Distribution is defined by the probability density function of a continuous
random variable. Let f(x) be the probability density function of the random variable X. Then
f(x) dx, integrated over the small interval (x to x + dx), gives the probability that X takes a
value between x and x + dx.

f(x) ≥ 0 ∀ x ϵ (−∞,+∞)

And ∫_{−∞}^{+∞} f(x) dx = 1

2.1.1 Normal Distribution Formula

The probability density function of the normal or Gaussian distribution is given by:

f(x) = [1 / (σ√(2π))] e^(−(x − μ)² / (2σ²))

Where,

 x is the variable
 μ is the mean
 σ is the standard deviation
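The density above can be evaluated directly in Python with only the standard library; the helper name `normal_pdf` is chosen here for illustration:

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density of a normal distribution with mean mu and std dev sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coeff * math.exp(exponent)

# Density of the standard normal (mu = 0, sigma = 1) at its peak:
print(round(normal_pdf(0, 0, 1), 4))  # 0.3989
```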

2.1.2 Normal Distribution Curve

A random variable following the normal distribution can take any value in a given range. For
example, consider the heights of students in a school: the measurements can take any value, but they are
physically bounded to a range, say 0 to 6 ft. This limitation is imposed physically by our query.
The normal distribution itself, however, places no such restriction: the range can extend from −∞
to +∞ and still yield a smooth curve. Such random variables are called continuous variables, and
the normal distribution gives the probability of the value lying in a particular range for a given
experiment.

2.1.3 Normal Distribution Standard Deviation

Generally, the normal distribution has any positive standard deviation. We know that the mean helps to
determine the line of symmetry of a graph, whereas the standard deviation helps to know how far the data
are spread out. If the standard deviation is smaller, the data are somewhat close to each other and the
graph becomes narrower. If the standard deviation is larger, the data are dispersed more, and the graph
becomes wider. The standard deviations are used to subdivide the area under the normal curve. Each
subdivided section defines the percentage of data, which falls into the specific region of a graph.

Using 1 standard deviation, the Empirical Rule states that,

 Approximately 68% of the data falls within one standard deviation of the mean. (i.e., Between
Mean- one Standard Deviation and Mean + one standard deviation)

 Approximately 95% of the data falls within two standard deviations of the mean. (i.e., Between
Mean- two Standard Deviation and Mean + two standard deviations)
 Approximately 99.7% of the data fall within three standard deviations of the mean. (i.e., Between
Mean- three Standard Deviation and Mean + three standard deviations)
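The three percentages of the Empirical Rule can be checked numerically: for a normal distribution, the proportion within k standard deviations of the mean equals erf(k/√2). A short sketch using Python's standard library:

```python
import math

def within_k_sigma(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k) * 100, 1), "%")
# 1 -> 68.3 %, 2 -> 95.4 %, 3 -> 99.7 %
```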

2.5 MEAN, MEDIAN AND MODE

 The Mean, Median and Mode are the three measures of central tendency.

 Mean is the arithmetic average of a data set.


 This is found by adding the numbers in a data set and dividing by the number of observations in
the data set.
 The median is the middle number in a data set when the numbers are listed in either
ascending or descending order.
 The mode is the value that occurs most often in a data set, and the range is the difference
between the highest and lowest values in a data set.

2.5.1 Mean

 Mean is the arithmetic average of a data set.

 This is found by adding the numbers in a data set and dividing by the number of observations in
the data set.

2.5.2 Median

 The median is the middle number in a data set when the numbers are listed in either
ascending or descending order.

 Median: Given that the data collection is arranged in ascending or descending order, the following
method is applied:

 If number of values or observations in the given data is odd, then the median is given by
[(n+1)/2]th observation.
 If in the given data set, the number of values or observations is even, then the median is given by
the average of (n/2)th and [(n/2) +1]th observation.

The median for grouped data can be calculated using the formula:

Median = l + [(N/2 − cf) / f] × h

where l is the lower boundary of the median class, N is the total frequency, cf is the cumulative
frequency of the class preceding the median class, f is the frequency of the median class, and h is
the class width.
2.5.3 Mode

• The mode is the value that occurs the most often in a data set and the range is the difference
between the highest and lowest values in a data set.
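The three measures can be computed with Python's standard `statistics` module; the data sets below reuse the worked examples that follow in this section:

```python
import statistics

# Mean of the first 10 odd integers:
odds = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
print(statistics.mean(odds))    # 10

# Median and mode of the mode example's data set:
data = [21, 19, 62, 21, 66, 28, 66, 48, 79, 59, 28,
        62, 63, 63, 48, 66, 59, 66, 94, 79, 19, 94]
print(statistics.median(data))  # 62.0 (average of the two middle values)
print(statistics.mode(data))    # 66
```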

EXAMPLE (MEAN)

1. Find the mean of the first 10 odd integers.

Solution:
First 10 odd integers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19

Mean = Sum of the first 10 odd integers/Number of such integers

= (1 + 3 + 5 + 7 + 9 + 11 + 13 + 15 + 17 + 19)/10

= 100/10

= 10

Therefore, the mean of the first 10 odd integers is 10.

EXAMPLE (MEDIAN )

1) What is the median of the following data set?

32, 6, 21, 10, 8, 11, 12, 36, 17, 16, 15, 18, 40, 24, 21, 23, 24, 24, 29, 16, 32, 31, 10, 30, 35, 32, 18, 39,
12, 20

Solution:

The ascending order of the given data set is:

6, 8, 10, 10, 11, 12, 12, 15, 16, 16, 17, 18, 18, 20, 21, 21, 23, 24, 24, 24, 29, 30, 31, 32, 32, 32, 35, 36,
39, 40

Number of values in the data set = n = 30

n/2 = 30/2 = 15

15th data value = 21

(n/2) +1 = 16

16th data value = 21

Median = [(n/2)th observation + {(n/2)+1}th observation]/2

= (15th data value + 16th data value)/2

= (21 + 21)/2

= 21

EXAMPLE (MODE )

Identify the mode for the following data set:

21, 19, 62, 21, 66, 28, 66, 48, 79, 59, 28, 62, 63, 63, 48, 66, 59, 66, 94, 79, 19, 94
Solution:

Let us write the given data set in ascending order as follows:

19, 19, 21, 21, 28, 28, 48, 48, 59, 59, 62, 62, 63, 63, 66, 66, 66, 66, 79, 79, 94, 94

Here, we can observe that the number 66 occurred the maximum number of times.

Thus, the mode of the given data set is 66.

FIG 2.6: Measures of central tendency (mean, median, mode)

2.6 DESCRIBING VARIABILITY

 Variability refers to how spread out scores are in a distribution; that is, it refers to the amount of
spread of the scores around the mean.

 For example, distributions with the same mean can have different amounts of variability or
dispersion

2.6.1 Range
The most basic measure of variation is the range, which is the distance from the smallest to the largest
value in a distribution.
Range= Largest value – Smallest Value
2.6.2 Interquartile Range

The interquartile range (IQR) is the range of the middle 50% of scores in a distribution:

IQR= 75th percentile – 25th percentile

 It is based on dividing a data set into quartiles.


 Quartiles are the values that divide scores into quarters.
 Q1 is the lower quartile and is the middle number between the smallest number and the median of
a data set.
 Q2 is the middle quartile-or median.
 Q3 is the upper quartile and is the middle value between the median and the highest value of
a data set.
 The interquartile range formula is the first quartile subtracted from the third quartile
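As a sketch, Python's `statistics.quantiles` (Python 3.8+) returns the three quartile cut points, from which the IQR follows; the data set here is illustrative:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3:
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # third quartile minus first quartile
print(q1, q2, q3, iqr)  # 3.0 6.0 9.0 6.0
```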

2.6.3 Variance
The variance is the average squared difference of the scores from the mean.
To compute the variance in a population:
1. Calculate the mean
2. Subtract the mean from each score to compute the deviation from mean score
3. Square each deviation score (multiply each score by itself)
4. Add up the squared deviation score to give the sum
5. Divide the sum by the number of scores
The table below contains students’ scores on a Statistics test. To calculate the variance:

TABLE 2.3 Calculating the Variance

Score   Deviation from the mean   Squared deviation
9 2 4
9 2 4
9 2 4
8 1 1
8 1 1
8 1 1
8 1 1
7 0 0
7 0 0
7 0 0
7 0 0
7 0 0
6 -1 1
6 -1 1
6 -1 1
6 -1 1
6 -1 1
6 -1 1
5 -2 4
5 -2 4

1. The mean is calculated: sum all scores and divide by the number of scores: 140/20 = 7.
2. The deviation from the mean is calculated for each score. For example, for the first score: 9 − 7 = 2.
See column Deviation from the mean.
3. Each deviation from the mean is squared (multiplied by itself). For the first score: 2 × 2 = 4.
See column Squared deviation.
4. Finally, the mean of the squared deviations is calculated: 30/20 = 1.5. The variance is 1.5.

The formula to calculate the variance in a population is:

σ² = Σ(X − μ)² / N

Where σ² is the variance

µ is the mean of a population

X are the values or scores

N is the number of values or scores
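The five steps and the population formula can be sketched in Python, using the twenty test scores from Table 2.3:

```python
scores = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7,
          6, 6, 6, 6, 6, 6, 5, 5]

mean = sum(scores) / len(scores)            # step 1: 140/20 = 7.0
deviations = [x - mean for x in scores]     # step 2: deviation from the mean
squared = [d ** 2 for d in deviations]      # step 3: square each deviation
total = sum(squared)                        # step 4: sum of squared deviations = 30.0
variance = total / len(scores)              # step 5: divide by N
print(mean, variance)  # 7.0 1.5
```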


If the variance in a sample is used to estimate the variance in a population, it is important to note that
samples are consistently less variable than their populations:

o The sample variability gives a biased estimate of the population variability.

o This bias is in the direction of underestimating the population value.


o In order to adjust this consistent underestimation of the population variance, we divide the sum of
the squared deviation by N-1 instead of N.

Formula to calculate the variance in a sample is:

s² = Σ(X − M)² / (N − 1)

Where s² is the variance of the sample

M is the sample mean

X are the values or scores

N is the number of values or scores in the sample
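The N versus N − 1 distinction corresponds to `statistics.pvariance` versus `statistics.variance` in Python; applied to the scores of Table 2.3, the sample formula gives the slightly larger, bias-corrected estimate:

```python
import statistics

scores = [9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7,
          6, 6, 6, 6, 6, 6, 5, 5]

pop_var = statistics.pvariance(scores)   # divides by N:     30/20 = 1.5
samp_var = statistics.variance(scores)   # divides by N - 1: 30/19 ≈ 1.579
print(pop_var, samp_var)
```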

2.6.4 Standard Deviation


The standard deviation is the average amount by which scores differ from the mean.
The standard deviation is the square root of the variance, and it is a useful measure of variability when
the distribution is normal or approximately normal (see below on the normality of distributions).
The proportion of the distribution within a given number of standard deviations (or distance) from the
mean can be calculated.
A small standard deviation indicates a small degree of variability (that is, scores are close
together); a large standard deviation indicates large variability (that is, scores are far apart).
The formula to calculate the population standard deviation is:

σ = √[Σ(X − μ)² / N]

EXAMPLE :

There are a total of 100 pirates on the ship; statistically, the population size is 100. If we knew
the number of gold coins every pirate has, we would use the population standard deviation
equation. Instead, consider a sample of 5 pirates; in this case we use the sample standard
deviation equation. Suppose the numbers of gold coins the 5 pirates have are 4, 2, 5, 8, 6.

Solution:

Mean = (4 + 2 + 5 + 8 + 6)/5 = 25/5 = 5

Sum of squared deviations = (4 − 5)² + (2 − 5)² + (5 − 5)² + (8 − 5)² + (6 − 5)²
= 1 + 9 + 0 + 9 + 1 = 20

Sample variance = 20/(5 − 1) = 5

Sample standard deviation = √5 ≈ 2.236
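The arithmetic of the pirate sample can be checked in Python:

```python
import math

coins = [4, 2, 5, 8, 6]
mean = sum(coins) / len(coins)                 # 5.0
ss = sum((x - mean) ** 2 for x in coins)       # sum of squared deviations: 20.0
sample_sd = math.sqrt(ss / (len(coins) - 1))   # sqrt(20/4) = sqrt(5) ≈ 2.236
pop_sd = math.sqrt(ss / len(coins))            # sqrt(20/5) = 2.0
print(ss, round(sample_sd, 3), pop_sd)  # 20.0 2.236 2.0
```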

Standard deviation of Grouped Data

In case of grouped data or grouped frequency distribution, the standard deviation can be found by
considering the frequency of data values. This can be understood with the help of an example.
N = ∑f = 55

Mean = (∑fxi)/N = 925/55 = 16.818

Variance = 1/(N – 1) [∑fxi2 – 1/N(∑fxi)2]

= 1/(55 – 1) [27575 – (1/55) (925)2]

= (1/54) [27575 – 15556.8182]

= 222.559

Standard deviation = √variance = √222.559 = 14.918
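The grouped-data formula can be wrapped in a small function; the class midpoints and frequencies below are hypothetical, chosen only so the result is easy to verify by hand:

```python
def grouped_variance(values, freqs):
    """Sample variance for grouped data: (1/(N-1)) * [sum(f*x^2) - (1/N)*(sum(f*x))^2]."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

# Hypothetical class midpoints and their frequencies:
mids = [2, 4, 6]
freqs = [1, 2, 1]
# N = 4, sum(fx) = 16, sum(fx^2) = 72, so variance = (72 - 256/4)/3 = 8/3
print(round(grouped_variance(mids, freqs), 3))  # 2.667
```

Expanding the grouped data to the raw list [2, 4, 4, 6] and applying the ordinary sample-variance formula gives the same 8/3, which is a quick sanity check on the function.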

2.7.1 Sample Standard Deviation (s).

 A rough measure of the average amount by which scores in the sample deviate on either side
of their sample mean:

s = √[SS / (n − 1)]

where SS = sum of squares and n = sample size.

2.7.2 Population Standard Deviation (σ).

 A rough measure of the average amount by which scores in the population deviate on either
side of their population mean:

σ = √(SS / N)

2.7.3 Degrees of Freedom (df)

 Degrees of freedom (df) refers to the number of values that are free to vary, given one or
more mathematical restrictions, in a sample being used to estimate a population
characteristic.

 When a sample of n scores is used to estimate population variability, the sample mean acts
as one restriction, leaving df = n − 1 values free to vary.

2.8 VARIABILITY FOR QUALITATIVE AND RANKED DATA

o Any statistical analysis is performed on data.

o Data is a collection of actual observations or scores in a survey or an experiment.

There are three types of data


1. Qualitative Data
2. Ranked Data
3. Quantitative Data
The precise form of a statistical analysis often depends on whether data are qualitative, ranked, or
quantitative.
Qualitative Data
 Qualitative data is a set of observations where any single observation is a word, letter, or
numerical code that represents a class or category.

 Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1)
that represent a class or category.
Ranked Data
 Ranked data is a set of observations where any single observation is a number that
indicates relative standing.
 Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative
standing within a group.
Quantitative Data

 Quantitative data is a set of observations where any single observation is a number that
represents an amount or a count.
 Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that
represent an amount or a count.
 To determine the type of data, focus on a single observation in any collection of
observations.
 For example, the weights reported by 53 male students in Table 2.1 are
quantitative data, since any single observation, such as 160 lbs, represents an
amount of weight.

2.8.1 Types of Categorical Data

In general, categorical data consist of values and observations that can be sorted into categories or
groups. The best way to represent such data is with bar graphs and pie charts.

Categorical data are further classified into two types namely,

 Nominal Data
 Ordinal Data

Nominal Data

 Nominal data is a type of data that is used to label the variables without providing any numerical
value. It is also known as the nominal scale.

 Nominal data cannot be ordered or measured, although the labels themselves may be either
words or numbers. Some common examples of nominal data are letters, words, symbols,
gender, etc.

 These data are analysed with the help of the grouping method. The variables are grouped
together into categories and the percentage or frequency can be calculated. It can be presented
visually using the pie chart.
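The grouping method described above (frequencies and percentages per category) can be sketched with `collections.Counter`; the responses below are hypothetical nominal data:

```python
from collections import Counter

responses = ["Y", "N", "Y", "Y", "N", "Y", "Y", "N"]  # hypothetical nominal data
counts = Counter(responses)
total = len(responses)

# Frequency and percentage for each category:
for category, count in counts.items():
    print(category, count, f"{100 * count / total:.1f}%")
# Y 5 62.5%
# N 3 37.5%
```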

Ordinal Data
 Ordinal data is a type of data that follows a natural order. A notable feature of ordinal data is
that the differences between data values cannot be meaningfully measured. It is commonly
encountered in surveys, questionnaires, finance and economics.

 The data can be analyzed using visualization tools. It is commonly represented using a bar chart.
Sometimes the data may be represented using tables in which each row in the table indicates the
distinct category.

FIG 2.7 : Categorical Data

2.9 NORMAL DISTRIBUTIONS

In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution,
is the most significant continuous probability distribution; its graph is often called a bell curve.
A large number of random variables, throughout the physical sciences and economics, are either
nearly or exactly normally distributed.
Furthermore, it can be used to approximate other probability distributions, which supports the use
of the word 'normal', in the sense of the typical or most common case.

2.9.1 Normal Distribution Definition

The Normal Distribution is defined by the probability density function of a continuous random variable.
Let f(x) be the probability density function of the random variable X. Then f(x) dx, integrated over the
small interval (x to x + dx), gives the probability that X takes a value between x and x + dx.

2.9.2 Normal Distribution Formula


The probability density function of the normal or Gaussian distribution is given by:

f(x) = [1 / (σ√(2π))] e^(−(x − μ)² / (2σ²))

2.9.3 Normal Distribution Curve


A random variable following the normal distribution can take any value in a given range. For
example, consider the heights of students in a school: the measurements can take any value, but they are
physically bounded to a range, say 0 to 6 ft. This limitation is imposed physically by our query.
The normal distribution itself, however, places no such restriction: the range can extend
from −∞ to +∞ and still yield a smooth curve. Such random variables are called continuous
variables, and the normal distribution gives the probability of the value lying in a particular
range for a given experiment; the probability density at any point is determined entirely by the
mean and the standard deviation.

2.9.4 Normal Distribution Standard Deviation


Generally, the normal distribution has any positive standard deviation. We know that the mean helps
to determine the line of symmetry of a graph, whereas the standard deviation helps to know how far the
data are spread out. If the standard deviation is smaller, the data are somewhat close to each other and
the graph becomes narrower. If the standard deviation is larger, the data are dispersed more, and the
graph becomes wider. The standard deviations are used to subdivide the area under the normal curve.
Each subdivided section defines the percentage of data, which falls into the specific region of a graph.
Using 1 standard deviation, the Empirical Rule states that,
 Approximately 68% of the data falls within one standard deviation of the mean. (i.e., Between
Mean- one Standard Deviation and Mean + one standard deviation)
 Approximately 95% of the data falls within two standard deviations of the mean. (i.e., Between
Mean- two Standard Deviation and Mean + two standard deviations)
 Approximately 99.7% of the data fall within three standard deviations of the mean. (i.e., Between
Mean- three Standard Deviation and Mean + three standard deviations)
FIG 2.8 : Normal Distribution Standard Deviation

Thus, the empirical rule is also called the 68 – 95 – 99.7 rule.

2.9.5 Normal Distribution Problems and Solutions


Question 1: Calculate the probability density function of normal distribution using the following
data. x = 3, μ = 4 and σ = 2.
Solution: Given, variable, x = 3
Mean = 4 and
Standard deviation = 2
By the formula of the probability density of the normal distribution, we can write:

f(3, 4, 2) = [1/(2√(2π))] e^(−(3 − 4)²/(2 × 2²)) = 0.19947 × e^(−0.125)

Hence, f(3, 4, 2) ≈ 0.176.


Question 2: If the value of random variable is 2, mean is 5 and the standard deviation is 4, then
find the probability density function of the gaussian distribution.
Solution: Given,
Variable, x = 2
Mean = 5 and
Standard deviation = 4
By the formula of the probability density of the normal distribution, we can write:

f(2, 5, 4) = [1/(4√(2π))] e^(−(2 − 5)²/(2 × 4²)) = 0.0997 × e^(−9/32)

f(2, 5, 4) ≈ 0.0753

There are two main parameters of a normal distribution: the mean and the standard deviation.
These serve as the location and scale parameters, respectively, of the given normal distribution.
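Both answers can be verified numerically by coding the density formula directly:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal probability density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(round(normal_pdf(3, 4, 2), 4))  # Question 1: 0.176
print(round(normal_pdf(2, 5, 4), 4))  # Question 2: 0.0753
```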

2.9.6 Normal Distribution Properties


Some of the important properties of the normal distribution are listed below:
 In a normal distribution, the mean, median and mode are equal.(i.e., Mean = Median= Mode).
 The total area under the curve should be equal to 1.
 The normally distributed curve should be symmetric at the centre.
 Exactly half of the values are to the left of the centre and exactly half are to the right.
 The normal distribution should be defined by the mean and standard deviation.
 The normal distribution curve must have only one peak. (i.e., Unimodal)
 The curve approaches the x-axis but never touches it, extending farther and farther away from the mean.

2.9.7 Applications
The normal distributions are closely associated with many things such as:
 Marks scored on the test
 Heights of different persons
 Size of objects produced by the machine
 Blood pressure and so on.
2.10 Z - SCORES

 A z-score gives us an idea of how far a data point is from the mean.

 It is an important topic in statistics. Z-scores are a method to compare results to a “normal”

population.

 For example, we may know that someone’s weight is 70 kg, but to compare it to the “average”

person’s weight, looking at a vast table of data can be overwhelming.

 A z-score tells us where that person’s weight lies compared to the average population’s

mean weight.

2.10.1 Z Score in Statistics

 A z score is a measure of how many standard deviations below or above the population mean a
raw score lies.
 It is positive if the value lies above the mean and negative if it lies below the mean. It is also
known as the standard score.

 It indicates how many standard deviations an entity is from the mean. In order to use a z-score,
the mean μ and the population standard deviation σ must be known.
 A z score helps to calculate the probability of a score occurring within a standard normal
distribution. It also enables us to compare two scores that are from different samples.

 A table for the values of ϕ, indicating the values of the cumulative distribution function of the
normal distribution is termed as a z score table.

2.10.2 Formula

The equation is given by z = (x – μ)/ σ.

μ = mean

σ = standard deviation

x = test value

When we work with a sample mean rather than a single score, the standard deviation is replaced
by the standard error σ/√n, and the z score becomes:

z = (x̄ – μ)/(σ/√n)
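Both z-score formulas translate directly into Python; the function names below are chosen here for illustration, and the check values come from the two worked examples in this section:

```python
def z_score(x, mu, sigma):
    """Standard score of a single raw value."""
    return (x - mu) / sigma

def z_score_of_mean(xbar, mu, sigma, n):
    """z for a sample mean, using the standard error sigma/sqrt(n)."""
    return (xbar - mu) / (sigma / n ** 0.5)

print(z_score(190, 130, 30))               # 2.0   (Example 1)
print(round(z_score(1100, 1026, 209), 3))  # 0.354 (Example 2)
```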

2.10.3 Interpretation

1. If a z-score is equal to -1, then it denotes an element, which is 1 standard deviation less than the
mean.

2. If a z score is less than 0, then it denotes an element less than the mean.

3. If a z score is greater than 0, then it denotes an element greater than the mean.

4. If the z score is equal to 0, then it denotes an element equal to the mean.

5. If the z score is equal to 1, it denotes an element, which is 1 standard deviation greater than the mean;
a z score equal to 2 signifies 2 standard deviations greater than the mean; etc.

Example 1

The test score is 190. The test has a mean of 130 and a standard deviation of 30. Find the z score.
(Assume it is a normal distribution)

Solution:

Given test score x = 190

Mean, μ = 130

Standard deviation, σ = 30
So z = (x – μ)/ σ

= (190 – 130)/ 30

= 60/30

=2

Hence, the required z score is 2.

Example 2: You score 1100 for an exam. The mean score for the exam is 1026 and the standard
deviation is 209. How well did you score on the test compared to the average test taker?

Solution:

Given test score x = 1100

Mean, μ = 1026

Standard deviation, σ = 209

So z = (x – μ)/ σ

= (1100-1026)/209

= 0.354

This means that your score was 0.354 standard deviation above the mean.

2.10.4 Applications of z- score.

 A z-score is used in the medical field to find how a certain newborn baby’s weight compares to the
mean weight of all babies.
 It is used to find how a certain shoe size compares to the mean population size.

2.11 CORRELATION

Correlation is a statistical technique to ascertain the association or relationship between two or more
variables. Correlation analysis is a statistical technique to study the degree and direction of relationship
between two or more variables

2.11.1 Correlation Co Efficient

A correlation coefficient is a statistical measure of the degree to which changes in the value of one
variable predict changes in the value of another. When a fluctuation in one variable reliably predicts a
similar fluctuation in another, there is often a temptation to conclude that the change in one causes the
change in the other; correlation alone, however, does not establish causation.

2.11.2 Scatter Diagram

A scatter diagram is a diagram that shows the values of two variables X and Y, along with the way in
which these two variables relate to each other. The values of variable X are given along the horizontal
axis, with the values of the variable Y given on the vertical axis.

Later, when the regression model is used, one of the variables is defined as an independent variable, and
the other is defined as a dependent variable. In regression, the independent variable X is considered to
have some effect or influence on the dependent variable Y. Correlation methods are symmetric with
respect to the two variables, with no indication of causation or direction of influence being part of the
statistical consideration. A scatter diagram is given in the following example. The same example is later
used to determine the correlation coefficient.

2.11.3 TYPES OF CORRELATION

The scatter plot explains the correlation between the two attributes or variables. It represents how closely
the two variables are connected. There can be three such situations to see the relation between the two
variables

 Positive Correlation – when the values of the two variables move in the same direction so that an
increase/decrease in the value of one variable is followed by an increase/decrease in the value of
the other variable.
 Negative Correlation – when the values of the two variables move in the opposite direction so that
an increase/decrease in the value of one variable is followed by decrease/increase in the value of
the other variable.
 No Correlation – when there is no linear dependence or no relation between the two variables.
FIG 2.9 : Types Of Correlation

2.11.4 CORRELATION FORMULA

Correlation shows the relation between two variables. Correlation coefficient shows the
measure of correlation. To compare two datasets, we use the correlation formulas.

2.11.5 PEARSON CORRELATION COEFFICIENT FORMULA

The most common formula is the Pearson correlation coefficient, used for linear dependency between
data sets. The value of the coefficient lies between −1 and +1. When the coefficient is zero, the data
are considered unrelated; a value of +1 indicates perfect positive correlation, and −1 perfect negative
correlation.

r = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²] [n Σy² − (Σy)²]}

Where n = Quantity of Information

Σx = Total of the First Variable Value

Σy = Total of the Second Variable Value

Σxy = Sum of the Product of first & Second Value


Σx2 = Sum of the Squares of the First Value

Σy2 = Sum of the Squares of the Second Value
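The computational formula translates directly into Python; the data reused below are from the worked example later in this section:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient via the computational formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    return num / den

# Data from the worked example in this unit:
x = [1, 3, 5, 7, 8, 10]
y = [8, 12, 15, 17, 18, 20]
print(round(pearson_r(x, y), 3))  # 0.988
```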

2.11.6 LINEAR CORRELATION COEFFICIENT FORMULA

The formula for the linear correlation coefficient is given by:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

2.11.7 SAMPLE CORRELATION COEFFICIENT FORMULA

rxy = Sxy/SxSy

Where Sx and Sy are the sample standard deviations, and Sxy is the sample covariance.

2.11.8 POPULATION CORRELATION COEFFICIENT FORMULA

The population correlation coefficient uses σx and σy as the population standard deviations and σxy as the
population covariance:

ρxy = σxy / (σx σy)

2.11.9 Simple, Partial and Multiple Correlation:

The distinction between simple, partial and multiple correlation is based upon the number of
variables studied.

Simple Correlation:

When only two variables are studied, it is a case of simple correlation. For example, studying the
relationship between the marks secured by a student and the attendance of that student in class is a
problem of simple correlation.

Partial Correlation:

Partial correlation is the measure of association between two variables, while controlling or
adjusting the effect of one or more additional variables.
Multiple Correlation:

When three or more variables are studied, it is a case of multiple correlation. For example, the study
above becomes a problem of multiple correlation if it is extended to cover a third variable as well.

2.11.10 Linear and Non-linear Correlation:

Depending upon the constancy of the ratio of change between the variables, the correlation may be
linear or non-linear.

Linear Correlation:

If the amount of change in one variable bears a constant ratio to the amount of change in the other
variable, then the correlation is said to be linear. If such variables are plotted on graph paper, all the
plotted points fall on a straight line. For example, if it is assumed that producing one unit of finished
product requires 10 units of raw material, then producing 2 units requires double that amount.



Raw material (X):     10   20   30   40   50   60
Finished product (Y):  2    4    6    8   10   12

Non-linear Correlation: If the amount of change in one variable does not bear a constant ratio to the
amount of change in the other variable, then the correlation is said to be non-linear. If such variables
are plotted on a graph, the points fall on a curve rather than a straight line. For example, if we double
the amount of advertisement expenditure, the sales volume would not necessarily double.

Karl Pearson’s Coefficient of Correlation:


 Karl Pearson’s method of calculating coefficient of correlation is based on the covariance of the
two variables in a series.

 This method is widely used in practice and the coefficient of correlation is denoted by the symbol
“r”. If the two variables under study are X and Y, the following formula suggested by Karl Pearson
can be used for measuring the degree of correlation:

r = [n ΣXY − (ΣX)(ΣY)] / √{[n ΣX² − (ΣX)²] [n ΣY² − (ΣY)²]}
EXAMPLE :

1) Compute the coefficient of correlation between X and Y using the following data.

X: 1 3 5 7 8 10
Y: 8 12 15 17 18 20
Solution:

X      Y      X²      Y²      XY
1      8      1       64      8
3      12     9       144     36
5      15     25      225     75
7      17     49      289     119
8      18     64      324     144
10     20     100     400     200
34     90     248     1446    582

Thus n = 6, ΣX = 34, ΣY = 90, ΣX² = 248, ΣY² = 1446, ΣXY = 582.

Coefficient of correlation:

r = [6(582) − (34)(90)] / √{[6(248) − 34²] [6(1446) − 90²]}
  = (3492 − 3060) / √(332 × 576)
  = 432 / 437.3
  ≈ 0.99

2) The marks obtained by the students in Mathematics and Statistics are given below. Find
the correlation Co-efficient between the two subjects.

Marks in Mathematics: 75  35  60  80  53  35  15  40  38  48
Marks in Statistics:  85  45  54  91  58  63  35  43  45  44

Solution:
Let X denote the marks in Mathematics and Y the marks in Statistics.

X      Y      X²       Y²       XY
75     85     5625     7225     6375
35     45     1225     2025     1575
60     54     3600     2916     3240
80     91     6400     8281     7280
53     58     2809     3364     3074
35     63     1225     3969     2205
15     35     225      1225     525
40     43     1600     1849     1720
38     45     1444     2025     1710
48     44     2304     1936     2112
479    563    26457    34815    29816

The correlation coefficient is given by:

r = [10(29816) − (479)(563)] / √{[10(26457) − 479²] [10(34815) − 563²]}
  = (298160 − 269677) / √(35129 × 31181)
  = 28483 / 33096.2
  ≈ 0.86

2.12 SCATTER PLOTS


Scatter plots are the graphs that present the relationship between two variables in a data-set. It
represents data points on a two-dimensional plane or on a Cartesian system. The independent variable
or attribute is plotted on the X-axis, while the dependent variable is plotted on the Y-axis. These plots are
often called scatter graphs or scatter diagrams.

2.12.1 Scatter Plots Graph:

A scatter plot is also called a scatter chart, scattergram, or XY graph. The scatter
diagram graphs numerical data pairs, with one variable on each axis, showing their relationship. The
natural question is: when should a scatter plot be used?

Scatter plots are used in either of the following situations.


 When we have paired numerical data
 When there are multiple values of the dependent variable for a unique value of an independent
variable
 In determining the relationship between variables in some scenarios, such as identifying potential
root causes of problems, or checking whether two products that appear to be related share the
same cause, and so on.

2.12.2 Scatter Plot Uses and Examples


Scatter plots instantly report a large volume of data. It is beneficial in the following situations
 For a large set of data points given
 Each set comprises a pair of values
 The given data is in numeric form

FIG 2.10 Scatter Plot

Scatter plot Example


Let us understand how to construct a scatter plot with the help of the below example.
Question:
Draw a scatter plot for the given data that shows the number of games played and scores obtained
in each instance.

Solution:
X-axis or horizontal axis : Number of games
Y-axis or vertical axis : Scores
Now, the scatter graph will be :
2.13 REGRESSION

Regression analysis refers to assessing the relationship between an outcome variable and one

or more other variables. The outcome variable is known as the dependent or response variable, and the

risk factors and confounders are known as predictors or independent variables. The dependent variable

is denoted by “y” and the independent variables by “x” in regression analysis.

For example, a correlation of r = 0.8 indicates a positive and strong association among two variables,

while a correlation of r = -0.3 shows a negative and weak association. A correlation near to zero shows the

non-existence of linear association among two continuous variables.

2.13.1 Linear Regression

Linear regression is a linear approach to modelling the relationship between the scalar

components and one or more independent variables. If the regression has one independent variable, then

it is known as a simple linear regression. If it has more than one independent variable, then it is known as

multiple linear regression. Linear regression only focuses on the conditional probability distribution of the

given values rather than the joint probability distribution. In general, all the real world regressions models

involve multiple predictors. So, the term linear regression often describes multivariate linear regression.
FIG 2.11 Correlation VS Regression

2.13.2 Differences between Correlation and Regression

 Correlation quantifies the degree to which two variables are associated. It does not fit
a line through the data points; you compute a single coefficient that shows how much one variable
tends to change when the other changes. When r is 0.0, there is no linear relationship. When r
is positive, one variable tends to increase as the other increases. When r is negative, one variable
tends to decrease as the other increases.

 Linear regression finds the best line that predicts y from x, whereas correlation does not fit a line.

 Correlation is used when you measure both variables, while linear regression is mostly applied
when x is a variable that is manipulated.

2.13.3 Linear Regression Equation

The extent of the relationship between two variables is measured by the correlation
coefficient. The value of this coefficient lies between −1 and +1 and shows the strength of the
association in the observed data for the two variables.

A linear regression line equation is written in the form of:

Y = a + bX

 where X is the independent variable and plotted along the x-axis

 Y is the dependent variable and plotted along the y-axis


 The slope of the line is b, and a is the intercept (the value of y when x = 0).

2.13.4 Linear Regression Formula

 Linear regression shows the linear relationship between two variables.


 The equation of linear regression is similar to the slope formula that we learned earlier, for
example with linear equations in two variables.

 It is given by; Y= a + bX

 Now we need to find the value of the slope b of the line plotted in the scatter plot, and the
intercept a.
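The slope b and intercept a can be computed directly from the least-squares formulas above; a minimal pure-Python sketch (the helper name fit_line and the sample data are illustrative):

```python
# Least-squares slope (b) and intercept (a) for the line Y = a + bX,
# a minimal pure-Python sketch of the formulas above.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x   # the fitted line passes through (x_bar, y_bar)
    return a, b

a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(a, b)  # a perfect line y = 2x gives a = 0.0, b = 2.0
```

Note that the intercept is recovered from the means, which is also why the regression line always passes through the point (x̄, ȳ).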

2.13.5 Least Squares Regression Line or Linear Regression Line

 The most popular method to fit a regression line in an XY plot is the method of least squares.
This process determines the best-fitting line for the observed data by minimizing the sum of the squares
of the vertical deviations from each data point to the line.

 If a point lies exactly on the fitted line, its vertical deviation is 0.

 Because the deviations are first squared and then added, their positive and negative values do not
cancel.
FIG 2.12 : Least Square Regression Line

 Linear regression determines the straight line, called the least-squares regression line or LSRL, that
best fits the observations in a bivariate data set. Suppose Y is a dependent variable
and X is an independent variable; then the population regression line is given by:

Y = B0+B1X

Where

B0 is a constant

B1 is the regression coefficient

If a random sample of observations is given, then the regression line is expressed by;

ŷ = b0 + b1x

where b0 is a constant, b1 is the regression coefficient, x is the independent variable, and ŷ is the predicted

value of the dependent variable.


2.13.6 Properties of Linear Regression

For the regression line where the regression parameters b0 and b1 are defined, the properties are

given as:

 The line minimizes the sum of squared differences between observed values and predicted values.

 The regression line passes through the mean of the X and Y values.

 The regression constant (b0) is the y-intercept of the regression line.

 The regression coefficient (b1) is the slope of the regression line which is equal to the average

change in the dependent variable (Y) for a unit change in the independent variable (X).

2.13.7 Regression Coefficient

In the linear regression line, we have seen the equation is given by;

Y = B0+B1X

Where

B0 is a constant

B1 is the regression coefficient

Now, let us see the formula to find the value of the regression coefficient.

B1 = b1 = Σ[ (xi – x̄)(yi – ȳ) ] / Σ[ (xi – x̄)² ]

Where xi and yi are the observed values, and x̄ and ȳ are their mean values.

EXAMPLE:

1) Obtain the equation of the regression lines from the following data using the method of least squares. Hence find the
coefficient of correlation between X and Y. Also estimate the value of Y when and the value of when
X : 22 26 29 30 31 33 34 35
Y : 20 20 21 29 27 24 27 31 (M/J 2009)
Solution:
Let U = X − 30 and V = Y − 25 (assumed means).

X Y U V U² V² UV
22 20 -8 -5 64 25 40
26 20 -4 -5 16 25 20
29 21 -1 -4 1 16 4
30 29 0 4 0 16 0
31 27 1 2 1 4 2
33 24 3 -1 9 1 -3
34 27 4 2 16 4 8
35 31 5 6 25 36 30
ΣU = 0, ΣV = −1, ΣU² = 132, ΣV² = 127, ΣUV = 101

Here n = 8, so x̄ = 30 + ΣU/n = 30 and ȳ = 25 + ΣV/n = 25 − 1/8 = 24.875.

byx = (nΣUV − ΣUΣV)/(nΣU² − (ΣU)²) = (8 × 101 − 0)/(8 × 132 − 0) = 808/1056 ≈ 0.765
bxy = (nΣUV − ΣUΣV)/(nΣV² − (ΣV)²) = 808/(8 × 127 − 1) = 808/1015 ≈ 0.796

Hence the regression line of Y on X is Y − 24.875 = 0.765(X − 30), i.e. Y = 0.765X + 1.93

Hence the regression line of X on Y is X − 30 = 0.796(Y − 24.875), i.e. X = 0.796Y + 10.20

Coefficient of correlation: r = √(byx × bxy) = √(0.765 × 0.796) = √0.609 ≈ 0.78


2) Obtain the equation of the lines of regression from the following data:
X: 1 2 3 4 5 6 7
Y: 9 8 10 12 11 13 14
Solution:
Here n = 7, so x̄ = 28/7 = 4 and ȳ = 77/7 = 11.

X Y X−x̄ Y−ȳ (X−x̄)² (Y−ȳ)² (X−x̄)(Y−ȳ)
1 9 -3 -2 9 4 6
2 8 -2 -3 4 9 6
3 10 -1 -1 1 1 1
4 12 0 1 0 1 0
5 11 1 0 1 0 0
6 13 2 2 4 4 4
7 14 3 3 9 9 9
28 77 0 0 28 28 26

byx = Σ(X−x̄)(Y−ȳ) / Σ(X−x̄)² = 26/28 ≈ 0.929
bxy = Σ(X−x̄)(Y−ȳ) / Σ(Y−ȳ)² = 26/28 ≈ 0.929

Equation of the line of regression of x on y: X − 4 = 0.929(Y − 11), i.e. X = 0.929Y − 6.214

Equation of the line of regression of y on x: Y − 11 = 0.929(X − 4), i.e. Y = 0.929X + 7.286
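The arithmetic of Example 2 can be checked with a short script; a hedged sketch assuming the data listed above:

```python
# Verify the sums and regression slopes for Example 2.
X = [1, 2, 3, 4, 5, 6, 7]
Y = [9, 8, 10, 12, 11, 13, 14]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n                        # means: 4.0 and 11.0
Sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))   # 26
Sxx = sum((x - mx) ** 2 for x in X)                    # 28
Syy = sum((y - my) ** 2 for y in Y)                    # 28
b_yx = Sxy / Sxx   # slope of y on x = 26/28
b_xy = Sxy / Syy   # slope of x on y = 26/28
print(round(b_yx, 4), round(b_xy, 4))  # 0.9286 0.9286
```

Both slopes coincide here only because Σ(X−x̄)² happens to equal Σ(Y−ȳ)² for this data set.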

2.16 STANDARD ERROR OF ESTIMATE


 In statistics, the standard error is the standard deviation of a statistic's sampling distribution.
 The sample mean of a data set generally differs from the actual population mean.

 It is represented as SE.

 It is used to measure how accurately the given sample represents its population.
Statistics is a vast topic in which we learn about data, sample and population, mean, median, mode,
dependent and independent variables, standard deviation, variance, etc. Here you will learn the
standard error formula along with SE of the mean and estimation.

2.16.1 Standard Error Formula


 The accuracy with which a sample describes a population is identified through the SE formula.
 The sample mean deviates from the population mean, and that deviation is given by:

SE = S / √n

Where S is the standard deviation and n is the number of observations.

2.16.2 Standard Error of the Mean (SEM)


 The standard error of the mean, also called the standard deviation of the mean, is the
standard deviation of the sample mean as an estimate of the population mean. It is abbreviated as SEM.
For example, the usual estimator of the population mean is the sample mean; but if we draw
another sample from the same population, it may provide a distinct value.
 Thus, there would be a population of the sampled means having its distinct variance and mean. It
may be defined as the standard deviation of such sample means of all the possible samples taken
from the same given population. SEM defines an estimate of standard deviation which has been
computed from the sample. It is calculated as the ratio of the standard deviation to the root of
sample size, such as:

SEM = s / √n
 Where 's' is the standard deviation and n is the number of observations.
 The standard error of the mean shows us how the mean varies across repeated measurements
of the same quantity. If the effect of random variation is notable, the standard error of
the mean will have a higher value; but if repeated experiments yield the same data points,
the standard error of the mean will be zero.
2.16.3 Standard Error of Estimate (SEE)

 The standard error of the estimate measures the accuracy of the predictions made from a
regression line. It is denoted as SEE. The regression line minimizes the sum of squared deviations of
prediction, also known as the sum of squares error. SEE is the square root of the average
squared deviation:

SEE = √( Σ(yi − ŷi)² / (n − 2) )

 Where yi are the observed values, ŷi are the predicted values, and n is the sample size.

2.16.4 Standard Error Formula

 Standard error is an important statistical measure, and it is closely related to the standard deviation.
 The accuracy with which a sample represents a population is known through this formula.

 The sample mean deviates from the population mean, and that deviation is the standard error:

SE = s / √n

Where,

s is the standard deviation

n is the number of observation

EXAMPLE :

Calculate the standard error of the given data:

x: 10, 12, 16, 21, 25

Solution:

Mean = (10 + 12 + 16 + 21 + 25)/5 = 84/5 = 16.8

Standard Deviation can be calculated as
s = √( Σ(xi − x̄)² / (n − 1) ) = √(154.8/4) = √38.7
= 6.22

Standard Error:
SE = s/√n = 6.22/√5 = 6.22/2.236 = 2.782

2.16.5 How to Calculate Standard Error

Step 1: Note the number of measurements (n) and determine the sample mean (x̄). It is the average of
all the measurements.

Step 2: Determine how much each measurement deviates from the mean.

Step 3: Square all the deviations determined in step 2 and add them together: Σ(xi – x̄)²

Step 4: Divide the sum from step 3 by one less than the total number of measurements (n − 1).

Step 5: Take the square root of the obtained number, which is the sample standard deviation (s).

Step 6: Finally, divide the standard deviation by the square root of the number of
measurements (n) to get the standard error of your estimate.

Go through the example given above to understand the method of calculating standard error.
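The six steps can also be sketched in Python (the function name standard_error is illustrative); applied to the data 10, 12, 16, 21, 25, it reproduces the worked example:

```python
import math

def standard_error(values):
    n = len(values)
    mean = sum(values) / n                          # step 1: sample mean
    deviations = [v - mean for v in values]         # step 2: deviations
    ss = sum(d * d for d in deviations)             # step 3: sum of squares
    variance = ss / (n - 1)                         # step 4: divide by n - 1
    s = math.sqrt(variance)                         # step 5: sample SD
    return s / math.sqrt(n)                         # step 6: SE = s / sqrt(n)

se = standard_error([10, 12, 16, 21, 25])
print(round(se, 3))  # ≈ 2.782, matching the worked example above
```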

2.17 INTERPRETATION OF r2

 The coefficient of determination, or R squared, is the proportion of the variance in the
dependent variable that is predictable from the independent variable.

 It indicates the level of variation in the given data set.

 The coefficient of determination is the square of the correlation (r); thus it ranges from 0 to 1.
 With linear regression, the coefficient of determination is equal to the square of the correlation
between the x and y variables.
 If R2 is equal to 0, then the dependent variable cannot be predicted from the independent
variable.
 If R2 is equal to 1, then the dependent variable can be predicted from the independent variable
without any error.
 If R2 is between 0 and 1, it indicates the extent to which the dependent variable is
predictable. An R2 of 0.10 means that 10 percent of the variance in the y variable is predicted from
the x variable; an R2 of 0.20 means that 20 percent of the variance in y is predicted from x,
and so on.

2.17 .1 Coefficient of Determination Formula


 We can give the formula to find the coefficient of determination in two ways:
 one using the correlation coefficient and the other using sums of squares.
Formula 1:
As we know, the formula of the correlation coefficient is:

r = [ nΣxy − (Σx)(Σy) ] / √( [nΣx² − (Σx)²] [nΣy² − (Σy)²] )

Where
Where
n = Total number of observations
Σx = Total of the First Variable Value
Σy = Total of the Second Variable Value
Σxy = Sum of the Product of first & Second Value
Σx2 = Sum of the Squares of the First Value
Σy2 = Sum of the Squares of the Second Value
Thus, the coefficient of determination = (correlation coefficient)2 = r2
Formula 2:
The formula of coefficient of determination is given by:
R2 = 1 – (RSS/TSS)
Where,
R2 = Coefficient of Determination
RSS = Residuals sum of squares
TSS = Total sum of squares
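Both routes give the same answer for a least-squares line: R² = 1 − RSS/TSS equals the square of the correlation coefficient. A small sketch (the helper name r_squared and the data set are illustrative):

```python
# R^2 via Formula 2: fit the least-squares line, then 1 - RSS/TSS.
def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    preds = [a + b * x for x in xs]
    rss = sum((y - p) ** 2 for y, p in zip(ys, preds))   # residual sum of squares
    tss = sum((y - my) ** 2 for y in ys)                 # total sum of squares
    return 1 - rss / tss

# For this data the correlation r is 0.8, so R^2 = 0.8^2 = 0.64.
print(round(r_squared([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), 3))  # 0.64
```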

2.17.2 Properties of Coefficient of Determination


 It gives the proportion of the variance in one variable that can be predicted from the other.

 If we want to check how reliable predictions made from the given data are, we can determine
this with this measurement.

 It helps to find the ratio of explained variation to total variation: R2 = Explained variation / Total variation.

 It also lets us know the strength of the association(linear) between the variables.

 If the value of r2 gets close to 1, the values of y lie close to the regression line; similarly,
if it gets close to 0, the values lie farther from the regression line.

 It helps in determining the strength of association between different variables.

2.17.3 Steps to Find the Coefficient of Determination


1. Find r, Correlation Coefficient
2. Square ‘r’.
3. Change the above value to a percentage.
EXAMPLE:

1. Find the coefficient of determination for the following set of data:

X Y
2 2
5 5
6 4
7 3

Solution:

Given data is

X Y
2 2
5 5
6 4
7 3

Create the table out of given scores

X Y XY X² Y²
2 2 4 4 4
5 5 25 25 25
6 4 24 36 16
7 3 21 49 9
∑X=20 ∑Y=14 ∑XY=74 ∑X²=114 ∑Y²=54

Here

N=4

Correlation coefficient:

r = (NΣXY − ΣXΣY) / √[ (NΣX² − (ΣX)²)(NΣY² − (ΣY)²) ]
= (4 × 74 − 20 × 14) / √[ (4 × 114 − 400)(4 × 54 − 196) ]
= 16 / √(56 × 20) = 16/√1120 = 0.478

Coefficient of determination:

R2 = (0.478)2

= 0.22848
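Formula 1 can be verified for this data set in a few lines of Python:

```python
import math

# Data from the worked example above.
X = [2, 5, 6, 7]
Y = [2, 5, 4, 3]
n = len(X)
sx, sy = sum(X), sum(Y)                    # 20, 14
sxy = sum(x * y for x, y in zip(X, Y))     # 74
sx2 = sum(x * x for x in X)                # 114
sy2 = sum(y * y for y in Y)                # 54
# r = (N*Sxy - Sx*Sy) / sqrt((N*Sx2 - Sx^2)(N*Sy2 - Sy^2))
r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
print(round(r, 3), round(r ** 2, 4))  # 0.478 0.2286
```

The slight difference from 0.22848 in the hand calculation comes from squaring the already-rounded 0.478.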
2.18 MULTIPLE REGRESSION EQUATION

 Multiple regression analysis is a statistical technique that analyzes the relationship between two
or more variables and uses the information to estimate the value of the dependent variable. In
multiple regression, the objective is to develop a model that relates a dependent variable y to
more than one independent variable.
 In simple linear regression, there is only one independent variable and one dependent variable
involved. But in the case of multiple regression, there is a set of independent variables that helps us
to explain better or predict the dependent variable y.
 The multiple regression equation is given by
 y = a + b1x1 + b2x2 + …… + bkxk
 where x1, x2, ….xk are the k independent variables and y is the dependent variable.
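A multiple regression fit can be sketched by solving the normal equations (XᵀX)b = Xᵀy in pure Python. The helper names and the tiny data set below are illustrative; y is generated exactly from y = 1 + 2x1 + 3x2, so the recovered coefficients are known in advance:

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small linear system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):   # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def multiple_regression(rows, y):
    # rows: list of (x1, x2, ...); fits y = a + b1*x1 + b2*x2 + ...
    X = [[1.0] + list(r) for r in rows]   # prepend the intercept column
    k = len(X[0])
    XtX = [[sum(X[i][p] * X[i][q] for i in range(len(X))) for q in range(k)]
           for p in range(k)]
    Xty = [sum(X[i][p] * y[i] for i in range(len(X))) for p in range(k)]
    return solve(XtX, Xty)   # normal equations: (X'X) b = X'y

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
coef = multiple_regression([(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)],
                           [1, 3, 4, 6, 8])
print([round(c, 6) for c in coef])  # [1.0, 2.0, 3.0]
```

Because the data lie exactly on the plane, least squares recovers a = 1, b1 = 2, b2 = 3 with zero residual.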
2.18.1 Multiple Regression Analysis Definition
 Multiple regression analysis makes it possible to control explicitly for many other factors that
simultaneously influence the dependent variable.
 The objective of regression analysis is to model the relationship between a dependent variable and
one or more independent variables.
 Let k represent the number of independent variables, denoted by x1, x2, x3, ……, xk. Such an equation is
useful for predicting the value of y when the values of the x's are known.
2.18.2 Stepwise Multiple Regression
 The Stepwise regression is a step-by-step process that begins by developing a regression
model with a single predictor variable and adds or deletes predictor variables one step at a
time.
 The Stepwise multiple regression is the method to determine a regression equation that begins
with a single independent variable and add independent variables one by one.
 The stepwise multiple regression method is also known as the forward selection method
because we begin with no independent variables and add one independent variable to the
regression equation at each of the iterations.
 There is another method, called the backward elimination method, which begins with the entire set
of variables and eliminates one independent variable at each of the iterations.
Residual: The variation in the dependent variable that is not explained by the regression model is called
residual or error variation. It is also known as random error or sometimes just "error". This is random
error due to sampling variation.

2.18.3 Advantages of Stepwise Multiple Regression


 Only independent variables with nonzero regression coefficients are included in the regression
equation.
 The changes in the multiple standard errors of estimate and the coefficient of determination are
shown.

 The stepwise multiple regression is efficient in finding the regression equation with only significant
regression coefficients.
 The steps involved in developing the regression equation are clear.
2.18.4 Multivariate Multiple Regression
 Mostly, the statistical inference has been kept at the bivariate level.
 Inferential statistical tests have also been developed for multivariate analyses, which analyses
the relation among more than two variables.
 A commonly used extension of correlation analysis for multivariate inference is multiple
regression analysis.
 Multiple regression analysis shows the correlation between each set of independent and
dependent variables.

2.18.5 Multicollinearity


Multicollinearity is the term used to describe the case when the inter-correlation of predictor variables
is high.

2.18.6 Signs of Multicollinearity


 A high correlation between pairs of predictor variables.
 The magnitudes or signs of the regression coefficients do not make good physical sense.
 Non-significant regression coefficients on significant predictors.
 Extreme sensitivity of the magnitude or sign of the regression coefficients to the insertion or
deletion of a predictor variable.
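The first sign, high correlation between predictor pairs, can be checked directly. A sketch with hypothetical predictors (floor area of the same flats recorded in square metres and in square feet, so the two are nearly perfectly correlated):

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

# Hypothetical predictors: including both invites multicollinearity,
# since one is essentially a rescaling of the other.
sq_metres = [50, 65, 80, 95, 120]
sq_feet = [538, 700, 861, 1023, 1292]
print(round(pearson_r(sq_metres, sq_feet), 4))  # very close to 1.0 — a red flag
```

In practice one of such a pair would be dropped, or a method such as ridge regression used instead.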

2.19 REGRESSION TO THE MEAN

 Regression to the mean (RTM) is a statistical phenomenon describing how variables that are much higher or
lower than the mean on a first measurement tend to be closer to the mean when measured a second time.
 Regression to the mean is due to natural variation or chance. It can be observed in everyday life,
particularly in research that intentionally focuses on the most extreme cases or events. It is
sometimes also called regression toward the mean.
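RTM can be demonstrated with a small simulation; the population parameters and the ability-plus-luck model below are arbitrary assumptions chosen for illustration:

```python
import random

random.seed(42)

# Simulate two test sittings: score = stable ability + transient luck.
# The bottom 10% on test 1 includes many students who were merely unlucky,
# so on test 2 their average rises toward the mean with no intervention at all.
n = 1000
ability = [random.gauss(70, 8) for _ in range(n)]
test1 = [a + random.gauss(0, 8) for a in ability]
test2 = [a + random.gauss(0, 8) for a in ability]

bottom = sorted(range(n), key=lambda i: test1[i])[:n // 10]   # worst 10% on test 1
m1 = sum(test1[i] for i in bottom) / len(bottom)
m2 = sum(test2[i] for i in bottom) / len(bottom)
print(round(m1, 1), round(m2, 1))  # the group's test-2 mean is noticeably higher
```

This is exactly why randomized evaluations with a control group are needed: the control group regresses toward the mean by the same amount, isolating the true effect of the intervention.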

2.19.1 Regression to the mean examples

 Regression to the mean often happens when measuring the effects of an intervention.
 Example: Measuring the effects of an intervention You are interested in finding out whether
an online, self-paced course can help middle school students address learning gaps in
math. A school in your area agrees to be part of the pilot study.
 To find out which students are most in need, you administer a math test to a class of 8th-
grade students. You pick the worst-performing 10% of students, and assign them to the
online course.

 When the course is complete, the 10% of students with the worst performance take
another test. Their scores, on average, show improvement. The principal, pleased with the
result, decides to launch the online course for all 8th-grade students who are
underperforming in math.

 At the end of the year, these students’ scores are not much better than they were the
previous year. They certainly didn’t improve to the degree you expected based upon the
results of the worst-performing 10% of students.

 The problem here is regression to the mean. Among the students who did poorly on the
first test were also students who didn’t perform well due to chance: perhaps they didn’t
sleep well the night before, or they were sick or stressed out. These students were going
to do better on the second test regardless of the intervention (the online program). Thus,
they brought up the average score of the worst-performing 10%.

 Relatedly, randomized evaluations are essential in avoiding regression to the mean when
estimating the effects of an intervention.
PART-A Questions
1. Define correlation.
2. Define correlation coefficient.
3. Explain the various applications of correlation.
4. Define multiple correlation.

5. Assess relationships between variables in different scenarios and determine the
nature of the correlation based on data patterns (positive correlation, negative
correlation, no correlation).

6. Apply the properties of correlation coefficients.

7. Explain the applications of regression analysis.

8. Differentiate between correlation and regression.

9. Define regression coefficient.

10. Explain the characteristics and differences between the types of data.

11. Define mean.

12. Define median

13. Define outliers.


14. Define mode

15. Apply graphical representations to visually display data, such as histograms, scatter
plots, and bar charts, to facilitate a better understanding of patterns and trends.

16. Explain probability distributions.

17. Define range

18. Define standard deviation.

19. Apply the concept of relative frequency distribution to create a table.

20. Define z scores.

21. Define Population Standard Deviation (σ).


PART-B

1. Evaluate the appropriateness of selecting a specific type of correlation


2. Discuss strategies for identifying and dealing with outliers.
3. Apply different types of graphs (histograms, box plots, etc.) to visually represent data
distributions.
4. Describe the steps involved in creating a frequency distribution and provide an example.
5. Compare and contrast the mean, median, and mode. Discuss situations where each is most
appropriate.
6. Define and calculate the interquartile range. Discuss its advantages over the range.
7. Explain the characteristics of a normal distribution and its importance in statistical analysis.
8. Provide examples of scatter plots and interpret the relationships shown.
9. Explain the purpose of regression analysis and how it is used for prediction.
10. Discuss the concept of multiple regression and its applications.
11. Find the coefficient of correlation for the following heights (in inches) of father (X) and their
sons (Y)
X: 65 66 67 67 68 69 70 72
Y: 67 68 65 68 72 72 69 71 [AU N/D 2016,A/M 2021] .
12.Given the following pairs of values:
Capital Employed(Rs.InCrore) 1 2 3 4 5 7 8 9 11 12

Profit(Rs.InLakhs) 3 5 4 7 9 8 10 11 12 14
i.Draw a scatter diagram.
ii.Do you think that there is any correlation between profits and capital employed? Is it
positive or negative? Is it high or low?
13. Find Karl Pearson’s coefficient of correlation between capital employed and profit
obtained from the following data.
Capital Employed 10 20 30 40 50 60 70 80 90 100
(Rs.InCrore)

Profit (Rs.InCrore) 2 4 8 5 10 15 14 20 22 50
