FDSA unit 2
Normal distributions – z scores – normal curve problems – finding proportions – finding scores –
more about z scores – correlation – scatter plots – correlation coefficient for quantitative data –
computational formula for correlation coefficient – regression – regression line – least squares
regression line – standard error of estimate – interpretation of r2 – multiple regression equations
– regression toward the mean.
f(x) = [1/(σ√(2π))] e^(−(x−μ)²/(2σ²)),  with f(x) ≥ 0 ∀ x ∈ (−∞, +∞)
Where,
x is the variable
μ is the mean
σ is the standard deviation
Random variables following the normal distribution are those that can take any value in a given range. For example, consider the heights of students in a school. The distribution can take any value, but it is bounded within a range, say 0 to 6 ft. This limitation is imposed physically by the nature of the query.
The normal distribution itself, however, places no such restriction on the range. The range can extend from −∞ to +∞ and we still obtain a smooth curve. Such random variables are called continuous variables, and the normal distribution then gives the probability of the value lying in a particular range for a given experiment.
Generally, the normal distribution has any positive standard deviation. We know that the mean helps to
determine the line of symmetry of a graph, whereas the standard deviation helps to know how far the data
are spread out. If the standard deviation is smaller, the data are somewhat close to each other and the
graph becomes narrower. If the standard deviation is larger, the data are dispersed more, and the graph
becomes wider. The standard deviations are used to subdivide the area under the normal curve. Each
subdivided section defines the percentage of data, which falls into the specific region of a graph.
Approximately 68% of the data falls within one standard deviation of the mean. (i.e., Between
Mean- one Standard Deviation and Mean + one standard deviation)
Approximately 95% of the data falls within two standard deviations of the mean. (i.e., Between
Mean- two Standard Deviation and Mean + two standard deviations)
Approximately 99.7% of the data fall within three standard deviations of the mean. (i.e., Between
Mean- three Standard Deviation and Mean + three standard deviations)
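The 68-95-99.7 rule above can be checked empirically. The sketch below (sample size and random seed are arbitrary choices, not from the text) draws a large standard-normal sample and measures the fraction within 1, 2 and 3 standard deviations of the mean:

```python
import random

random.seed(42)
# Draw a large sample from a standard normal distribution (mean 0, SD 1)
data = [random.gauss(0, 1) for _ in range(100_000)]

def within(sample, k):
    # Fraction of values lying within k standard deviations of the mean
    return sum(-k <= x <= k for x in sample) / len(sample)

print(round(within(data, 1), 2))   # close to 0.68
print(round(within(data, 2), 2))   # close to 0.95
print(round(within(data, 3), 3))   # close to 0.997
```

The fractions fluctuate slightly from run to run, but converge on 68%, 95% and 99.7% as the sample grows.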
2.5 MEASURES OF CENTRAL TENDENCY
The mean, median and mode are the three measures of central tendency.
2.5.1 Mean
This is found by adding the numbers in a data set and dividing by the number of observations in
the data set.
2.5.2 Median
The median is the middle number in a data set when the numbers are listed in either
ascending or descending order.
Median: Given that the data collection is arranged in ascending or descending order, the following
method is applied:
If number of values or observations in the given data is odd, then the median is given by
[(n+1)/2]th observation.
If in the given data set, the number of values or observations is even, then the median is given by
the average of (n/2)th and [(n/2) +1]th observation.
The median for grouped data can be calculated using the formula:
Median = l + [(n/2 − cf)/f] × h
where l is the lower limit of the median class, n the total frequency, cf the cumulative frequency of the class preceding the median class, f the frequency of the median class, and h the class width.
2.5.3 Mode
• The mode is the value that occurs most often in a data set; the range is the difference between the highest and lowest values in a data set.
EXAMPLE (MEAN)
Find the mean of the first 10 odd integers.
Solution:
First 10 odd integers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19
= (1 + 3 + 5 + 7 + 9 + 11 + 13 + 15 + 17 + 19)/10
= 100/10
= 10
EXAMPLE (MEDIAN)
Find the median of the following data:
32, 6, 21, 10, 8, 11, 12, 36, 17, 16, 15, 18, 40, 24, 21, 23, 24, 24, 29, 16, 32, 31, 10, 30, 35, 32, 18, 39,
12, 20
Solution:
Arranging the data in ascending order:
6, 8, 10, 10, 11, 12, 12, 15, 16, 16, 17, 18, 18, 20, 21, 21, 23, 24, 24, 24, 29, 30, 31, 32, 32, 32, 35, 36,
39, 40
n/2 = 30/2 = 15
(n/2) +1 = 16
= (21 + 21)/2
= 21
EXAMPLE (MODE)
Find the mode of the following data:
21, 19, 62, 21, 66, 28, 66, 48, 79, 59, 28, 62, 63, 63, 48, 66, 59, 66, 94, 79, 19, 94
Solution:
19, 19, 21, 21, 28, 28, 48, 48, 59, 59, 62, 62, 63, 63, 66, 66, 66, 66, 79, 79, 94, 94
Here, we can observe that the number 66 occurs the maximum number of times (four times). Therefore, the mode is 66.
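The three worked examples above can be reproduced with Python's standard statistics module (a sketch; the data are taken directly from the examples):

```python
import statistics

# Mean of the first 10 odd integers
odds = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
print(statistics.mean(odds))      # 10

# Median of the 30 values: average of the 15th and 16th sorted values
scores = [32, 6, 21, 10, 8, 11, 12, 36, 17, 16, 15, 18, 40, 24, 21,
          23, 24, 24, 29, 16, 32, 31, 10, 30, 35, 32, 18, 39, 12, 20]
print(statistics.median(scores))  # 21.0

# Mode: the most frequently occurring value
values = [21, 19, 62, 21, 66, 28, 66, 48, 79, 59, 28,
          62, 63, 63, 48, 66, 59, 66, 94, 79, 19, 94]
print(statistics.mode(values))    # 66
```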
2.6 MEASURES OF VARIABILITY
Variability refers to how spread out the scores in a distribution are; that is, it refers to the amount of spread of the scores around the mean.
For example, distributions with the same mean can have different amounts of variability or
dispersion
2.6.1 Range
The most basic measure of variation is the range, which is the distance from the smallest to the largest
value in a distribution.
Range= Largest value – Smallest Value
2.6.2 Interquartile Range
The interquartile range (IQR) is the range of the middle 50% of scores in a distribution:
IQR = Q3 − Q1
2.6.3 Variance
The variance is the average squared difference of the scores from the mean.
To compute the variance in a population:
1. Calculate the mean
2. Subtract the mean from each score to compute the deviation from mean score
3. Square each deviation score (multiply each score by itself)
4. Add up the squared deviation score to give the sum
5. Divide the sum by the number of scores
The table below contains students’ scores on a Statistics test. To calculate the variance:
1. The mean is calculated: sum all scores and divide by the number of scores: 140/20= 7
2. The deviation from the mean for each score is calculated. For example, for the first score: 9 − 7 = 2.
See column Deviation from the mean.
3. Each deviation from the mean is squared (multiplied by itself). For the first score: 2 × 2 = 4.
See column Squared deviation.
4. Finally, the mean of the squared deviations is calculated. The variance is 1.5.
The formula to calculate the variance of a population is:
σ² = Σ(X − μ)² / N
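The five computation steps can be sketched directly in Python (the scores below are hypothetical, not the 20-score table from the text):

```python
def population_variance(scores):
    n = len(scores)
    mean = sum(scores) / n                    # step 1: calculate the mean
    deviations = [x - mean for x in scores]   # step 2: deviation from the mean
    squared = [d ** 2 for d in deviations]    # step 3: square each deviation
    total = sum(squared)                      # step 4: sum of squared deviations
    return total / n                          # step 5: divide by the number of scores

# Hypothetical scores with mean 7, echoing the test-score example
print(population_variance([9, 5, 8, 6]))  # 2.5
```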
EXAMPLE :
There are a total of 100 pirates on the ship; statistically, this means that the population is 100. We use the population standard deviation equation if we know the number of gold coins every pirate has. If instead we consider only a sample of 5 pirates, we have a sample size of 5, and in this case we use the standard deviation equation for a sample of a population. Consider the number of gold coins 5 pirates have: 4, 2, 5, 8, 6.
Solution:
Mean = (4 + 2 + 5 + 8 + 6)/5 = 25/5 = 5
Sum of squared deviations = (4 − 5)² + (2 − 5)² + (5 − 5)² + (8 − 5)² + (6 − 5)² = 20
Sample standard deviation = √(20/4) = √5 ≈ 2.236
In case of grouped data or grouped frequency distribution, the standard deviation can be found by
considering the frequency of data values. This can be understood with the help of an example.
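Since the grouped-data table itself is not reproduced here, the following sketch uses assumed class midpoints and frequencies (hypothetical data, chosen so that N = Σf = 55) to show the computation:

```python
import math

# Assumed class midpoints and frequencies (hypothetical data)
midpoints   = [10, 20, 30, 40, 50]
frequencies = [ 5, 12, 18, 14,  6]

N = sum(frequencies)                                          # N = sum of f
mean = sum(f * x for x, f in zip(midpoints, frequencies)) / N
variance = sum(f * (x - mean) ** 2
               for x, f in zip(midpoints, frequencies)) / N
sd = math.sqrt(variance)
print(N, round(mean, 3), round(sd, 3))
```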
Degrees of freedom (df) refers to the number of values that are free to
vary, given one or more mathematical restrictions, in a sample being
used to estimate a population characteristic.
Qualitative Data
Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1) that represent a class or category.
Ranked Data
Ranked data is a set of observations where any single observation is a number that indicates relative standing. Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative standing within a group.
Quantitative Data
Quantitative data consist of numbers that represent counts or measurements.
Categorical data, in general, have values and observations which can be sorted into categories or groups. The best way to represent these data is with bar graphs and pie charts. Categorical data are of two types:
Nominal Data
Ordinal Data
Nominal Data
Nominal data is a type of data that is used to label the variables without providing any numerical
value. It is also known as the nominal scale.
Nominal data cannot be ordered or measured, although nominal data can sometimes carry both qualitative and quantitative aspects. A few common examples of nominal data are letters, words, symbols, gender, etc.
These data are analysed with the help of the grouping method. The variables are grouped
together into categories and the percentage or frequency can be calculated. It can be presented
visually using the pie chart.
Ordinal Data
Ordinal data is a type of data that follows a natural order. A notable feature of ordinal data is that the differences between data values cannot be determined. It is commonly encountered in surveys, questionnaires, finance and economics.
The data can be analyzed using visualization tools. It is commonly represented using a bar chart.
Sometimes the data may be represented using tables in which each row in the table indicates the
distinct category.
2.9 NORMAL DISTRIBUTION
In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution, is the most significant continuous probability distribution. Sometimes it is also called a bell curve.
A large number of random variables are either nearly or exactly represented by the normal
distribution, in every physical science and economics.
Furthermore, it can be used to approximate other probability distributions, which supports the use of the word 'normal', in the sense of the distribution most widely used.
The Normal Distribution is defined by the probability density function for a continuous random variable
in a system. Let us say, f(x) is the probability density function and X is the random variable. Hence, it
defines a function which is integrated between the range or interval (x to x + dx), giving the probability of
random variable X, by considering the values between x and x+dx.
For example, with x = 2, mean μ = 2 and standard deviation σ = 4:
f(2; 2, 4) = [1/(4√(2π))] e⁰
f(2; 2, 4) ≈ 0.0997
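The worked value can be reproduced by coding the density function directly (a sketch of the standard formula):

```python
import math

def normal_pdf(x, mu, sigma):
    # f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coeff * math.exp(exponent)

# x = 2, mu = 2, sigma = 4: the exponent is 0, so f = 1/(4*sqrt(2*pi))
print(round(normal_pdf(2, 2, 4), 4))  # 0.0997
```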
There are two main parameters of a normal distribution in statistics, namely the mean and the standard deviation. The location (mean) and scale (standard deviation) parameters of a given normal distribution can be estimated from sample data.
2.9.7 Applications
The normal distributions are closely associated with many things such as:
Marks scored on the test
Heights of different persons
Size of objects produced by the machine
Blood pressure and so on.
2.10 Z - SCORES
A z-score gives us an idea of how far a data point is from the mean of the population. For example, we may know that someone's weight is 70 kg, but if we want to compare it to the "average", a z-score tells us where that person's weight stands relative to the population's average weight.
A measure of how many standard deviations below or above the population mean a raw score is
called z score.
It will be positive if the value lies above the mean and negative if it lies below the mean. It is also
known as standard score.
It indicates how many standard deviations an entity is, from the mean. In order to use a z-score,
the mean μ and also the population standard deviation σ should be known.
A z score helps to calculate the probability of a score occurring within a standard normal
distribution. It also enables us to compare two scores that are from different samples.
A table for the values of ϕ, indicating the values of the cumulative distribution function of the
normal distribution is termed as a z score table.
2.10.2 Formula
z = (x – μ)/σ
where
μ = mean
σ = standard deviation
x = test value
When we have multiple samples and want to describe the standard deviation of those sample means,
we use the following formula:
z = (x – μ)/ (σ/√n)
2.10.3 Interpretation
1. If a z-score is equal to -1, then it denotes an element, which is 1 standard deviation less than the
mean.
2. If a z score is less than 0, then it denotes an element less than the mean.
3. If a z score is greater than 0, then it denotes an element greater than the mean.
4. If the z score is equal to 1, it denotes an element, which is 1 standard deviation greater than the mean;
a z score equal to 2 signifies 2 standard deviations greater than the mean; etc.
Example 1
The test score is 190. The test has a mean of 130 and a standard deviation of 30. Find the z score.
(Assume it is a normal distribution)
Solution:
Mean, μ = 130
Standard deviation, σ = 30
So z = (x – μ)/ σ
= (190 – 130)/ 30
= 60/30
=2
Example 2: You score 1100 for an exam. The mean score for the exam is 1026 and the standard
deviation is 209. How well did you score on the test compared to the average test taker?
Solution:
Mean, μ = 1026
So z = (x – μ)/ σ
= (1100-1026)/209
= 0.354
This means that your score was 0.354 standard deviation above the mean.
A z-score is used in the medical field to find how a certain newborn baby's weight compares to the mean weight of all babies.
It is used to find how a certain shoe size compares to the mean population size.
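Both worked examples above reduce to one line of arithmetic; a minimal sketch:

```python
def z_score(x, mu, sigma):
    # How many standard deviations x lies above (+) or below (-) the mean
    return (x - mu) / sigma

# Example 1: test score 190, mean 130, SD 30
print(z_score(190, 130, 30))                # 2.0
# Example 2: exam score 1100, mean 1026, SD 209
print(round(z_score(1100, 1026, 209), 3))   # 0.354
```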
2.11 CORRELATION
Correlation is a statistical technique to ascertain the association or relationship between two or more variables. Correlation analysis studies the degree and direction of that relationship.
A correlation coefficient is a statistical measure of the degree to which changes to the value of one
variable predict change to the value of another. When the fluctuation of one variable reliably predicts a
similar fluctuation in another variable, there’s often a tendency to think that means that the change in one
causes the change in the other.
A scatter diagram is a diagram that shows the values of two variables X and Y, along with the way in
which these two variables relate to each other. The values of variable X are given along the horizontal
axis, with the values of the variable Y given on the vertical axis.
Later, when the regression model is used, one of the variables is defined as an independent variable, and
the other is defined as a dependent variable. In regression, the independent variable X is considered to
have some effect or influence on the dependent variable Y. Correlation methods are symmetric with
respect to the two variables, with no indication of causation or direction of influence being part of the
statistical consideration. A scatter diagram is given in the following example. The same example is later
used to determine the correlation coefficient.
The scatter plot explains the correlation between the two attributes or variables. It represents how closely
the two variables are connected. There can be three such situations to see the relation between the two
variables
Positive Correlation – when the values of the two variables move in the same direction so that an
increase/decrease in the value of one variable is followed by an increase/decrease in the value of
the other variable.
Negative Correlation – when the values of the two variables move in the opposite direction so that
an increase/decrease in the value of one variable is followed by decrease/increase in the value of
the other variable.
No Correlation – when there is no linear dependence or no relation between the two variables.
FIG 2.9 : Types Of Correlation
Correlation shows the relation between two variables. Correlation coefficient shows the
measure of correlation. To compare two datasets, we use the correlation formulas.
The most common formula is the Pearson correlation coefficient, used for linear dependency between data sets. The value of the coefficient lies between −1 and +1. A coefficient of zero means the data are considered unrelated; a value of +1 indicates a perfect positive correlation, and −1 a perfect negative correlation.
rxy = Sxy / (Sx Sy)
Where Sx and Sy are the sample standard deviations, and Sxy is the sample covariance.
The population correlation coefficient uses σx and σy as the population standard deviations and σxy as the
population covariance.
rxy = σxy / (σx σy)
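The definition rxy = Sxy/(SxSy) translates directly into code; the sketch below uses sample (n − 1) denominators throughout, and the check data are made up for illustration:

```python
import math

def pearson_r(xs, ys):
    # Sample covariance divided by the product of the sample SDs
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

# Perfectly linear data give r = +1; reversing one variable gives r = -1
print(round(pearson_r([10, 20, 30, 40], [2, 4, 6, 8]), 4))  # 1.0
print(round(pearson_r([1, 2, 3], [3, 2, 1]), 4))            # -1.0
```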
The distinction between simple, partial and multiple correlation is based up on the number of
variables studied.
Simple Correlation:
When only two variables are studied, it is a case of simple correlation. For example, when one studies the relationship between the marks secured by a student and the attendance of that student in class, it is a case of simple correlation.
Partial Correlation:
Partial correlation is the measure of association between two variables, while controlling or
adjusting the effect of one or more additional variables.
Multiple Correlation:
When three or more variables are studied, it is a case of multiple correlation. For example, the above study of marks and attendance becomes a case of multiple correlation if additional variables are covered as well.
Depending upon the constancy of the ratio of change between the variables, the correlation may be classified as linear or non-linear.
Linear Correlation:
If the amount of change in one variable bears a constant ratio to the amount of change in the other variable, then the correlation is said to be linear. If such variables are plotted on graph paper, all the plotted points would fall on a straight line. For example, if it is assumed that to produce one unit of finished product we need 10 units of raw material, then to produce 2 units of finished product we need 20 units, and so on:
Raw material (X):      10  20  30  40  50  60
Finished product (Y):   2   4   6   8  10  12
Non-linear Correlation: If the amount of change in one variable does not bear a constant ratio to the
amount of change to the other variable, then correlation is said to be non- linear. If such variables are
plotted on a graph, the points would fall on a curve and not on a straight line. For example, if we double
the amount of advertisement expenditure, then sales volume would not necessarily be doubled.
This method is widely used in practice and the coefficient of correlation is denoted by the symbol "r". If the two variables under study are X and Y, the following formula suggested by Karl Pearson can be used for measuring the degree of correlation:
r = [n ΣXY − (ΣX)(ΣY)] / √{[n ΣX² − (ΣX)²][n ΣY² − (ΣY)²]}
EXAMPLE :
1) Compute the coefficient of correlation between X and Y using the following data.
X: 1 3 5 7 8 10
Y: 8 12 15 17 18 20
Solution:
X     Y     X²     Y²     XY
1     8     1      64     8
3     12    9      144    36
5     15    25     225    75
7     17    49     289    119
8     18    64     324    144
10    20    100    400    200
ΣX = 34, ΣY = 90, ΣX² = 248, ΣY² = 1446, ΣXY = 582  (n = 6)
Coefficient of correlation:
r = [6(582) − (34)(90)] / √{[6(248) − 34²][6(1446) − 90²]}
  = 432 / √(332 × 576) ≈ 432/437.3 ≈ 0.988
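Karl Pearson's computational formula can be cross-checked in code (a sketch; the data are the six (X, Y) pairs of this example):

```python
import math

def pearson_r(x, y):
    # Computational formula using raw sums
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2))

X = [1, 3, 5, 7, 8, 10]
Y = [8, 12, 15, 17, 18, 20]
print(round(pearson_r(X, Y), 3))  # 0.988
```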
2) The marks obtained by the students in Mathematics and Statistics are given below. Find
the correlation Co-efficient between the two subjects.
Marks in Mathematics: 75 35 60 80 53 35 15 40 38 48
Marks in Statistics:  85 45 54 91 58 63 35 43 45 44
Solution:
Let X denote the marks in Mathematics and Y denote the marks in Statistics.
A scatter plot is also called a scatter chart, scattergram, or XY graph. The scatter diagram graphs numerical data pairs, with one variable on each axis, to show their relationship. A natural question, then, is when to use a scatter plot.
Solution:
X-axis or horizontal axis : Number of games
Y-axis or vertical axis : Scores
Now, the scatter graph will be :
2.13 REGRESSION
Regression analysis refers to assessing the relationship between an outcome variable and one or more other variables. The outcome variable is known as the dependent or response variable, and the risk factors and confounders are known as predictors or independent variables. The dependent variable is denoted by "y" and the independent variables by "x" in regression analysis.
For example, a correlation of r = 0.8 indicates a positive and strong association between two variables, while a correlation of r = −0.3 shows a negative and weak association. A correlation near zero indicates the absence of a linear association between the two variables.
Linear regression is a linear approach to modelling the relationship between a scalar response and one or more independent variables. If the regression has one independent variable, then
it is known as a simple linear regression. If it has more than one independent variable, then it is known as
multiple linear regression. Linear regression only focuses on the conditional probability distribution of the
given values rather than the joint probability distribution. In general, all the real world regressions models
involve multiple predictors. So, the term linear regression often describes multivariate linear regression.
FIG 2.11 Correlation VS Regression
Correlation quantifies the degree to which two variables are associated. It does not fit a line through the data points; you compute a correlation coefficient that indicates how much one variable tends to change when the other one does. When r is 0.0, the relationship does not exist. When r is positive, one variable goes up as the other goes up. When r is negative, one variable goes down as the other goes up.
Linear regression finds the best line that predicts y from x, but correlation does not fit a line. Correlation is used when you measure both variables, while linear regression is mostly applied when x is a variable that you manipulate or control.
coefficient. The range of this coefficient lies between -1 to +1. This coefficient shows the strength of the
association of the observed data for two variables.
The regression line is given by:
Y = a + bX
Here we need to find the value of the slope of the line, b, and the intercept, a, for the line plotted in the scatter plot.
The most popular method to fit a regression line in the XY plot is the method of least squares. This process determines the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations of the points from the line. If a point rests on the fitted line exactly, then its deviation is 0. Because the deviations are first squared and then added, their positive and negative values do not cancel.
FIG 2.12 : Least Square Regression Line
Linear regression determines the straight line, called the least-squares regression line or LSRL, that
best expresses observations in a bivariate analysis of data set. Suppose Y is a dependent variable,
and X is an independent variable, then the population regression line is given by;
Y = B0+B1X
Where
B0 is a constant (the intercept) and B1 is the regression coefficient (the slope).
If a random sample of observations is given, then the regression line is expressed by;
ŷ = b0 + b1x
where b0 is a constant, b1 is the regression coefficient, x is the independent variable, and ŷ is the predicted value of the dependent variable.
For the regression line where the regression parameters b0 and b1 are defined, the properties are
given as:
The line reduces the sum of squared differences between observed values and predicted values.
The regression line passes through the mean of X and Y variable values
The regression coefficient (b1) is the slope of the regression line which is equal to the average
change in the dependent variable (Y) for a unit change in the independent variable (X).
In the linear regression line, we have seen the equation is given by;
Y = B0+B1X
Where
B0 is a constant (intercept) and B1 is the regression coefficient (slope).
The value of the regression coefficient is found from the data by:
b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,  b0 = ȳ − b1x̄
EXAMPLE:
1) Obtain the equations of the regression lines from the following data using the method of least squares. Hence find the coefficient of correlation between X and Y. Also estimate the value of Y for a given value of X, and the value of X for a given value of Y.
X: 22 26 29 30 31 33 34 35
Y: 20 20 21 29 27 24 27 31  (M/J 2009)
Solution:
Let u = x − 30 and v = y − 25 (deviations from assumed means).
x     y     u     v     u²    v²    uv
22    20    −8    −5    64    25    40
26    20    −4    −5    16    25    20
29    21    −1    −4    1     16    4
30    29    0     4     0     16    0
31    27    1     2     1     4     2
33    24    3     −1    9     1     −3
34    27    4     2     16    4     8
35    31    5     6     25    36    30
Σ:    x=240 y=199 u=0   v=−1  u²=132 v²=127 uv=101
x     y     x − x̄   y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
1     9     −3      −2      9          4          6
2     8     −2      −3      4          9          6
3     10    −1      −1      1          1          1
4     12    0       1       0          1          0
5     11    1       0       1          0          0
6     13    2       2       4          4          4
7     14    3       3       9          9          9
Σ:    28    77      0       0          28         28         26
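As a cross-check on the tabular method, the least-squares coefficients of the regression line of Y on X for the example data (X: 22 to 35, Y: 20 to 31) can be computed directly (a sketch):

```python
def least_squares(xs, ys):
    # Slope b1 and intercept b0 of the least-squares line y = b0 + b1*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

X = [22, 26, 29, 30, 31, 33, 34, 35]
Y = [20, 20, 21, 29, 27, 24, 27, 31]
b0, b1 = least_squares(X, Y)
print(round(b1, 4), round(b0, 4))  # 0.7652 1.9205
```

With these data, b1 = 101/132 ≈ 0.765 and b0 ≈ 1.92.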
2.16 STANDARD ERROR
The standard error is represented as SE. It is used to measure how accurately a given sample represents its population. Statistics is a vast topic in which we learn about data, sample and population, mean, median, mode, dependent and independent variables, standard deviation, variance, etc. Here you will learn the standard error formula, along with the SE of the mean and of the estimate.
SE = s/√n
where 's' is the standard deviation and n is the number of observations.
The standard error of the mean shows us how the mean changes with different tests, estimating
the same quantity. Thus if the outcome of random variations is notable, then the standard error of
the mean will have a higher value. But, if there is no change observed in the data points after
repeated experiments, then the value of the standard error of the mean will be zero.
2.16.3 Standard Error of Estimate (SEE)
The standard error of the estimate is an estimate of the accuracy of predictions. It is denoted as SEE. The regression line minimizes the sum of squared deviations of prediction, which is also known as the sum of squares error. SEE is the square root of the average squared deviation. The deviation of some estimates from intended values is given by the standard error formula:
SE = s/√n, with s = √[ Σ(xi − x̄)² / (n − 1) ]
where xi stands for the data values, x̄ is the mean value and n is the sample size.
The standard error is an important statistical measure, and it is closely related to the standard deviation. The accuracy with which a sample represents a population is known through this formula: the sample mean deviates from the population mean, and that deviation is measured by the standard error.
EXAMPLE :
Solution:
Mean
Standard deviation = √(154.8/4) = √38.7 = 6.22
Standard Error:
= 6.22/√5
= 6.22/2.236
= 2.782
Step 1: Note the number of measurements (n) and determine the sample mean (x̄). It is the average of all the measurements.
Step 2: Determine how much each measurement deviates from the mean.
Step 3: Square all the deviations determined in step 2 and add them together: Σ(xi − x̄)²
Step 4: Divide the sum from step 3 by one less than the total number of measurements (n − 1).
Step 5: Take the square root of the obtained number; this is the standard deviation (s).
Step 6: Finally, divide the standard deviation by the square root of the number of measurements (n) to obtain the standard error.
Go through the example given above to understand the method of calculating standard error.
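Steps 1-6 map one-to-one onto code (a sketch with made-up measurements):

```python
import math

def standard_error(values):
    n = len(values)                          # step 1: number of measurements
    mean = sum(values) / n                   #         and the sample mean
    deviations = [x - mean for x in values]  # step 2: deviation of each value
    ss = sum(d ** 2 for d in deviations)     # step 3: sum of squared deviations
    variance = ss / (n - 1)                  # step 4: divide by n - 1
    sd = math.sqrt(variance)                 # step 5: standard deviation
    return sd / math.sqrt(n)                 # step 6: SE = s / sqrt(n)

print(round(standard_error([1, 2, 3, 4, 5]), 4))  # 0.7071
```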
2.17 INTERPRETATION OF r2
The coefficient of determination, or R squared method, is the proportion of the variance in the dependent variable that is predictable from the independent variable.
The coefficient of determination is the square of the correlation(r), thus it ranges from 0 to 1.
With linear regression, the coefficient of determination is equal to the square of the correlation
between the x and y variables.
If R2 is equal to 0, then the dependent variable cannot be predicted from the independent
variable.
If R2 is equal to 1, then the dependent variable can be predicted from the independent variable
without any error.
If R2 is between 0 and 1, it indicates the extent to which the dependent variable is predictable. An R2 of 0.10 means that 10 percent of the variance in the y variable is predicted from the x variable; an R2 of 0.20 means that 20 percent of the variance in y is predicted from x, and so on.
Formula 1:
r = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²][n Σy² − (Σy)²]}
Where
n = total number of observations
Σx = total of the first variable values
Σy = total of the second variable values
Σxy = sum of the products of the first and second values
Σx² = sum of the squares of the first values
Σy² = sum of the squares of the second values
Thus, the coefficient of determination = (correlation coefficient)² = r²
Formula 2:
The formula of coefficient of determination is given by:
R2 = 1 – (RSS/TSS)
Where,
R2 = Coefficient of Determination
RSS = Residuals sum of squares
TSS = Total sum of squares
If we want to check how reliable the predictions made from the given data are, we can determine the coefficient of determination.
It also lets us know the strength of the association(linear) between the variables.
As the value of r2 gets close to 1, the values of y move closer to the regression line; as it gets close to 0, the values move away from the regression line.
EXAMPLE:
Find the coefficient of determination for the following data:
X: 2 5 6 7
Y: 2 5 4 3
Solution:
Given data:
X: 2 5 6 7
Y: 2 5 4 3
X     Y     XY     X²     Y²
2     2     4      4      4
5     5     25     25     25
6     4     24     36     16
7     3     21     49     9
ΣX = 20, ΣY = 14, ΣXY = 74, ΣX² = 114, ΣY² = 54
Here n = 4.
r = [4(74) − (20)(14)] / √{[4(114) − 20²][4(54) − 14²]} = 16/√(56 × 20) ≈ 0.478
Coefficient of determination:
R² = (0.478)² = 0.22848
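The computation can be verified in code (a sketch using the same four data pairs); note that squaring the unrounded r gives 256/1120 ≈ 0.2286, while squaring the rounded value 0.478 gives the 0.22848 quoted above:

```python
import math

def r_squared(x, y):
    # Coefficient of determination as the square of Pearson's r
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = math.sqrt((n * sum(a * a for a in x) - sum(x) ** 2)
                    * (n * sum(b * b for b in y) - sum(y) ** 2))
    r = num / den
    return r * r

X = [2, 5, 6, 7]
Y = [2, 5, 4, 3]
print(round(r_squared(X, Y), 4))  # 0.2286
```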
2.18 MULTIPLE REGRESSION EQUATION
Multiple regression analysis is a statistical technique that analyzes the relationship between two
or more variables and uses the information to estimate the value of the dependent variables. In
multiple regression, the objective is to develop a model that describes a dependent variable y to
more than one independent variable.
In linear regression, there is only one independent and dependent variable involved. But, in the
case of multiple regression, there will be a set of independent variables that helps us to explain
better or predict the dependent variable y.
The multiple regression equation is given by
y = a + b1x1 + b2x2 + …… + bkxk
where x1, x2, …, xk are the k independent variables and y is the dependent variable.
2.18.1 Multiple Regression Analysis Definition
Multiple regression analysis permits us to control explicitly for many other factors that simultaneously influence the dependent variable.
The objective of regression analysis is to model the relationship between a dependent variable and
one or more independent variables.
Let k represent the number of independent variables, denoted by x1, x2, x3, ……, xk. Such an equation is useful for predicting the value of y when the values of the x's are known.
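A minimal sketch of fitting such an equation by least squares via the normal equations. The data below are synthetic, generated exactly from y = 2 + 3x1 + 0.5x2, so the fitted coefficients recover a, b1 and b2:

```python
def fit_multiple_regression(X, y):
    # Solve (A^T A) beta = A^T y for beta = [a, b1, ..., bk]
    rows = [[1.0] + list(r) for r in X]   # prepend an intercept column
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    for col in range(k):                  # forward elimination with pivoting
        piv = max(range(col, k), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, k):
            f = ata[r][col] / ata[col][col]
            for c in range(col, k):
                ata[r][c] -= f * ata[col][c]
            aty[r] -= f * aty[col]
    beta = [0.0] * k
    for i in reversed(range(k)):          # back substitution
        beta[i] = (aty[i] - sum(ata[i][j] * beta[j]
                                for j in range(i + 1, k))) / ata[i][i]
    return beta

# Synthetic data: y = 2 + 3*x1 + 0.5*x2 exactly (no noise)
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in X]
coeffs = fit_multiple_regression(X, y)
print([round(b, 4) for b in coeffs])  # [2.0, 3.0, 0.5]
```

In practice a library routine such as numpy.linalg.lstsq would be used instead of hand-rolled elimination; the sketch only makes the normal-equations idea concrete.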
2.18.2 Stepwise Multiple Regression
The Stepwise regression is a step by step process that begins by developing a regression
model with a single predictor variable and adds and deletes predictor variable one step at a
time.
The Stepwise multiple regression is the method to determine a regression equation that begins
with a single independent variable and add independent variables one by one.
The stepwise multiple regression method is also known as the forward selection method
because we begin with no independent variables and add one independent variable to the
regression equation at each of the iterations.
There is another method called backwards elimination method, which begins with an entire set
of variables and eliminates one independent variable at each of the iterations.
Residual: The variation in the dependent variable that is not explained by the regression model is called the residual or error variation. It is also known as random error, or sometimes just "error". It is a random error arising from sampling variability.
2.18.3 Advantages of Stepwise Multiple Regression
Stepwise multiple regression is efficient in finding the regression equation with only significant regression coefficients.
The steps involved in developing the regression equation are clear.
2.18.4 Multivariate Multiple Regression
Mostly, the statistical inference has been kept at the bivariate level.
Inferential statistical tests have also been developed for multivariate analyses, which analyse the relations among more than two variables.
Commonly used extension of correlation analysis for multivariate inferences is multiple
regression analysis.
Multiple regression analysis shows the correlation between each set of independent and
dependent variables.
2.19 REGRESSION TOWARD THE MEAN
Regression to the mean (RTM) is a statistical phenomenon describing how variables that are much higher or lower than the mean are often much closer to the mean when measured a second time.
Regression to the mean is due to natural variation or chance. It can be observed in everyday life,
particularly in research that intentionally focuses on the most extreme cases or events. It is
sometimes also called regression toward the mean.
Regression to the mean often happens when measuring the effects of an intervention.
Example: Measuring the effects of an intervention You are interested in finding out whether
an online, self-paced course can help middle school students address learning gaps in
math. A school in your area agrees to be part of the pilot study.
To find out which students are most in need, you administer a math test to a class of 8th-
grade students. You pick the worst-performing 10% of students, and assign them to the
online course.
When the course is complete, the 10% of students with the worst performance take
another test. Their scores, on average, show improvement. The principal, pleased with the
result, decides to launch the online course for all 8th-grade students who are
underperforming in math.
At the end of the year, these students’ scores are not much better than they were the
previous year. They certainly didn’t improve to the degree you expected based upon the
results of the worst-performing 10% of students.
The problem here is regression to the mean. Among the students who did poorly on the
first test were also students who didn’t perform well due to chance: perhaps they didn’t
sleep well the night before, or they were sick or stressed out. These students were going
to do better on the second test regardless of the intervention (the online program). Thus,
they brought up the average score of the worst-performing 10%.
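The scenario can be simulated: give every student a fixed ability, add independent luck to each test, select the bottom 10% on test 1, and retest with no intervention at all (all numbers below are arbitrary simulation choices):

```python
import random

random.seed(7)
# Fixed ability per student; each test adds independent noise ("luck")
abilities = [random.gauss(70, 10) for _ in range(1000)]
test1 = [a + random.gauss(0, 10) for a in abilities]
test2 = [a + random.gauss(0, 10) for a in abilities]

# Select the worst-performing 10% on the first test
cutoff = sorted(test1)[len(test1) // 10]
worst = [i for i, s in enumerate(test1) if s <= cutoff]

avg1 = sum(test1[i] for i in worst) / len(worst)
avg2 = sum(test2[i] for i in worst) / len(worst)
print(round(avg1, 1), round(avg2, 1))  # the retest average is higher
```

No intervention was applied, yet the selected group's average rises on the retest: their first-test scores included bad luck, which does not repeat on average.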
Relatedly, randomized evaluations are essential in avoiding regression to the mean when
estimating the effects of an intervention.
PART-A Questions
1. Define correlation.
2. Define correlation coefficient.
3. Explain the various applications of correlation.
4. Define multiple correlation.
10. Explain the characteristics and differences between the types of data.
15. Apply graphical representations to visually display data, such as histograms, scatter
plots, and bar charts, to facilitate a better understanding of patterns and trends.
Profit (Rs. in Lakhs): 3 5 4 7 9 8 10 11 12 14
i.Draw a scatter diagram.
ii.Do you think that there is any correlation between profits and capital employed? Is it
positive or negative? Is it high or low?
13. Find Karl Pearson's coefficient of correlation between capital employed and profit obtained from the following data.
Capital Employed (Rs. in Crore): 10 20 30 40 50 60 70 80 90 100
Profit (Rs. in Crore): 2 4 8 5 10 15 14 20 22 50