Statistics and Correlation Notes
In this article, we will learn in detail about the importance of statistics, the role of statistics in everyday life, and common misconceptions related to it.
What is Statistics?
Descriptive Statistics
It involves summarizing and organizing data so that it can be easily understood and valuable information can be drawn from it. The common techniques used for descriptive analysis are the measures of central tendency (mean, median, and mode) and the measures of variability (range, variance, and standard deviation). These techniques are used to extract insights from data.
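As a quick illustration, here is a minimal Python sketch (using a made-up set of scores) that computes these descriptive measures with the standard `statistics` module:

```python
import statistics

# Hypothetical exam scores (made-up data, for illustration only)
scores = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]

# Measures of central tendency
print("Mean:    ", statistics.mean(scores))     # arithmetic average
print("Median:  ", statistics.median(scores))   # middle value of the sorted data
print("Mode:    ", statistics.mode(scores))     # most frequent value

# Measures of variability
print("Range:   ", max(scores) - min(scores))   # spread between extremes
print("Variance:", statistics.variance(scores)) # sample variance
print("Std dev: ", statistics.stdev(scores))    # sample standard deviation
```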
Inferential Statistics
Probability
Using inferential statistics, we can find the probability of an event occurring. Probability is the measure of how likely an event is to occur. We use probability to understand random events: it tells us how likely an event is to occur, or how likely it is not to occur.
Data collection is the main pillar of statistics: it is the stage where we gather the data on which the analysis will be performed. Better data collection leads to better and more accurate results, and a sound data collection and sampling process is what makes reliable and accurate predictions possible. It includes techniques like surveys, experiments, and observational studies.
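As a small sketch of the sampling idea (with a synthetic population and assumed parameters), a simple random sample can be drawn like this:

```python
import random

random.seed(42)

# Hypothetical population of 1,000 household incomes (synthetic data)
population = [random.gauss(mu=50_000, sigma=12_000) for _ in range(1_000)]

# Draw a simple random sample of 50 households without replacement
sample = random.sample(population, k=50)

# The sample mean is used to estimate the (usually unknown) population mean
sample_mean = sum(sample) / len(sample)
print(f"Sample mean income: {sample_mean:,.2f}")
```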
Data Interpretation:
Problem Solving
Predictive Analysis
• Using inferential statistics, we can make predictions about future events and the results associated with them, and based on these results we can plan our future actions accordingly. It is also used in forecasting.
Quality Control
• In industries, statistical methods are used for quality control and improvement
processes, ensuring products and services meet certain standards and
specifications.
Resource Optimization
Risk Assessment
Career Opportunities
Statistics in Healthcare
Statistics has a vital role in the healthcare sector, where it is used in medical research. It helps in understanding the patterns of diseases based on data. Statistics is used at a huge scale in medical science to predict diseases, analyze disease patterns, research medicines, and compare the performance of medicines across different datasets. Medical science works with huge amounts of data, and statistical methods are used to analyze that data and find helpful insights for the purposes mentioned above.
Statistics in Computer Science
Statistics is widely used in the field of computer science and engineering, especially in domains like machine learning and artificial intelligence. Beyond AI and ML, it is also used in fields such as deep learning and computer vision. Statistics is a main pillar of machine learning models, where it helps the model preprocess the data and produce accurate predictions based on the given dataset.
Risk management and assessment involve identifying potential risks, evaluating their
likelihood and impact, and implementing strategies to mitigate or manage them. This
process is fundamental to maintaining organizational stability and achieving objectives.
Key components include risk identification, where potential issues are recognized; risk
analysis, which quantifies the probability and potential impact of these risks using
statistical methods such as probability distributions and regression analysis; and risk
evaluation, which prioritizes risks based on their significance. Techniques such as
Monte Carlo simulations, fault tree analysis, and sensitivity analysis are often used to
model risk scenarios and develop mitigation strategies. By systematically assessing and
managing risks, organizations can minimize negative outcomes and capitalize on
opportunities, ensuring long-term success and resilience.
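As a rough sketch of the Monte Carlo idea mentioned above (with an assumed incident probability and an assumed log-normal loss model, not real figures), a risk simulation could look like this:

```python
import random

random.seed(0)

def simulate_annual_loss():
    """One Monte Carlo trial: an assumed 10% chance of an incident,
    with a log-normally distributed loss if it occurs."""
    if random.random() < 0.10:                          # assumed incident probability
        return random.lognormvariate(mu=10, sigma=0.8)  # assumed loss model
    return 0.0

trials = [simulate_annual_loss() for _ in range(100_000)]

expected_loss = sum(trials) / len(trials)
prob_large_loss = sum(loss > 50_000 for loss in trials) / len(trials)

print(f"Expected annual loss: {expected_loss:,.0f}")
print(f"P(loss > 50,000):     {prob_large_loss:.3%}")
```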
Statistics has a deep impact on society. Data analyses carried out using statistical methods are useful for understanding outcomes and the future course of society.
In public policy, statistics is used extensively for the formulation and evaluation of policies. Policymakers use statistical data to understand and analyze ongoing issues around them. It is with statistical tools that they are able to allocate resources to people efficiently and measure the impact of legislative actions. For example, census data informs decisions on healthcare, education, and infrastructure, ensuring that policies address the needs of a diverse population, and this is only possible because of the statistical data they use. By providing evidence-based insights, statistics helps create fair and effective policies that promote societal well-being.
Social research and surveys are crucial for gathering data on human behavior, attitudes,
and social conditions. These methods employ statistical techniques to analyze survey
responses, providing a comprehensive understanding of societal trends and issues. For
example, surveys on public health can reveal patterns in disease prevalence, guiding
public health interventions. Social research supports the development of theories and
policies aimed at improving social outcomes and addressing inequalities.
The various methods and tools used in statistics are mentioned below:
Descriptive Statistics
The descriptive statistics is used to describe and evaluate the main numerical features
of the data we have. This includes:
Measures of central tendency: These provide the central point of the data set and include the mean, median, and mode. Alongside them, measures of dispersion such as the range, variance, and standard deviation describe how spread out the data is.
These summarize and describe the main features of a dataset, such as measures of
central tendency (mean, median, mode) and measures of variability (standard
deviation, variance).
Inferential Statistics
Inferential statistics is the other branch of statistics, which enables us to make predictions or inferences about a large population based on sample data taken from that population. It includes methods such as hypothesis testing, confidence intervals, and regression analysis.
These methods are used for drawing conclusions that extend beyond the immediate data, allowing for generalized findings and informed predictions.
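As a hedged sketch of one such inferential method, the following computes an approximate 95% confidence interval for a population mean from a hypothetical sample, using the normal approximation (z ≈ 1.96):

```python
import math
import statistics

# Hypothetical sample of 40 measurements drawn from a larger population
sample = [52, 48, 51, 47, 53, 49, 50, 54, 46, 52,
          51, 49, 48, 53, 50, 52, 47, 51, 49, 50,
          55, 48, 52, 50, 49, 51, 53, 47, 50, 52,
          49, 51, 50, 48, 54, 50, 49, 52, 51, 50]

n = len(sample)
mean = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

# Approximate 95% confidence interval (normal approximation)
z = 1.96
ci_low, ci_high = mean - z * std_err, mean + z * std_err

print(f"Sample mean: {mean:.2f}")
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```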
Like everything else, statistics also has its disadvantages if it is not used properly. Using statistical values without proper knowledge can lead to serious mistakes and completely different, wrong results.
Common Misconceptions
Correlation Implies Causation:
• Misconception: Assuming that because two variables are correlated, one causes the other.
• Example: Ice cream sales and drowning incidents may both rise during the
summer, but increased ice cream sales do not cause drownings; warmer
weather is a common factor.
Small Sample Sizes:
• Misconception: Assuming that a small sample leads to better results and easier calculation.
Overgeneralization:
• Misconception: Making a conclusion for all the data based on a very small
sample size.
• Reality: Results should be interpreted within their context; conclusions about the complete data require a sufficiently large and representative sample.
• Example: A study conducted on college students may not apply to the general
adult population.
Survivorship Bias:
• Reality: This bias can lead to inaccurate conclusions because it doesn’t consider
the full dataset.
• Reliability: Ensure the data comes from a credible and unbiased source.
• Sample Size: Verify that the sample size is sufficiently large to provide insights.
• Sampling Method: Ensure that the sampling method used is appropriate and
unbiased.
• Chart Types: Consider whether the chart type accurately represents the data. Pie
charts, for instance, can be misleading with too many categories.
• Logical Inferences: Check whether the stated results follow logically from the data.
• Evidence Support: Ensure the data supports the conclusions drawn and that
alternative explanations have been considered.
In statistics, internal data comes from within an organization, while external data comes
from outside the organization:
• Internal data
Data that is generated and used within a company or organization. This data can come
from areas such as operations, maintenance, personnel, and finance. Examples of
internal data include expense reports, cash flow reports, production reports, and
budget variance analysis. Internal data is usually stored in spreadsheets, databases, or
customer relationship management (CRM) systems.
• External data
Data that is collected outside an organization from areas like press releases, statistics
departments, government databases, and market research. Examples of external data
include market research reports, social media data, and government data.
Internal data is free to the company, and it can be very relevant and telling. External data
can be purchased from third-party providers or gathered from publicly available
sources.
In statistics, **internal** and **external sources of data** refer to the origin from where
data is collected for analysis.
These are data collected from within the organization or system that is being studied.
Internal sources tend to be more specific and relevant to the particular needs of the
entity collecting the data. Examples include:
These are data obtained from sources outside the organization. External data is often
used to complement or enrich internal data. Examples include:
5. **Public databases** – Open-source data from organizations like the World Bank,
WHO, or Eurostat.
6. **Social media and web data** – Data from online platforms, user behavior, and
engagement metrics.
Both internal and external sources play critical roles in forming a comprehensive
dataset for statistical analysis. Internal data provides direct insights from within the
organization, while external data helps in benchmarking and understanding broader
trends.
Frequency distribution is a method of organizing and summarizing data to show the
frequency (count) of each possible outcome of a dataset. It is an essential tool in
statistics for understanding the distribution and pattern of data. There are several types
of frequency distributions used based on the nature of the data and the analysis
required.
It is not always possible for an investigator to easily measure the items of a series or set
of data. To make the data simple and easy to read and analyze, the items of the series
are placed within a range of values or limits. In other words, the given raw set of data is
categorized into different classes with a range, known as Class Intervals. Every item of
the given series is put against a class interval with the help of tally bars. The number of
items occurring in the specific range or class interval is shown under Frequency against
that particular class range to which the item belongs.
The marks of a class of 20 students are 11, 27, 18, 14, 28, 18, 2, 22, 11, 24, 22, 11, 8, 20,
25, 28, 30, 12, 11, 8. Prepare a frequency distribution table for the same.
Solution:
The marks range from 2 to 30. Let us take class intervals 0-5, 5-10, 10-15, 15-20, 20-25, and 25-30.
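The tallying can also be done programmatically. Below is a small sketch that reproduces the class-interval counts for these marks, treating the intervals as exclusive (lower limit included, upper limit excluded) and keeping the overall maximum (30) in the last class:

```python
# Marks of the 20 students from the example above
marks = [11, 27, 18, 14, 28, 18, 2, 22, 11, 24,
         22, 11, 8, 20, 25, 28, 30, 12, 11, 8]

# Class intervals 0-5, 5-10, ..., 25-30
intervals = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 25), (25, 30)]

frequency = {}
for low, high in intervals:
    last = (low, high) == intervals[-1]
    # lower limit included, upper excluded; the last class also keeps its upper limit
    count = sum(1 for m in marks if low <= m < high or (last and m == high))
    frequency[f"{low}-{high}"] = count

for interval, count in frequency.items():
    print(f"{interval:>6}: {count}")

print("Total :", sum(frequency.values()))   # 20, matching the class size
```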
1. Exclusive Series
2. Inclusive Series
3. Open End Series
1. Exclusive Series
The series with class intervals, in which all the items having the range from the lower
limit to the value just below its upper limit are included, is known as the Exclusive
Series. This kind of frequency distribution is known as an exclusive series because the
frequencies corresponding to the specific class interval do not include the value of its
upper limit. For example, if a class interval is 0-10, and the values of the given series
are 4, 10, 2, 15, 8, and 9, then only 4, 2, 8, and 9 will be included in the 0-10 class
interval. 10 and 15 will be included in the next class interval, i.e., 10-20. Also, the upper
limit of a class interval is the lower limit of the next class interval.
From the above table of exclusive series, it can be seen that the upper limit of the first class interval is the lower limit of the second class interval, and so on. Also, as
discussed above, if the data includes a value 10, it will be included in the class interval
10-20, not in 0-10.
2. Inclusive Series
The series with class intervals, in which all the items having the range from the lower
limit up to the upper limit are included, is known as Inclusive Series. Unlike an exclusive series, the upper limit of one class interval does not repeat itself as the lower limit of the
next class interval. Therefore, there is a gap (between 0.1 to 1) between the upper-class
limit of one class interval and the lower limit of the next class interval. For
example, class intervals of an inclusive series can be, 0-9, 10-19, 20-29, 30-39, and so
on. In this case, the gap between the upper limit of one class interval and the lower limit
of the next class interval is 1, and the class intervals do not overlap with each other like
in an exclusive series.
Sometimes it gets difficult to perform statistical analysis with inclusive series. In those
cases, the inclusive series is converted into an exclusive series.
From the above table of inclusive series, it can be seen that the upper limit of one class
interval (say, 9 of interval 0-9) is not the same as the lower limit of the next class interval
(10 of interval 10-19). Also, all the values that come under 0-9, including 0 and 9 are
included in the frequency against 0-9.
The steps for converting an inclusive series into exclusive series are:
• In the first step, calculate the difference between the upper class limit of one class interval and the lower limit of the next class interval.
• The next step is to divide the difference by two and then add the resulting value to
the upper limit of every class interval and subtract it from the lower limit of every
class interval.
Example:
The inclusive series of the above example is converted into an exclusive series as under.
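Since the converted table is not reproduced here, the following minimal sketch applies the two conversion steps to assumed inclusive intervals 0-9, 10-19, 20-29, 30-39:

```python
# Assumed inclusive class intervals (as in the example above)
inclusive = [(0, 9), (10, 19), (20, 29), (30, 39)]

# Step 1: difference between the upper limit of one class and the
#         lower limit of the next class (here 10 - 9 = 1)
gap = inclusive[1][0] - inclusive[0][1]

# Step 2: halve the gap, subtract it from every lower limit and add it to every upper limit
adjust = gap / 2
exclusive = [(low - adjust, high + adjust) for low, high in inclusive]

for (il, ih), (el, eh) in zip(inclusive, exclusive):
    print(f"{il}-{ih}  ->  {el}-{eh}")
# 0-9 -> -0.5-9.5, 10-19 -> 9.5-19.5, and so on
```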
• In Inclusive Series, the upper limit of one class interval is not the same as the
lower limit of the next class interval. There is a gap ranging from 0.1 to 1.0
between the upper class limit of one class interval and the lower class limit of
the next class interval. However, in the Exclusive Series, the upper limit of one
class interval is the same as the lower limit of the next class interval.
• In the case of Inclusive Series, the value of the upper and the lower limit are
included in that class interval only. However, in the case of Exclusive Series, the
value of upper limit of a class interval is not included in that interval, instead, it is
included in the next class interval.
Sometimes the lower limit of the first class interval and the upper limit of the last class interval of a series are not available; instead, Less than or Below is mentioned in the former case (in place of the lower limit of the first class interval), and More than or Above is mentioned in the latter case (in place of the upper limit of the last class interval). These types of series are known as Open End Series.
A series whose frequencies are continuously added corresponding to the class intervals
is known as Cumulative Frequency Series.
A simple frequency series can be converted into a cumulative frequency series. There
are two ways through which it can be done. These are as follows:
• Expressing the cumulative frequencies on the basis of the upper limits of the class intervals. For example, expressing 10-20, 20-30, and 30-40 as Less than 20, Less than 30, and Less than 40.
• Expressing the cumulative frequencies on the basis of the lower limits of the class intervals. For example, expressing 10-20, 20-30, and 30-40 as More than 10, More than 20, and More than 30. (Both ways are sketched in the code below.)
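As a minimal sketch with an assumed simple frequency series, both kinds of cumulative frequencies can be built as follows:

```python
# Assumed simple frequency series: class intervals with frequencies
intervals   = [(10, 20), (20, 30), (30, 40)]
frequencies = [5, 8, 12]

# "Less than" cumulative frequencies (based on the upper limits)
running = 0
for (low, high), f in zip(intervals, frequencies):
    running += f
    print(f"Less than {high}: {running}")

# "More than" cumulative frequencies (based on the lower limits)
remaining = sum(frequencies)
for (low, high), f in zip(intervals, frequencies):
    print(f"More than {low}: {remaining}")
    remaining -= f
```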
Convert the following simple frequency series into a cumulative frequency series using
both ways.
Solution:
To attain the frequency against a specific class interval of a cumulative frequency series,
it can be converted into a simple frequency series.
Example:
The series in which, instead of class intervals, their mid-values are given with the
corresponding frequencies, is known as Mid-Value Frequency Series.
The steps to convert a mid-value frequency series into a simple frequency series are as
follows:
• The first step is to determine the mutual difference between the mid-values.
• The second step is to divide this mutual difference by two.
• The last step of the conversion is to subtract the resulting figure of the second step from each mid-value to get the lower limit of the class interval, and add it to each mid-value to get the upper limit.
If m is the mid-value of a class and i is the mutual difference between mid-values, then:
Lower limit = m − (i/2), Upper limit = m + (i/2)
Convert the following Mid-Value Frequency Series into Simple Frequency Series.
Solution:
Calculation:
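The original mid-values for this example are not shown above, so the following sketch uses assumed mid-values and frequencies to illustrate the conversion:

```python
# Assumed mid-value frequency series (mid-values and their frequencies)
mid_values  = [5, 15, 25, 35, 45]
frequencies = [4, 7, 10, 6, 3]

# Step 1: mutual difference between the mid-values
i = mid_values[1] - mid_values[0]   # here 10

# Step 2: half of this difference
half = i / 2

# Step 3: subtract/add the half-difference to get the class limits
for m, f in zip(mid_values, frequencies):
    lower, upper = m - half, m + half
    print(f"{lower:>5} - {upper:<5}  frequency: {f}")
# 0.0 - 10.0, 10.0 - 20.0, ... with the original frequencies
```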
When the classes of a series are of the same interval, it is known as an Equal Class Interval Series.
When the classes of a series are of unequal intervals, it is known as an Unequal Class Interval Series.
Simple Arithmetic Mean gives equal importance to all the variables in a series. However,
in some situations, a greater emphasis is given to one item and less to others, i.e.,
ranking of the variables is done according to their significance in that situation. For
example, during inflation, the price of everything in an economy tends to rise, but
households pay more importance to the rise in the price of necessary food items rather
than the rise in the price of clothes. In other words, more significance is given to the
price of food and less to the price of clothes. This is when Weighted Arithmetic
Mean comes into the picture.
When every item in a series is assigned some weight according to its significance, the
average of such series is called Weighted Arithmetic Mean.
Here, weight stands for the relative importance of the different variables. In simple
words, the Weighted Arithmetic Mean is the mean of weighted items and is also known
as the Weighted Average Mean.
Weighted Arithmetic Mean is calculated as the weighted sum of the items divided by the sum of the weights, i.e., Weighted Mean = ∑XW / ∑W.
• Step-1: All the items (X) in a series are weighted according to their significance. Weights are denoted as 'W'.
• Step-2: Add up all the values of weights 'W' to get the sum total of weights, i.e., ∑W = W1 + W2 + W3 + ... + Wn
• Step-3: Items (X) are multiplied by the corresponding weights (W) to get 'XW'.
• Step-4: Add up all the values of 'XW' to get the sum total of the product 'XW', i.e., ∑XW = X1W1 + X2W2 + X3W3 + ... + XnWn
• Step-5: To get the weighted mean, divide the weighted sum of the items '∑XW' by the sum of weights '∑W'.
Example: Calculate the Weighted Arithmetic Mean from the following data.

| Items (X)  | 5 | 10 | 25 | 20 | 25 | 30 |
|------------|---|----|----|----|----|----|
| Weight (W) | 8 | 4  | 5  | 10 | 7  | 6  |

Solution:

| X  | W       | XW        |
|----|---------|-----------|
| 5  | 8       | 40        |
| 10 | 4       | 40        |
| 25 | 5       | 125       |
| 20 | 10      | 200       |
| 25 | 7       | 175       |
| 30 | 6       | 180       |
|    | ∑W = 40 | ∑XW = 760 |
Weighted Mean = ∑XW / ∑W
= 760/40
= 19
Explanation:
1. Multiply each item by its corresponding weight to get XW, i.e.,
XW = 40, 40, 125, 200, 175, 180
2. Add up all the values of weight to get the sum of weights, i.e.,
∑W = 8 + 4 + 5 + 10 + 7 + 6 = 40
3. Add up all the values of the product of weight and items (XW) to get the sum of the product, i.e.,
∑XW = 40 + 40 + 125 + 200 + 175 + 180 = 760
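The same calculation can be checked with a short Python sketch using the items and weights from the example above:

```python
# Items and weights from the worked example above
items   = [5, 10, 25, 20, 25, 30]
weights = [8,  4,  5, 10,  7,  6]

# Weighted mean = sum(X * W) / sum(W)
sum_xw = sum(x * w for x, w in zip(items, weights))
sum_w  = sum(weights)
weighted_mean = sum_xw / sum_w

print("Sum of XW:    ", sum_xw)          # 760
print("Sum of W:     ", sum_w)           # 40
print("Weighted mean:", weighted_mean)   # 19.0
```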
Mean, Median, and Mode are measures of central tendency. These values are used to describe various parameters of a given data set. The measures of central tendency (mean, median, and mode) give useful insights about the data being studied; they are used to study any type of data, such as the average salary of employees in an organization, the median age of a class, or the number of people who play cricket in a sports club.
Measure of central tendency is the representation of various values of the given data
set. There are various measures of central tendency and the most important three
measures of central tendency are:
• Mean
• Median
• Mode
Mean, median, and mode are measures of central tendency used in statistics to
summarize a set of data.
Mean (x̅ or μ): The mean, or arithmetic average, is calculated by summing all the values
in a dataset and dividing by the total number of values. It’s sensitive to outliers and is
commonly used when the data is symmetrically distributed.
Median (M): The median is the middle value when the dataset is arranged in ascending
or descending order. If there’s an even number of values, it’s the average of the two
middle values. The median is robust to outliers and is often used when the data is
skewed.
Mode (Z): The mode is the value that occurs most frequently in the dataset. Unlike the
mean and median, the mode can be applied to both numerical and categorical data. It’s
useful for identifying the most common value in a dataset.
What is Mean?
Mean is the sum of all the values in the data set divided by the number of values in the
data set. It is also called the Arithmetic Average. Mean is denoted as x̅ and is read as x
bar.
Mean Symbol
The symbol used to represent the mean, or arithmetic average, of a dataset is typically
the Greek letter “μ” (mu) when referring to the population mean, and “x̄” (x-bar) when
referring to the sample mean.
These symbols are commonly used in statistical notation to represent the average
value of a set of data points.
Mean Formula
If x1, x2, x3,……, xn are the values of a data set then the mean is calculated as:
x̅ = (x1 + x2 + x3 + . . . + xn) / n
Example: Find the mean of the data set 10, 30, 40, 20, and 50.
Solution:
Mean = (10 + 30 + 40 + 20 + 50) / 5 = 150 / 5 = 30
Mean for grouped data can be calculated using various methods. The most common methods are given in the table below:

| Direct Method | Assumed Mean Method | Step Deviation Method |
|---------------|---------------------|-----------------------|
| x̅ = ∑fixi / ∑fi | x̅ = a + (∑fidi / ∑fi) | x̅ = a + h(∑fiui / ∑fi) |
| where ∑fi is the sum of all frequencies | where a is the assumed mean, di = xi − a, and ∑fi is the sum of all frequencies | where a is the assumed mean, ui = (xi − a)/h, h is the class size, and ∑fi is the sum of all frequencies |
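As a quick numerical check of these formulas (with assumed mid-points, frequencies, assumed mean a, and class size h), the direct and step deviation methods give the same result:

```python
# Assumed grouped data: class mid-points (xi) and frequencies (fi)
mid_points  = [5, 15, 25, 35, 45]
frequencies = [6, 10, 14, 8, 2]

# Direct method: mean = sum(fi * xi) / sum(fi)
direct = sum(f * x for f, x in zip(frequencies, mid_points)) / sum(frequencies)

# Step deviation method: ui = (xi - a) / h, mean = a + h * sum(fi * ui) / sum(fi)
a, h = 25, 10                                   # assumed mean and class size
u = [(x - a) / h for x in mid_points]
step_dev = a + h * sum(f * ui for f, ui in zip(frequencies, u)) / sum(frequencies)

print(direct, step_dev)   # both print 22.5
```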
What is Median?
A Median is a middle value for sorted data. The sorting of the data can be done either in
ascending order or descending order. A median divides the data into two equal halves.
Median Symbol
The letter “M” is commonly used to represent the median of a dataset, whether it’s for a
population or a sample. This notation simplifies the representation of statistical
concepts and calculations, making it easier to understand and apply in various
contexts. Therefore, in Indian statistical practice, “M” is widely accepted and
understood as the symbol for the median.
Median Formula
If the number of values (n) in the data set is odd, then the median is:
Median = ((n + 1)/2)th term
If the number of values (n) in the data set is even, then the median is:
Median = [(n/2)th term + ((n/2) + 1)th term] / 2
Example: Find the median of given data set 30, 40, 10, 20, and 50.
Solution:
Step 1: Arrange the data set in ascending order: 10, 20, 30, 40, 50.
Step 2: Check whether n (the number of terms of the data set) is even or odd and find the median using the respective formula.
Here n = 5 (odd), so Median = ((5 + 1)/2)th term = 3rd term
= 30
The median of grouped data is calculated using the formula:
Median = l + ((n/2 − cf) / f) × h
where
• l is the lower limit of the median class
• n is the number of observations
• cf is the cumulative frequency of the class preceding the median class
• f is the frequency of the median class
• h is the class size
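As an illustrative sketch with assumed class intervals and frequencies, this grouped-data median formula can be applied as follows:

```python
# Assumed grouped data: exclusive class intervals and their frequencies
intervals   = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [5, 8, 12, 9, 6]

n = sum(frequencies)   # number of observations (40)
half = n / 2           # 20

cum = 0
for (low, high), f in zip(intervals, frequencies):
    if cum + f >= half:                                   # this is the median class
        median = low + ((half - cum) / f) * (high - low)  # l + ((n/2 - cf)/f) * h
        break
    cum += f

print("Median:", median)   # 20 + (20 - 13)/12 * 10 ≈ 25.83
```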
What is Mode?
A mode is the most frequent value or item of the data set. A data set can generally have
one or more than one mode value. If the data set has one mode then it is called “Uni-
modal”. Similarly, If the data set contains 2 modes then it is called “Bimodal” and if the
data set contains 3 modes then it is known as “Trimodal”. If the data set consists of
more than one mode then it is known as “multi-modal”(can be bimodal or trimodal).
There is no mode for a data set if every number appears only once.
Symbol of Mode
In statistical notation, the symbol “Z” is commonly used to represent the mode of a
dataset. It indicates the value or values that occur most frequently within the dataset.
This symbol is widely utilised in statistical discourse to signify the mode, enhancing
clarity and precision in statistical discussions and analyses.
Mode Formula
For grouped data, the mode is calculated as:
Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × h
where,
• l is the lower limit of the modal class
• f1 is the frequency of the modal class
• f0 is the frequency of the class preceding the modal class
• f2 is the frequency of the class succeeding the modal class
• h is the class size
For any group of data, the relation between the three central tendencies mean, median, and mode is given by:
Mode = 3 Median − 2 Mean
Mean, Median and Mode: Another name for this relationship is an empirical
relationship. When we know the other two measures for a given set of data, this is used
to find one of the measures. The LHS and RHS can be switched to rewrite this
relationship in various ways.
What is Range?
In a given data set the difference between the largest value and the smallest value of the
data set is called the range of data set. For example, if height(in cm) of 10 students in a
class are given in ascending order, 160, 161, 167, 169, 170, 172, 174, 175, 177, and 181
respectively. Then range of data set is (181 – 160) = 21 cm.
Range of Data
Range is the difference between the highest value and the lowest value. It is a way to understand how the numbers in a data set are spread out.
Range Formula
The formula to find the Range is:
Range = Highest Value − Lowest Value
Example: Find the range of the given data set 12, 19, 6, 2, 15, 4.
Solution:
Here,
Lowest Value = 2
Highest Value = 19
Range = 19 − 2 = 17
| Feature | Mean | Median | Mode |
|---------|------|--------|------|
| Sensitivity | Mean is sensitive to outliers. | Median is not sensitive to outliers. | Mode is not sensitive to outliers. |
| Calculation | Calculated by adding up all values of a dataset and dividing them by the total number of values in the dataset. | Calculated by finding the middle value in a sorted list of data. | Calculated by finding which value occurs the most number of times in a dataset. |
| Representation | Value of mean may or may not be in the dataset. | Value of median is always a value from the dataset. | Value of mode is also always a value from the dataset. |
For example, consider the salaries of 9 employees, where 8 employees earn 20000 each and one earns 35000.
Mean = (20000 + 20000 + 20000 + 20000 + 20000 + 20000 + 20000 + 20000 + 35000)/9 = 195000/9 = 21666.67
For median, in ascending order: 20000, 20000, 20000, 20000, 20000, 20000, 20000,
20000, 35000.
n = 9,
Thus, (9 + 1)/2 = 5
Median = 20000
Mode = 20,000.
In our daily life, we come across various instances where we have to use the concepts of mean, median, and mode. There are various applications of mean, median, and mode; here's how they link to real life:
• Mode: Mode represents the most frequently occurring value in a dataset and is
used in scenarios where identifying the most common value is important. For
example, in manufacturing, the mode may be used to identify the most common
defect in a production line to prioritize quality control efforts
Question 1: Study the bar graph given below and find the mean, median, and mode
of the given data set.
Solution:
The values shown in the bar graph are 5, 7, 9, and 6.
Mean = (5 + 7 + 9 + 6) / 4
= 27 / 4
= 6.75
Order the given data in ascending order: 5, 6, 7, 9
Median = (6 + 7) / 2
= 6.5
Since every value occurs only once, there is no mode for this data set.
Range = 9 – 5
= 4
Question 2: Find the mean, median, mode, and range for the given data
190, 153, 168, 179, 194, 153, 165, 187, 190, 170, 165, 189, 185, 153, 147, 161, 127, 180
Solution:
For Mean:
190, 153, 168, 179, 194, 153, 165, 187, 190, 170, 165, 189, 185, 153, 147, 161, 127, 180
Number of observations = 18
Mean = (190 + 153 + 168 + 179 + 194 + 153 + 165 + 187 + 190 + 170 + 165 + 189 + 185 + 153 + 147 + 161 + 127 + 180) / 18
= 3056 / 18
≈ 169.78
For Median:
Arranging the data in ascending order: 127, 147, 153, 153, 153, 161, 165, 165, 168, 170, 179, 180, 185, 187, 189, 190, 190, 194
Here, n = 18 (even), so Median = (9th term + 10th term) / 2 = (168 + 170) / 2 = 169
For Mode:
153 occurs most often (3 times), thus Mode = 153
For Range:
Range = Highest Value − Lowest Value = 194 − 127 = 67
Question 3: Find the Median of the data 25, 12, 5, 24, 15, 22, 23, 25
Solution:
Step 1: Arrange the data set in ascending order: 5, 12, 15, 22, 23, 24, 25, 25.
Step 2: Check whether n (the number of terms of the data set) is even or odd and find the median using the respective formula.
Here n = 8 (even), so Median = (4th term + 5th term) / 2
= (22 + 23) / 2
= 22.5
Question 4: Find the mode of given data 15, 42, 65, 65, 95.
Solution:
Mode = 65
Question 2: Find the median of the following data set: 12, 15, 20, 9, 17, 25, 10.
Question 3: A survey collected the number of books read by a group of 10 people last
year: 5, 7, 6, 5, 9, 7, 8, 5, 10, 6. What is the mode of the data set?
Question 4: In a classroom, the scores (out of 100) for a test are: 56, 78, 67, 45, 56, 90,
56, 67, 78, 82. Find the mean, median, and mode of the scores.
Question 5: In a skewed distribution the mean of the data is 40 and median of the data
is 35. Calculate the mode of the data set.
Conclusion
Mean, Median and Mode are essential statistical measures of central tendency that
provide different perspectives on data sets. The mean provides a general average,
making it useful for evenly distributed data. The median gives a middle value, providing
a better view of central tendency when dealing with skewed distributions or extreme
values and, the mode highlights the most frequent value, making it valuable in
categorical data analysis.
The previous statistical approaches (such as central tendency and dispersion) are limited to analysing a single variable at a time. This type of statistical analysis, in which one variable is involved, is known as Univariate Distribution.
However, there are instances in real-world situations where distributions have two
variables like data related to income and expenditure, prices and demand, height and
weight, etc. The distribution with two variables is referred to as Bivariate Distribution. It
is necessary to uncover relationships between two or more statistical
series. Correlation is a statistical technique for determining the relationship between
two variables.
According to L.R. Connor, “If two or more quantities vary in sympathy so that
movements in one tend to be accompanied by corresponding movements in others,
then they are said to be correlated.”
Table of Content
• What is Correlation?
• Significance of Correlation
• Types of Correlation
o 1. Positive Correlation:
o 2. Negative Correlation:
o 1. Linear Correlation:
o 1. Simple Correlation:
o 2. Partial Correlation:
o 3. Multiple Correlation:
• Degree of Correlation
o 1. Perfect Correlation:
o 2. Zero Correlation:
What is Correlation?
A statistical tool that helps in the study of the relationship between two variables is
known as Correlation. It also helps in understanding the economic behaviour of the
variables.
Two variables are said to be correlated if a change in one is accompanied by a corresponding change in the other variable. For example, a change in the price of a commodity leads to a change in the quantity demanded, an increase in employment levels increases output, and when income increases, consumption increases as well.
The degree of correlation between various statistical series is the main subject of
analysis in such circumstances.
The degree of correlation between two or more variables can be determined using
correlation. However, it does not consider the cause-and-effect relationship between
variables. If two variables are correlated, it could be for any of the following reasons:
1. Third-Party Influence:
The influence of a third party can result in a high degree of correlation between the two
variables. This analysis does not take into account third-party influence. For
example, the correlation between the yield per acre of grain and jute can be of a high
degree because both are linked to the amount of rainfall. However, in reality, both these
variables do not have any effect on each other.
2. Mutual Dependence:
It may be challenging to determine which is the cause and which is the effect when two variables indicate a high degree of correlation, because they may be influencing one another. For example, when there is an increase in the price of a
commodity, it increases its demand. Here, the price is the cause, and demand is the
effect. However, there is a possibility that the price of the commodity will rise due to
increased demand (population growth or other factors). In that case, increased demand
is the cause, and the price is the effect.
3. Pure Chance:
It is possible that the correlation between the two variables was obtained by random
chance or coincidence alone. This correlation is also known as spurious. Therefore, it is
crucial to determine whether there is a possibility of a relationship between the
variables under analysis. For example, even if there is no relationship between the two
variables (between the income of people in a society and their clothes size), one may
see a strong correlation between them.
So, it can be said that correlation provides only a quantitative measure and does not indicate a cause-and-effect relationship between the variables. For that reason, it must be ensured that variables are correctly selected for the correlation analysis.
Significance of Correlation
1. It helps determine the degree of correlation between the two variables in a single
figure.
3. When two variables are correlated, the value of one variable can be estimated
using the value of the other. This is performed with the regression coefficients.
4. In the business world, correlation helps in making decisions. Correlation helps in making predictions, which reduces uncertainty, because predictions based on correlation are likely to be reliable and close to reality.
Types of Correlation
Based on the direction of change in the value of two variables, correlation can be
classified as:
1. Positive Correlation:
When two variables move in the same direction; i.e., when one increases the other also
increases and vice-versa, then such a relation is called a Positive Correlation. For
example, Relationship between the price and supply, income and expenditure, height
and weight, etc.
2. Negative Correlation:
When two variables move in opposite directions; i.e., when one increases the other
decreases, and vice-versa, then such a relation is called a Negative Correlation. For
example, the relationship between the price and demand, temperature and sale of
woollen garments, etc.
Based on the ratio of variations between the variables, correlation can be classified
as:
1. Linear Correlation:
When there is a constant change in the amount of one variable due to a change in
another variable, it is known as Linear Correlation. This term is used when two
variables change in the same ratio. If two variables that change in a fixed proportion are
displayed on graph paper, a straight- line will be used to represent the relationship
between them. As a result, it suggests a linear relationship.
In the above graph, for every change in the variable X by 5 units there is a change of 10
units in variable Y. The ratio of change of variables X and Y in the above schedule is 1:2
and it remains the same, thus there is a linear relationship between the variables.
2. Non-Linear Correlation:
When there is no constant change in the amount of one variable due to a change in
another variable, it is known as a Non-Linear Correlation. This term is used when two
variables do not change in the same ratio. This shows that it does not form a straight-
line relationship. For example, the production of grains would not necessarily increase
even if the use of fertilizers is doubled.
In the above schedule, there is no specific relationship between the variables. Even
though both change in the same direction i.e. both are increasing, they change in
different proportions. The ratio of change of variables X and Y in the above schedule is
not the same, thus there is a non-linear relationship between the variables.
Based on the number of variables studied, correlation can be classified as:
1. Simple Correlation:
Simple correlation implies the study between the two variables only. For example, the
relationship between price and demand, and the relationship between price and money
supply.
2. Partial Correlation:
Partial correlation implies the study between the two variables keeping other variables
constant. For example, the production of wheat depends upon various factors like
rainfall, quality of manure, seeds, etc. But, if one studies the relationship between
wheat and the quality of seeds, keeping rainfall and manure constant, then it is a partial
correlation.
3. Multiple Correlation:
Multiple correlation implies the study of three or more variables simultaneously. The entire set of independent and dependent variables is studied together. For example, the relationship of wheat output with the quality of seeds and rainfall.
Degree of Correlation
The degree of correlation is measured through the coefficient of correlation. The degree
of correlation for the given variables can be expressed in the following ways:
1. Perfect Correlation:
If the relationship between the two variables is in such a way that it varies in
equal proportion (increase or decrease) it is said to be perfectly correlated. This can be
of two types:
2. Zero Correlation:
If a change in one variable has no effect on the other variable, then the variables are said to have zero correlation. Between perfect correlation and the absence of correlation there is also a situation with a limited degree of correlation; in real life, correlation is usually found to be of a limited degree.
• Correlation is limited negative when there are unequal changes in the opposite
direction.
• Correlation is limited and positive when there are unequal changes in the same
direction.
• The degree of correlation can be low (when the coefficient of correlation lies
between 0 and 0.25), moderate (when the coefficient of correlation lies between
0.25 and 0.75), or high (when the coefficient of correlation lies between 0.75 and
1).
Coefficient of Correlation:
The coefficient of correlation (\( r \)), given by Karl Pearson, measures the degree and direction of the linear relationship between two variables. Its value always lies between −1 and +1.
- **Values of \( r \)**:
  - **\( r = +1 \)**: Perfect positive correlation (as one variable increases, the other increases proportionally).
  - **\( r = -1 \)**: Perfect negative correlation (as one variable increases, the other decreases proportionally).
  - **\( r = 0 \)**: No linear correlation between the variables.
1. **Positive Correlation**: If \( r > 0 \), it indicates that as one variable increases, the
other tends to increase as well. For example, height and weight typically have a positive
correlation.
2. **Negative Correlation**: If \( r < 0 \), it indicates that as one variable increases, the
other tends to decrease. For example, as the price of a product increases, the demand
for it may decrease.
\[
r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}
\]
Where:
- \( n \) = number of paired observations
- \( \sum xy \) = sum of the products of the paired \( x \) and \( y \) values
- \( \sum x \) and \( \sum y \) = sums of the \( x \) and \( y \) values
- \( \sum x^2 \) and \( \sum y^2 \) = sums of squared \( x \) and \( y \) values
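As a small numerical sketch with made-up paired data (say, hours studied and exam scores), the formula above can be applied directly:

```python
import math

# Made-up paired observations: hours studied (x) and exam scores (y)
x = [2, 4, 6, 8, 10]
y = [50, 60, 65, 75, 85]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Karl Pearson's formula, as written above
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(f"r = {r:.3f}")   # close to +1, i.e. a strong positive correlation
```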
### Interpretation:
### Importance:
- The coefficient of correlation helps in predicting the behavior of one variable based on
the other.
- It is widely used in fields like economics, finance, biology, and social sciences to study
relationships between variables. For instance, correlation can help in determining how
GDP growth impacts employment rates or how study hours affect exam scores.
Regression analysis
- Example: Predicting future stock prices based on historical prices and market
conditions, or estimating the growth of a country's GDP based on factors like
investment levels and education.
3. **Causal Inference**:
- Example: Analyzing how increased advertising leads to higher sales or whether higher
education levels reduce poverty rates.
- Regression analysis is useful for identifying trends over time. Time series regression
models can be used to understand how variables evolve over time and predict future
values.
- Regression models can help optimize processes by identifying key factors that
influence performance or outcomes. This can support data-driven decision-making.
6. **Hypothesis Testing**:
- In cases where multiple variables may influence an outcome, regression allows for
controlling confounding factors. This ensures that the observed relationship between
the independent and dependent variables is not biased by other variables.
- Example: In medical studies, regression might control for age, gender, and pre-
existing conditions when examining the effectiveness of a new drug.
8. **Risk Assessment**:
- Regression models are used in risk management and insurance industries to assess
the likelihood of certain outcomes based on historical data. It helps in estimating risk
exposure and setting premiums.
- Example: In finance, regression models are used to predict the risk of default on
loans by analyzing factors like income, credit score, and debt levels.
- **Polynomial Regression**: Used when the relationship between the dependent and
independent variable is non-linear (e.g., modeling the growth of a population over time).
- **Ridge and Lasso Regression**: These are types of regularized regression methods
that prevent overfitting by adding penalties to the regression model.
Using multiple linear regression, the company could develop a model to predict sales.
This helps the company decide how much to spend on advertising and what price
points to set in order to maximize revenue.
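As a hedged sketch of such a model (with made-up advertising, price, and sales figures), a multiple linear regression can be fitted by ordinary least squares using `numpy.linalg.lstsq`:

```python
import numpy as np

# Made-up monthly data: advertising spend, unit price, and resulting sales
advertising = np.array([10, 12, 15, 18, 20, 22, 25, 28], dtype=float)
price       = np.array([9.5, 9.4, 9.0, 8.8, 8.5, 8.6, 8.2, 8.0])
sales       = np.array([120, 128, 150, 165, 180, 182, 205, 220], dtype=float)

# Design matrix with an intercept column: sales ≈ b0 + b1*advertising + b2*price
X = np.column_stack([np.ones_like(advertising), advertising, price])
coeffs, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1, b2 = coeffs

print(f"sales ≈ {b0:.1f} + {b1:.2f}*advertising + {b2:.2f}*price")

# Predicted sales for a planned month (assumed inputs)
print("Predicted sales:", b0 + b1 * 24 + b2 * 8.4)
```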
**Correlation** and **regression analysis** are both statistical tools used to study the
relationship between variables, but they have different purposes and characteristics.
Here's a comparison of the two:
### 1. **Purpose**:
### 2. **Focus**:
- **Correlation**: Focuses on the degree to which two variables are related, without
making assumptions about dependence or independence. It treats both variables
equally and does not distinguish between dependent and independent variables.
### 3. **Directionality**:
### 4. **Output**:
### 5. **Causality**:
- **Correlation**: Does not imply causality. Even if two variables are strongly
correlated, it doesn't mean that one variable causes the other.
### 6. **Symmetry**:
- **Correlation**: Is **symmetric**. The correlation between \( X \) and \( Y \) is the
same as the correlation between \( Y \) and \( X \).
### 8. **Interpretation**:
- **Correlation**: Simply indicates whether and how strongly two variables are related
(positive or negative), but doesn’t specify the magnitude of change in one variable due
to change in the other.
### 9. **Applications**:
- **Correlation**:
\[
r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 -
(\sum y)^2]}}
\]
\[
Y = a + bX
\]
Where:
- \( Y \) = Dependent variable
- \( X \) = Independent variable
- \( a \) = Intercept
- \( b \) = Slope (regression coefficient of \( Y \) on \( X \))
In regression analysis, there are two regression lines: the line of regression of \( Y \) on \( X \) and the line of regression of \( X \) on \( Y \). These two lines are usually not the same unless the correlation coefficient \( r = \pm 1 \). Below are the equations for each line:
This equation gives the predicted value of \( Y \) for a given value of \( X \). It is generally
of the form:
\[
Y = a + bX
\]
Where:
- \( Y \) = Dependent variable
- \( X \) = Independent variable
- \( a \) = Intercept
- \( b \) = Regression coefficient of \( Y \) on \( X \), calculated as:

\[
b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}
\]

\[
a = \frac{\sum Y - b(\sum X)}{n}
\]
This equation gives the predicted value of \( X \) for a given value of \( Y \). It is of the
form:
\[
X = c + dY
\]
Where:
- \( X \) = Dependent variable
- \( Y \) = Independent variable
- \( c \) = Intercept
- \( d \) = Regression coefficient of \( X \) on \( Y \), calculated as:

\[
d = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum Y^2) - (\sum Y)^2}
\]

\[
c = \frac{\sum X - d(\sum Y)}{n}
\]
\]
These regression lines help in making predictions and understanding how changes in
one variable affect another. However, the two lines are typically not the same, reflecting
the difference in how \( Y \) affects \( X \) versus how \( X \) affects \( Y \).