STATISTICS NOTES DIPLOMA
1. Introduction to Statistics
2. Data Collection
3. Data Representation
4. Measures of Central Tendency
5. Measures of Dispersion
Range
Quartiles and interquartile range
Variance and standard deviation
Coefficient of variation
6. Probability
7. Random Variables and Probability Distributions
8. Hypothesis Testing
9. Correlation and Regression
10. Time Series Analysis
11. Index Numbers
12. Statistical Quality Control
13. Introduction to Econometrics
Model specification and estimation
Assumptions of the classical linear regression model
1. Introduction to Statistics
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It
provides tools and methodologies to make informed decisions based on data.
Statistics: Derived from the Latin word status, meaning state, reflecting its historical use in
government and administration.
Modern definition: The branch of mathematics dealing with the collection, analysis,
interpretation, presentation, and organization of data.
Scope:
o Descriptive Statistics: Summarizing and presenting data in a meaningful way.
o Inferential Statistics: Making predictions, decisions, or generalizations about a
population based on sample data.
Importance of Statistics
Used across diverse fields such as business, healthcare, social sciences, engineering, and
government.
Helps in decision-making under uncertainty.
Aids in designing experiments and surveys.
Enables the identification of trends and patterns in data.
2. Data Collection
Data collection is the process of gathering information in a structured manner to analyze and
draw conclusions. Accurate data collection is essential for the validity of statistical analysis.
Types of Data
1. Primary Data:
o Definition: Data collected directly by the researcher for a specific purpose.
o Examples: Surveys, interviews, experiments.
o Advantages: Tailored to specific needs, more accurate and reliable.
o Disadvantages: Time-consuming and expensive.
2. Secondary Data:
o Definition: Data collected by someone else, used for analysis.
o Examples: Government reports, research articles, company records.
o Advantages: Readily available, cost-effective, saves time.
o Disadvantages: May not fully meet the researcher’s needs, could be outdated or biased.
Methods of Data Collection
1. Survey Method
Questionnaires:
o Structured (closed-ended questions).
o Unstructured (open-ended questions).
Interviews:
o Face-to-face, telephonic, or online.
Advantages: Captures responses directly from the target audience.
Disadvantages: May suffer from biases (e.g., interviewer bias, respondent bias).
2. Observation Method
Types:
o Direct Observation: Observing behavior or events directly.
o Indirect Observation: Analyzing recorded data or traces of activity.
Advantages: Provides real-time, unbiased data.
Disadvantages: Limited to observable phenomena; time-intensive.
3. Experimental Method
Data is generated under controlled conditions by manipulating one or more variables and
observing the effect on an outcome.
4. Document Analysis
Use of existing records or documents such as books, journals, financial reports, and government
publications.
Advantages: Cost-effective and provides historical perspectives.
Disadvantages: Limited to what has already been recorded; potential biases in the original
source.
5. Focus Groups
Guided discussions with a small group of participants, used to gather qualitative insights.
Sampling Methods
Sampling is the process of selecting a subset (sample) from a population to represent the whole.
Simple Random Sampling: Every individual has an equal chance of being selected.
Stratified Sampling: Population divided into strata (groups), and samples taken from each
stratum.
Cluster Sampling: Population divided into clusters, and entire clusters are randomly selected.
Systematic Sampling: Selecting every nth item from a list after a random start.
Conclusion
Effective data collection is the foundation of reliable statistical analysis. Selecting the right
method and ensuring accuracy are critical for drawing meaningful conclusions.
Key Concepts
1. Population:
o Definition: The entire group of individuals or items under study.
2. Variable:
o Definition: A characteristic or attribute that can vary.
o Types:
Quantitative Variables: Numeric values (e.g., height, weight, age).
Discrete: Countable values (e.g., number of children).
Continuous: Any value within a range (e.g., height in cm).
Qualitative Variables: Non-numeric categories (e.g., gender, color).
3. Data:
o Definition: Collected facts or information.
o Types:
Primary Data: Collected directly by the researcher.
Secondary Data: Collected from existing sources.
o Levels of Measurement:
Nominal: Categories without order (e.g., gender).
Ordinal: Categories with order (e.g., rankings).
Interval: Numeric data without a true zero (e.g., temperature).
Ratio: Numeric data with a true zero (e.g., income).
Branches of Statistics
1. Descriptive Statistics:
o Focus: Summarizing data.
o Tools: Tables, graphs, measures of central tendency (mean, median, mode), and
measures of dispersion (range, variance, standard deviation).
2. Inferential Statistics:
o Focus: Making inferences about a population based on sample data.
o Tools: Hypothesis testing, confidence intervals, and regression analysis.
Applications of Statistics
Key Takeaways
3. Data Representation
Data representation involves organizing and displaying data in a meaningful way to facilitate
understanding and analysis. Proper representation helps identify trends, patterns, and
relationships within the data.
1. Tabular Representation
Example (frequency distribution table):
Class Interval   Frequency   Cumulative Frequency
0–10             5           5
11–20            10          15
21–30            8           23
2. Graphical Representation
a. Bar Chart
Displays categorical data as rectangular bars whose heights represent frequencies.
b. Histogram
Displays the frequency distribution of continuous data using adjacent bars over class intervals.
c. Pie Chart
A circular chart divided into sectors, where each sector represents a proportion of the total.
Features:
o Useful for showing relative percentages.
o Each sector’s angle = (Frequency / Total Frequency) × 360°.
d. Line Graph
Plots data points connected by lines, typically to show change over time.
e. Scatter Plot
Plots paired values of two variables as points to reveal relationships.
f. Box Plot
Summarizes a dataset’s distribution using five statistics: minimum, first quartile (Q1), median,
third quartile (Q3), and maximum.
Use:
o To identify outliers and variability.
3. Numerical Summaries
Descriptive Measures:
o Central Tendency: Mean, Median, Mode.
o Dispersion: Range, Variance, Standard Deviation.
Summarizes the data in a concise way.
Choosing the Right Representation
1. Type of Data:
o Categorical: Bar chart, pie chart.
o Numerical:
Discrete: Bar chart.
Continuous: Histogram, line graph, scatter plot.
2. Objective:
o Trends: Line graph.
o Comparisons: Bar chart.
o Relationships: Scatter plot.
o Proportions: Pie chart.
Conclusion
Data representation is a critical step in data analysis. By choosing appropriate methods, data can
be visualized effectively, making it easier to interpret and draw conclusions.
4. Measures of Central Tendency
Measures of central tendency describe a dataset by identifying a central point around which the
data are distributed. They summarize the data into a single representative value.
1. Mean
The mean is the sum of all data points divided by the number of data points.
Formula:
x̄ = Σx_i / n
Example:
For 10, 20, 30, 40, 50: x̄ = 150 / 5 = 30.
Advantages:
Uses every observation and is well suited to further calculation.
Disadvantages:
Sensitive to outliers. For example, in 10, 20, 30, 40, 500 the
mean becomes 120, which does not represent the central tendency well.
2. Median
The median is the middle value in a sorted dataset. If the dataset has an even number of values,
the median is the average of the two middle values.
Steps to Calculate:
1. Sort the data in ascending order.
2. If n is odd, take the middle value; if n is even, average the two middle values.
Example:
For 10, 20, 30, 40, 500 the median is 30.
Advantages:
Not affected by extreme values.
Disadvantages:
Ignores the magnitude of values away from the middle.
3. Mode
The mode is the value that occurs most frequently in the dataset. A dataset may have:
no mode, one mode (unimodal), or more than one mode (bimodal or multimodal).
Example:
In 2, 3, 3, 5, 7 the mode is 3.
Advantages:
Simple to identify.
Applicable to categorical data.
Disadvantages:
May not exist, or may not be unique.
Other Measures of Central Tendency
1. Weighted Mean: x̄_w = Σ(w_i x_i) / Σw_i
2. Geometric Mean: GM = (x_1 · x_2 · … · x_n)^(1/n)
3. Harmonic Mean: HM = n / Σ(1/x_i)
Choosing the Right Measure
Measure   Best Used When                               Limitation
Mean      Data is symmetric, without extreme outliers. Sensitive to outliers.
Median    When data is skewed or contains outliers.    Ignores data beyond the middle.
Mode      For categorical or discrete data.            May not exist or may not be unique.
Conclusion
Measures of central tendency are fundamental tools for summarizing data. Choosing the
appropriate measure depends on the dataset and the context of the analysis.
5. Measures of Dispersion
Measures of dispersion describe the spread or variability of data in a dataset. They indicate how
much the data points differ from each other and from the central tendency.
1. Range
The range is the simplest measure of dispersion, representing the difference between the highest
and lowest values.
Formula:
Range = Maximum value − Minimum value
Example:
For 4, 8, 6, 5, 9: Range = 9 − 4 = 5.
Advantages:
Simple to calculate.
Gives a quick sense of variability.
Disadvantages:
Based only on the two extreme values, so it is distorted by outliers.
2. Quartiles and Interquartile Range (IQR)
The IQR measures the spread of the middle 50% of the data by calculating the difference
between the third quartile (Q3) and the first quartile (Q1).
Formula:
IQR = Q3 − Q1
Example:
If Q1 = 25 and Q3 = 75, then IQR = 75 − 25 = 50.
Advantages:
Robust to outliers.
Disadvantages:
Ignores the data outside the middle 50%.
3. Variance
Variance measures the average squared deviation of each data point from the mean.
Formulas:
For a population: σ² = Σ(x_i − μ)² / N
For a sample: s² = Σ(x_i − x̄)² / (n − 1)
Where:
x_i = each data point.
μ = population mean.
x̄ = sample mean.
N = population size.
n = sample size.
Example:
For the sample 2, 4, 6 the mean is 4, so s² = ((−2)² + 0² + 2²) / (3 − 1) = 4.
Advantages:
Uses every observation; the basis for many statistical methods.
Disadvantages:
Expressed in squared units, which makes it hard to interpret directly.
4. Standard Deviation
Standard deviation is the square root of variance and represents the average deviation of data
points from the mean.
Formulas:
For a population: σ = √(Σ(x_i − μ)² / N)
For a sample: s = √(Σ(x_i − x̄)² / (n − 1))
Example:
Continuing the variance example, s = √4 = 2.
Advantages:
Expressed in the same units as the data.
Widely used in data analysis.
Disadvantages:
Like the variance, sensitive to outliers.
5. Coefficient of Variation
The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean,
allowing the variability of datasets with different units or scales to be compared.
Formula:
CV = (s / x̄) × 100%
Example:
If s = 2 and x̄ = 4, then CV = 50%.
Advantages:
Unit-free, so it permits comparison across datasets.
Disadvantages:
Unstable or undefined when the mean is close to zero.
Measure               Strength                                      Limitation
Range                 Quick to compute                              Uses only the two extreme values.
Interquartile Range   Robust to outliers, focuses on central data  Ignores tails of the dataset.
Variance              Basis for many statistical methods            Hard to interpret (squared units).
Standard Deviation    Same units as the data                        Sensitive to outliers.
Conclusion
Measures of dispersion are essential for understanding the spread of data. While range and IQR
provide quick insights, variance and standard deviation are more comprehensive. The choice of
measure depends on the dataset and analysis goals.
6. Probability
Probability is the branch of mathematics that deals with measuring the likelihood of an event
occurring. It is a fundamental concept in statistics and serves as the basis for inferential analysis.
1. Experiment
A process or action with an uncertain outcome (e.g., rolling a die, flipping a coin).
2. Sample Space (S)
The set of all possible outcomes of an experiment.
Example:
o Rolling a die: S = {1, 2, 3, 4, 5, 6}
o Flipping a coin: S = {Head, Tail}
3. Event (E)
A subset of the sample space; one or more outcomes of interest.
Example:
o Rolling an even number on a die: E = {2, 4, 6}
4. Probability (P)
The measure of how likely an event is to occur, expressed as a number between 0 and 1.
Formula:
P(E) = Number of favorable outcomes / Total number of possible outcomes
Types of Events
1. Simple Event
An event consisting of a single outcome (e.g., rolling a 4).
2. Compound Event
An event consisting of two or more outcomes (e.g., rolling an even number).
3. Independent Events
Events where the outcome of one does not affect the other.
4. Dependent Events
Events where the outcome of one affects the probability of the other.
5. Mutually Exclusive Events
Events that cannot occur at the same time (e.g., a single die roll cannot be both 2 and 5).
6. Complementary Events
An event and its complement together make up the entire sample space (E and E′).
Rules of Probability
1. Addition Rule
P(A ∪ B) = P(A) + P(B) − P(A ∩ B); for mutually exclusive events, P(A ∪ B) = P(A) + P(B).
2. Multiplication Rule
Independent Events: P(A ∩ B) = P(A) × P(B)
Dependent Events: P(A ∩ B) = P(A) × P(B | A)
Where P(B | A) is the probability of B occurring given A has occurred.
3. Complement Rule
P(E′) = 1 − P(E)
Types of Probability
1. Classical Probability
Based on equally likely outcomes: P(E) = favorable outcomes / total outcomes.
2. Empirical Probability
Based on observed relative frequencies.
Example: If a coin is flipped 100 times and lands on heads 55 times, P(Head) = 55/100 = 0.55.
3. Subjective Probability
Based on personal judgment or experience rather than calculation.
Example: Estimating a 70% chance of rain tomorrow based on past weather patterns.
Bayes' Theorem
A formula to find the probability of an event based on prior knowledge of conditions related to
the event.
Formula:
P(A | B) = [P(B | A) × P(A)] / P(B)
Where:
P(A) is the prior probability of A, P(B) is the probability of the evidence B, and
P(B | A) is the probability of B given A.
Conclusion
Probability provides the language of uncertainty on which the inferential methods in the
following sections are built.
7. Random Variables and Probability Distributions
1. Random Variable
A random variable is a variable whose value is determined by the outcome of a random event
or experiment. It can take on different values, each associated with a certain probability.
Types:
1. Discrete Random Variable: A random variable that can take on a countable number of
distinct values.
o Example: The number of heads in three coin tosses (0, 1, 2, or 3).
2. Continuous Random Variable: A random variable that can take on any value within a
given range or interval. The possible values are uncountably infinite.
o Example: The height of individuals in a population, which can take any value within a
range (e.g., 150 cm to 200 cm).
2. Probability Distribution
A probability distribution is a function that describes the likelihood of different outcomes for a
random variable. It assigns probabilities to each possible value or range of values of the random
variable.
1. Discrete Probability Distributions
For discrete random variables, the probability distribution is represented by a probability mass
function (PMF). The PMF gives the probability of each possible outcome.
Properties:
o P(X = x_i) is the probability that the random variable X takes the value x_i.
o The sum of all probabilities must equal 1: Σ P(X = x_i) = 1
Common discrete distributions:
1. Binomial Distribution: Models the number of successes in a fixed number n of
independent trials, each with success probability p.
o PMF: P(X = k) = C(n, k) p^k (1 − p)^(n − k)
Where:
C(n, k) is the number of ways to choose k successes from n trials.
2. Poisson Distribution: Models the number of events that occur in a fixed interval of time
or space when the events happen independently and at a constant average rate.
o Parameter: λ (average rate of occurrence).
o PMF: P(X = k) = (λ^k e^(−λ)) / k!
Where:
k = the number of events observed in the interval.
3. Geometric Distribution: Models the number of trials until the first success in a series of
independent Bernoulli trials.
o PMF: P(X = k) = (1 − p)^(k − 1) p
Where:
p = probability of success on each trial; k = the trial on which the first success occurs.
2. Continuous Probability Distributions
For continuous random variables, the distribution is represented by a probability density
function (PDF), f(x).
Properties:
o The probability that X lies within a range a ≤ X ≤ b is the area under
the PDF curve from a to b.
o The total area under the curve is 1: ∫_{−∞}^{∞} f(x) dx = 1
Common continuous distributions:
1. Normal Distribution: A symmetric, bell-shaped distribution characterized by its mean (μ)
and standard deviation (σ).
o Properties:
Mean, median, and mode are all equal.
The distribution is symmetric about the mean.
Approximately 68% of data falls within one standard deviation of the mean, 95%
within two, and 99.7% within three.
2. Uniform Distribution: A distribution where all outcomes are equally likely within a
given range. The PDF is constant within the range [a, b].
o PDF: f(x) = 1 / (b − a) for a ≤ x ≤ b.
3. Exponential Distribution: Models the time between events in a Poisson process (i.e.,
events that occur independently and at a constant average rate).
o PDF: f(x) = λ e^(−λx) for x ≥ 0
Where:
λ = the average rate at which events occur.
3. Cumulative Distribution Function (CDF)
The cumulative distribution function (CDF) gives the probability that the random variable
X is less than or equal to a certain value x: F(x) = P(X ≤ x). It is used for both discrete
and continuous random variables.
4. Expected Value and Variance
1. Expected Value:
o The expected value is the long-run average or mean of the random variable,
weighted by the probabilities.
o For Discrete Variables: E[X] = Σ x_i P(X = x_i)
2. Variance:
o Variance measures the spread of the distribution around the expected value.
o For Discrete Variables: Var(X) = Σ (x_i − E[X])² P(X = x_i) = E[X²] − (E[X])²
Conclusion
Random variables and their associated probability distributions are fundamental concepts for
modeling uncertainty in real-world data. Understanding discrete and continuous distributions, as
well as key measures such as expected value and variance, is essential for performing statistical
analysis and making predictions in many fields, including finance, engineering, and natural
sciences.
8. Hypothesis Testing
Hypothesis testing is a statistical method used to make inferences or draw conclusions about a
population based on sample data. It helps evaluate whether a claim or assumption (the
hypothesis) about a population parameter is supported by evidence from the sample.
1.1. Hypotheses
Null Hypothesis (H0): The default claim of no effect or no difference (e.g., H0: μ = 50).
Alternative Hypothesis (Ha): The claim for which the test seeks evidence.
Two-Tailed Test: Tests for any significant difference (e.g., Ha: μ ≠ 50).
One-Tailed Test: Tests for a specific direction of difference:
o Right-Tailed Test: Ha: μ > 50
o Left-Tailed Test: Ha: μ < 50
3. Types of Errors
1. Type I Error (α): Rejecting H0 when it is actually true.
2. Type II Error (β): Failing to reject H0 when it is actually false.
3. Power of a Test:
The probability of correctly rejecting H0 when it is false.
o Power = 1 − β.
4.1. Z-Test
Used for large samples (n > 30) or when the population variance is known.
Test Statistic:
Z = (x̄ − μ0) / (σ / √n)
Where:
x̄ = sample mean, μ0 = hypothesized population mean, σ = population standard deviation,
n = sample size.
4.2. t-Test
Used for small samples (n ≤ 30) or when the population variance is unknown.
Test Statistic:
t = (x̄ − μ0) / (s / √n), with n − 1 degrees of freedom
Where:
s = sample standard deviation.
Types of t-tests:
1. One-Sample t-Test: Tests the sample mean against a known value.
2. Independent Two-Sample t-Test: Compares means of two independent groups.
3. Paired t-Test: Compares means of paired or dependent samples.
5. P-Value Approach
The P-value is the probability of observing a test statistic as extreme as, or more extreme than,
the one observed, assuming H0 is true. H0 is rejected when the P-value is smaller than α.
6. Worked Example
Scenario: A company claims that the average weight of its product is 500 grams. A sample of 30
products has a mean weight of 495 grams with a standard deviation of 10 grams. Test the claim
at α = 0.05.
Steps:
1. Hypotheses:
o H0: μ = 500
o Ha: μ ≠ 500
2. Significance Level: α = 0.05
3. Test Statistic:
Use the t-test because the population standard deviation is unknown.
t = (495 − 500) / (10 / √30) ≈ −2.74
4. Critical Value: t(0.025, 29) ≈ ±2.045
5. Decision: |−2.74| > 2.045, so reject H0.
6. Conclusion:
There is sufficient evidence to conclude that the mean weight is not 500 grams.
7. Conclusion
Hypothesis testing provides a structured framework for evaluating claims about population
parameters using sample evidence.
9. Correlation and Regression
Correlation and regression are statistical tools used to analyze and interpret relationships between
two or more variables. While correlation measures the strength and direction of a relationship,
regression helps predict one variable based on another.
1. Correlation
1.1. Definition
Correlation measures the degree to which two variables are linearly related. It indicates whether
an increase in one variable corresponds to an increase or decrease in another.
1.2. Correlation Coefficient (r)
The correlation coefficient quantifies the strength and direction of a linear relationship between
two variables, ranging from −1 (perfect negative) through 0 (no linear relationship) to +1
(perfect positive).
Formula:
r = Σ(x_i − x̄)(y_i − ȳ) / √[Σ(x_i − x̄)² · Σ(y_i − ȳ)²]
Where:
x̄ and ȳ are the means of the x and y values.
1.4. Limitations
Correlation does not imply causation.
r captures only linear relationships and can be distorted by outliers.
2. Regression
Regression analysis predicts the value of a dependent variable (y) based on one or more
independent variables (x).
1. Simple Linear Regression: One independent variable predicts one dependent variable.
2. Multiple Linear Regression: Two or more independent variables predict a dependent variable.
3. Non-Linear Regression: Models non-linear relationships.
The relationship between the dependent variable (y) and independent variable (x) is
modeled as:
y = b0 + b1x
Where:
y: Predicted value,
x: Independent variable,
b0: Intercept (value of y when x = 0),
b1: Slope (change in y for a unit change in x).
The parameters b0 and b1 are estimated using the least squares method, which
minimizes the sum of squared errors (differences between observed and predicted values).
Slope (b1): b1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
Intercept (b0): b0 = ȳ − b1x̄
The R² value indicates how well the regression line fits the data.
Formula:
R² = 1 − (SS_residual / SS_total); for simple linear regression, R² = r².
Interpretation:
o R² = 1: Perfect fit (all data points lie on the regression line).
o R² = 0: No predictive power.
Scenario:
A study examines the relationship between hours of study (x) and test scores (y) for 10
students (five of the observations are shown below).
Hours (x)   Score (y)
2           50
4           65
6           70
8           85
10          95
Solution (using the five pairs shown):
1. Means: x̄ = 6, ȳ = 73.
2. Slope: b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = 220 / 40 = 5.5; Intercept: b0 = 73 − 5.5(6) = 40.
3. Regression equation: ŷ = 40 + 5.5x, with r ≈ 0.99 (a very strong positive relationship).
1. Correlation:
o Analyzing relationships in business (e.g., sales vs. advertising expenses).
o Determining associations in health studies (e.g., smoking vs. lung cancer risk).
2. Regression:
o Predicting outcomes (e.g., house prices based on size).
o Identifying key drivers in data (e.g., factors affecting student performance).
8. Conclusion
Correlation provides insights into the strength and direction of relationships between variables,
while regression allows for predictions and deeper understanding of how variables interact.
Together, these tools are essential for statistical modeling and decision-making.
10. Time Series Analysis
Time series analysis involves studying data points collected or recorded at specific time intervals
to identify patterns, trends, and seasonal variations. It is widely used in fields like finance,
economics, meteorology, and engineering.
Components of a Time Series:
1. Trend (T): The long-term movement or direction in the data (upward or downward).
2. Seasonality (S): Regular, periodic fluctuations (e.g., sales increasing during holidays).
3. Cyclic Variations (C): Non-periodic, longer-term oscillations due to economic or business cycles.
4. Irregular Variations (I): Random, unpredictable variations caused by unexpected factors.
3.2. Decomposition
Separating a time series into its components (Trend, Seasonality, Cyclic, Irregular).
Weighted Moving Average (WMA): Assigns weights to data points, giving more importance to
recent values.
Formula:
WMA = Σ(w_i · x_i) / Σw_i, where larger weights w_i are assigned to more recent values x_i.
4.1. Stationarity
A time series is stationary if its properties (mean, variance, and autocorrelation) do not change
over time. Stationarity is essential for many forecasting methods.
4.2. Differencing
Replaces each value with its change from the previous period, y′_t = y_t − y_{t−1}, to make a
non-stationary series stationary.
5. Autocorrelation
Autocorrelation Function (ACF): Measures the correlation between a time series and its lagged
values.
Partial Autocorrelation Function (PACF): Measures the direct correlation between a time series
and its lagged values, controlling for intermediate lags.
6. Forecasting Models
The AutoRegressive Integrated Moving Average (ARIMA) model is one of the most
commonly used time series forecasting methods.
Components:
o AR (p): Autoregressive component (based on past values),
o I (d): Differencing (to achieve stationarity),
o MA (q): Moving average component (based on past errors).
Notation: ARIMA(p, d, q)
Seasonal ARIMA: SARIMA(p, d, q)(P, D, Q)m
Where P, D, Q are seasonal components and m is the length of the seasonal
cycle.
Scenario:
A retail store tracks monthly sales (in units) for the past two years:
Month Sales
... ...
Use metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE).
9. Conclusion
Time series analysis is a powerful tool for understanding and predicting temporal data. By
identifying patterns and trends, it enables better decision-making in various industries. Mastery
of techniques like decomposition, smoothing, and forecasting models is crucial for accurate and
reliable insights.
11. Index Numbers
Index numbers are statistical measures used to represent changes in a variable or group of
variables over time. They are often used to track price levels, economic activity, or other
measurable phenomena and express them relative to a base value.
Types of Index Numbers
1. Price Index: Measures changes in the prices of goods or services over time.
2. Quantity Index: Measures changes in the quantities of goods produced, sold, or consumed.
3. Value Index: Measures changes in the total monetary value (price × quantity).
4. Special Purpose Index: Created for specific studies, such as stock market indices (e.g.,
S&P 500, Nifty 50).
The base year serves as a benchmark and should be normal, free from unusual fluctuations.
Choose items representative of the phenomenon being measured (e.g., essential goods for CPI).
6. Common Formulas
Laspeyres Price Index (base-year quantities as weights):
P_L = [Σ(P1 · Q0) / Σ(P0 · Q0)] × 100
Paasche Price Index (current-year quantities as weights):
P_P = [Σ(P1 · Q1) / Σ(P0 · Q1)] × 100
Value Index:
V = [Σ(P1 · Q1) / Σ(P0 · Q0)] × 100
Where:
P0, Q0 = base-year prices and quantities; P1, Q1 = current-year prices and quantities.
CPI measures changes in the cost of a fixed basket of goods and services over time.
Formula:
CPI = [Σ(P1 · W) / Σ(P0 · W)] × 100
Where:
W = weights reflecting the relative importance of each item in the basket.
Scenario:
A price index is constructed for a basket of three goods with the following data:
Item   Base Year Price (P0)   Current Year Price (P1)   Quantity (Q0)
A 20 25 10
B 30 35 15
C 40 50 20
Solution:
1. Laspeyres index: P_L = (25·10 + 35·15 + 50·20) / (20·10 + 30·15 + 40·20) × 100
= 1775 / 1450 × 100 ≈ 122.41.
2. Interpretation:
Prices have increased by approximately 22.41% compared to the base year.
11. Conclusion
Index numbers are essential tools in statistical analysis for measuring relative changes in
economic and social phenomena. By selecting appropriate methods and understanding their
limitations, they provide meaningful insights into trends and patterns over time.
12. Statistical Quality Control
Statistical Quality Control (SQC) involves the use of statistical methods to monitor and improve
the quality of processes and products. It is a key tool in quality management and ensures that
processes operate efficiently, producing goods or services within acceptable quality standards.
1. Importance of SQC
2. Types of SQC
1. Descriptive Statistics: Summarizes data using measures like mean, variance, and standard
deviation.
2. Statistical Process Control (SPC): Monitors processes using control charts.
3. Acceptance Sampling: Determines the acceptability of a batch of products based on a sample.
SPC focuses on monitoring and controlling processes to ensure they operate at their full
potential.
Control charts are graphical tools used to study process variability over time. A typical chart
plots sample statistics against three lines:
CL (centre line): the process average,
UCL (upper control limit): CL + 3σ,
LCL (lower control limit): CL − 3σ,
Where:
σ is the standard deviation of the plotted statistic; points falling outside the limits suggest
the process may be out of control.
4. Acceptance Sampling
Acceptance sampling determines whether a batch of items meets quality standards based on a
sample.
1. Single Sampling Plan: A fixed-size sample is inspected, and a decision is made to accept or reject
the lot.
2. Double Sampling Plan: Two samples are taken if the decision from the first sample is
inconclusive.
3. Sequential Sampling Plan: Samples are taken one at a time until a decision is reached.
Operating Characteristic (OC) Curve:
Shows the probability of accepting a lot based on the defect proportion. It helps evaluate the
effectiveness of a sampling plan.
Scenario:
A manufacturing process produces items with weights measured in grams. A sample of 5 items is
taken daily for 10 days. The data is as follows:
Steps:
8. Benefits of SQC
10. Conclusion
SQC provides systematic, data-driven methods for monitoring processes, detecting variation,
and sustaining product quality.
13. Introduction to Econometrics
Econometrics applies statistical and mathematical methods to economic data in order to test
economic theories, estimate relationships, and forecast outcomes.
1. Objectives of Econometrics
Testing economic theories, estimating economic relationships, and forecasting future outcomes.
2. Types of Econometric Models
1. Linear Regression Model: Models the relationship between a dependent variable and one or
more independent variables.
2. Time Series Model: Analyzes data collected over time, accounting for trends, seasonality, and
autocorrelation.
3. Panel Data Model: Combines cross-sectional and time-series data.
4. Simultaneous Equations Model: Models systems of interdependent equations where variables
influence each other.
5. Assumptions of the Classical Linear Regression Model (CLRM)
1. Linearity: The model is linear in its parameters.
2. Zero mean of errors: E(ε) = 0.
3. Homoscedasticity: The error variance is constant across observations.
4. No autocorrelation: The errors are uncorrelated with one another.
5. No perfect multicollinearity among the independent variables.
6. Normality of the errors (required for exact inference).
6. Estimation
The ordinary least squares (OLS) estimator in matrix form:
β̂ = (X′X)⁻¹X′Y
Where X is the matrix of independent variables and Y is the vector of dependent
variables.
7. Hypothesis Testing on Coefficients
Decision Rule:
Reject H0 if the p-value is less than the chosen significance level (α).
8. Model Evaluation
1. Goodness of Fit:
o R²: Proportion of variation in Y explained by the model.
o Adjusted R²: Accounts for the number of predictors in the model.
2. Model Diagnostics:
o Residual plots for homoscedasticity.
o Variance Inflation Factor (VIF) for multicollinearity.
o Durbin-Watson test for autocorrelation.
9. Worked Example
Scenario:
A researcher wants to study the relationship between advertising expenditure (X) and sales
(Y):
X (advertising)   Y (sales)
10                25
15                30
20                50
25                65
30                70
Steps:
1. Compute the means: X̄ = 20, Ȳ = 48.
2. Estimate the slope: β1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 625 / 250 = 2.5.
3. Estimate the intercept: β0 = 48 − 2.5(20) = −2.
4. Fitted model: Ŷ = −2 + 2.5X.
12. Conclusion
Econometrics provides a systematic framework to analyze economic data and validate theories.
By combining economic reasoning with statistical techniques, it facilitates informed decision-
making in academia, business, and policymaking. Mastery of econometric tools is essential for
deriving meaningful insights and solving real-world problems.
END