Presentation 4

Uploaded by

mujtaba

Statistics

Basic terminologies and types of statistics


What is Statistics?
• Statistics is simply defined as the study and manipulation of data. As
discussed in the introduction, statistics deals with the analysis and
computation of numerical data. Definitions of statistics given by other
sources follow.
• According to Merriam-Webster dictionary, statistics is defined as “classified
facts representing the conditions of a people in a state – especially the facts
that can be stated in numbers or any other tabular or classified arrangement”.
• According to statistician Sir Arthur Lyon Bowley, statistics is defined as
“Numerical statements of facts in any department of inquiry placed in relation
to each other”.
Basics of Statistics

• The basics of statistics include measures of central tendency and
measures of dispersion. The measures of central tendency are the mean,
median and mode; the measures of dispersion include variance and standard deviation.
• Mean is the average of the observations.
• Median is the central value when observations are arranged in order.
• Mode is the most frequent observation in a data set.
• Variance is a measure of how spread out a collection of data is.
Standard deviation is a measure of the dispersion of data from the
mean. The square of the standard deviation is equal to the variance.
Mathematical Statistics

• Mathematical statistics is the application of mathematics to statistics,
which was initially conceived as the science of the state: the
collection and analysis of facts about a country, such as its economy,
military, population, and so forth.

• Mathematical techniques used in mathematical statistics include
mathematical analysis, linear algebra, stochastic analysis, differential
equations and measure-theoretic probability theory.
Types of Statistics

• Basically, there are two types of statistics.


• Descriptive Statistics
• Inferential Statistics

• In descriptive statistics, the data or collection of data is
described in summary. Inferential statistics, in turn, is used to
interpret those descriptive summaries and draw conclusions from them.
Both types are widely used.
Descriptive Statistics

• The data is summarised and explained in descriptive statistics. The
summary is built from a sample of a population using measures
such as the mean and standard deviation. Descriptive statistics is a
way of organising, representing, and explaining a set of data using
charts, graphs, and summary measures. Histograms, pie charts, bar charts,
and scatter plots are common ways to summarise data and present it in
tables or graphs. Descriptive statistics are just that: descriptive. They
make no claims beyond the data collected.
Inferential Statistics

• We attempt to interpret the meaning of descriptive statistics using
inferential statistics. Once data has been collected, evaluated,
and summarised, we use inferential statistics to convey what it means.
The principles of probability are used in inferential
statistics to determine whether patterns found in a study sample may be
extrapolated to the wider population from which the sample was
drawn. Inferential statistics are used to test hypotheses and study
correlations between variables, and they can also be used to predict
population sizes. Inferential statistics are used to derive conclusions
and inferences from samples, i.e. to create accurate generalisations.
Measure of Central Tendency
• Measures of central tendency are statistical measures that describe the
center or average of a set of data points. They provide a single value that
represents the central or typical value of a dataset. The three main
measures of central tendency are the mean, median, and mode.
1. Mean:
• The mean, also known as the average, is calculated by summing up all the
values in a dataset and then dividing by the number of values.
• Formula: Mean = Sum of all values / Number of values.
• Example: For the dataset {10, 15, 20, 25, 30}, the mean is
(10 + 15 + 20 + 25 + 30) / 5 = 100 / 5 = 20
Measure of Central Tendency Cont..
2. Median:
• The median is the middle value in a dataset when it is arranged in
ascending or descending order.
• If there is an even number of values, the median is the average of the
two middle values.
• Example: For the dataset {10, 15, 20, 25, 30}, the median is 20. For
{10, 15, 20, 25}, the median is (15 + 20) / 2 = 17.5
Measure of Central Tendency Cont..
3. Mode:
• The mode is the value that appears most frequently in a dataset.
• A dataset may have no mode, one mode (unimodal), or more than one mode
(multimodal).
• Example: For the dataset {10, 15, 20, 20, 25, 30}, the mode is 20.
• These measures of central tendency provide different insights into the central
value of a dataset, and the choice of which one to use depends on the nature
of the data and the specific goals of the analysis.
• It's important to note that each measure has its strengths and limitations. The
mean is sensitive to extreme values (outliers), while the median is more
robust in the presence of outliers. The mode is especially useful for
categorical data or when identifying the most common category is essential.
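The three measures can be checked with Python's standard `statistics` module. This is a minimal sketch that reuses the slide datasets for the mean, the odd-length median and the mode; the four-value list, the tied-mode list and the list containing the outlier 300 are made-up values added only for illustration.

```python
import statistics

scores = [10, 15, 20, 25, 30]

# Mean: sum of all values divided by the number of values.
mean_value = sum(scores) / len(scores)

# Median: middle value of the sorted data (odd number of values).
median_odd = statistics.median(scores)

# With an even number of values, the two middle values are averaged.
even_scores = [4, 8, 15, 16]                  # hypothetical data
median_even = statistics.median(even_scores)  # (8 + 15) / 2

# Mode: most frequent value; multimode() returns every tied value.
mode_value = statistics.mode([10, 15, 20, 20, 25, 30])
all_modes = statistics.multimode([1, 1, 2, 2, 3])  # hypothetical data

# An outlier pulls the mean upward but leaves the median unchanged.
with_outlier = [10, 15, 20, 25, 300]          # hypothetical data
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(mean_value, median_odd, median_even, mode_value, all_modes)
print(mean_outlier, median_outlier)
```

Replacing 30 with 300 moves the mean from 20 to 74 while the median stays at 20, which is the robustness to outliers described above.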
Measure of Dispersion
• Measures of dispersion are descriptive statistics that describe how
similar a set of scores are to each other
• The more similar the scores are to each other, the lower the measure of
dispersion will be.
• The less similar the scores are to each other, the higher the measure of
dispersion will be.
• In general, the more spread out a distribution is, the larger the measure of
dispersion will be.
Measure of Dispersion Cont..
• Which of the distributions of scores has the larger dispersion?
[Figure: two distributions of scores, one above the other, each plotted with an x-axis from 1 to 10 and a y-axis from 0 to 125]
• The upper distribution has more dispersion because the scores are more spread out
• That is, they are less similar to each other
Measure of Dispersion Cont..
• There are three main measures of dispersion:
• The range
• The semi-interquartile range (SIR)
• Variance / standard deviation
• The Range.
• The range is defined as the difference between the largest score in the set of
data and the smallest score in the set of data, XL - XS
• What is the range of the following data:
4 8 1 6 6 2 9 3 6 9
• The largest score (XL) is 9; the smallest score (XS) is 1; the range is XL - XS = 9 - 1 = 8
When To Use the Range
• The range is used when
• you have ordinal data or
• you are presenting your results to people with little or no knowledge of
statistics
• The range is rarely used in scientific work as it is fairly insensitive
• It depends on only two scores in the set of data, XL and XS
• Two very different sets of data can have the same range:
1 1 1 1 9 vs 1 3 5 7 9
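As a small sketch, the range calculation and its insensitivity can be reproduced in Python. The helper name `score_range` is an illustrative choice, not something defined in the slides.

```python
def score_range(scores):
    """Return the largest score minus the smallest score (XL - XS)."""
    return max(scores) - min(scores)

# Data from the range example.
data = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
print(score_range(data))

# Two very different sets of data with the same range.
clustered = [1, 1, 1, 1, 9]
even_spread = [1, 3, 5, 7, 9]
print(score_range(clustered), score_range(even_spread))
```

Both comparison sets give a range of 8, because only the largest and smallest scores enter the calculation.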
The Semi-Interquartile Range
• The semi-interquartile range (or SIR) is defined as the difference between
the third and first quartiles, divided by two
• The first quartile is the 25th percentile
• The third quartile is the 75th percentile
• SIR = (Q3 - Q1) / 2
The Semi-Interquartile Range Example
• What is the SIR for the following data?
2 4 6 8 10 12 14 20 30 60
• 25 % of the scores are below 5, so 5 is the 25th percentile
• 5 is the first quartile
• 25 % of the scores are above 25, so 25 is the 75th percentile
• 25 is the third quartile
• SIR = (Q3 - Q1) / 2 = (25 - 5) / 2 = 10
• When To Use the SIR
• The SIR is often used with skewed data as it is insensitive to the extreme scores.
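The SIR calculation can be sketched in Python. The function names below are illustrative. Note one assumption: `statistics.quantiles` interpolates between scores using its default "exclusive" method, so the quartiles it reports for a small data set need not match the hand-picked quartiles 5 and 25 used in the example.

```python
import statistics

def sir_from_quartiles(q1, q3):
    """Semi-interquartile range from known quartiles: (Q3 - Q1) / 2."""
    return (q3 - q1) / 2

def sir(scores):
    """SIR with quartiles estimated by statistics.quantiles (default method)."""
    # With n=4 the function returns the three cut points [Q1, Q2, Q3].
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return sir_from_quartiles(q1, q3)

# Quartiles as read off in the example: Q1 = 5, Q3 = 25.
print(sir_from_quartiles(5, 25))

# Quartiles estimated from the scores by interpolation.
scores = [2, 4, 6, 8, 10, 12, 14, 20, 30, 60]
print(sir(scores))
```

Different textbooks and libraries use different quartile conventions, so it is worth stating which one was used when reporting an SIR.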
Variance
• Variance is defined as the average of the squared deviations: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
What Does the Variance Formula Mean?
• First, it says to subtract the mean from each of the scores
• This difference is called a deviate or a deviation score
• The deviate tells us how far a given score is from the typical, or average, score
• Thus, the deviate is a measure of dispersion for a given score
Standard Deviation
• When the deviate scores are squared in variance, their unit of
measure is squared as well
• E.g. If people’s weights are measured in pounds, then the variance of the
weights would be expressed in pounds² (or squared pounds)
• Since squared units of measure are often awkward to deal with, the
square root of variance is often used instead
• The standard deviation is the square root of variance

• Standard deviation = variance


• Variance = standard deviation2
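The steps from deviation scores to variance to standard deviation can be sketched as follows. The weights are made-up values chosen for easy arithmetic, and the variable names are illustrative.

```python
import math

weights_lb = [150, 160, 170, 180, 190]  # hypothetical weights in pounds

n = len(weights_lb)
mean_lb = sum(weights_lb) / n

# Deviation score: how far each weight is from the mean (in pounds).
deviations = [w - mean_lb for w in weights_lb]

# Population variance: average of the squared deviations (in pounds squared).
variance_lb2 = sum(d ** 2 for d in deviations) / n

# Standard deviation: square root of the variance (back in pounds).
std_dev_lb = math.sqrt(variance_lb2)

print(deviations)
print(variance_lb2, std_dev_lb)
```

The deviations always sum to zero, which is why they are squared before averaging; taking the square root afterwards returns the result to the original unit of measure.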
Computational Formula

• When calculating variance, it is often easier to use a computational
formula which is algebraically equivalent to the definitional formula:

$$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} = \frac{\sum (X - \mu)^2}{N}$$

• $\sigma^2$ is the population variance, $X$ is a score, $\mu$ is the population mean,
and $N$ is the number of scores
Computational Formula Example
X        X²        X − μ    (X − μ)²
9        81         2        4
8        64         1        1
6        36        -1        1
5        25        -2        4
8        64         1        1
6        36        -1        1
Σ = 42   Σ = 306   Σ = 0    Σ = 12
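Computational Formula Example Cont..
• Computational formula:

$$\sigma^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} = \frac{306 - \frac{42^2}{6}}{6} = \frac{306 - 294}{6} = \frac{12}{6} = 2$$

• Definitional formula:

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{12}{6} = 2$$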
Variance of a Sample
• Because the sample mean is not a perfect estimate of the population
mean, the formula for the variance of a sample is slightly different
from the formula for the variance of a population:

$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$

• $s^2$ is the sample variance, $X$ is a score, $\bar{X}$ is the sample mean, and $N$ is
the number of scores
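The effect of dividing by N - 1 rather than N can be seen in a short sketch. The six scores are the ones used in the computational formula example, treated here as a sample purely for illustration; `sample_variance` is an illustrative helper, while `statistics.variance` and `statistics.pvariance` are the standard-library functions for the sample and population forms.

```python
import statistics

scores = [9, 8, 6, 5, 8, 6]

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by N - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

by_hand = sample_variance(scores)
from_library = statistics.variance(scores)    # divides by N - 1
population = statistics.pvariance(scores)     # divides by N

print(by_hand, from_library, population)
```

The sample variance is always somewhat larger than the population formula applied to the same scores, and the gap shrinks as N grows.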
Measure of Skew
• Skew is a measure of symmetry in the distribution of scores
[Figure: three distributions labelled Normal (skew = 0), Positive Skew, and Negative Skew]
Measure of Skew Cont..
• The following formula can be used to determine skew:

$$s^3 = \frac{\frac{\sum (X - \bar{X})^3}{N}}{\left(\frac{\sum (X - \bar{X})^2}{N}\right)^{3/2}}$$
Measure of Skew Cont..
• If s³ < 0, then the distribution has a negative skew
• If s³ > 0, then the distribution has a positive skew
• If s³ = 0, then the distribution is symmetrical
• The more different s³ is from 0, the greater the skew in the
distribution
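The sign rules above can be illustrated with a short sketch. It assumes the usual moment-based form of the skew statistic (the average cubed deviation divided by the 3/2 power of the average squared deviation); the function name and the three small data sets are made up for illustration.

```python
def skew(scores):
    """Moment-based skew: m3 / m2 ** 1.5 (assumes the scores are not all equal)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # average squared deviation
    m3 = sum((x - mean) ** 3 for x in scores) / n  # average cubed deviation
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]     # hypothetical data, balanced around 3
right_tail = [1, 1, 1, 1, 9]    # hypothetical data, long tail to the right
left_tail = [1, 9, 9, 9, 9]     # hypothetical data, long tail to the left

print(skew(symmetric), skew(right_tail), skew(left_tail))
```

Cubing keeps the sign of each deviation, so a few scores far above the mean make the statistic positive and a few far below make it negative.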
Statistical data and representation of data
• Statistical data refers to the information collected through various methods, such
as surveys, experiments, or observations. It can be numerical or categorical and
is often used to analyze and make inferences about a population or a
phenomenon. Representing data visually is a crucial aspect of statistical analysis
as it helps in better understanding and communication. Here are some common
types of statistical data and methods of representation:
• Types of Statistical Data:
• Numerical Data (Quantitative): Consists of numerical values and can be
further classified as discrete or continuous. Examples include age, height,
income, and temperature.
• Categorical Data (Qualitative): Represents categories or labels. Examples
include gender, color, and types of cars.
Statistical data and representation of data Cont..
• Methods of Representation:
• 1. Tables:
• Simple and effective way to organize and present data.
• Useful for small datasets and presenting categorical data.
• 2. Charts and Graphs:
• Bar Charts: Suitable for representing categorical data. Bars are used to represent the
frequency or proportion of each category.
• Histograms: Similar to bar charts but used for displaying the distribution of continuous data.
The bars are contiguous.
• Pie Charts: Represents parts of a whole. Useful for displaying the composition of a
categorical variable.
• 3. Line Charts:
• Useful for showing trends and patterns over time. Often used with time-series data.
Statistical data and representation of data Cont..
• 4. Scatter Plots:
• Used to visualize the relationship between two numerical variables. Each point represents an
observation.
• 5. Box Plots (Box-and-Whisker Plots):
• Displays the distribution of a dataset and highlights the central tendency, spread, and outliers.
• 6. Frequency Distributions:
• Tables or graphs that show the frequency of different values or ranges in a dataset.
• 7. Measures of Central Tendency:
• Mean: Average value of a dataset.
• Median: Middle value when the data is arranged in ascending order.
• Mode: Most frequently occurring value.
• 8. Measures of Dispersion:
• Range: The difference between the maximum and minimum values.
• Variance and Standard Deviation: Indicate the spread or dispersion of data around the mean.
Statistical data and representation of data Cont..
• 9. Correlation Coefficient:
• Measures the strength and direction of the linear relationship between two numerical
variables.
• 10. Regression Analysis:
• Examines the relationship between one dependent variable and one or more independent
variables.
• These methods of representation help in summarizing, analyzing, and
interpreting data for better decision-making and communication.
Choosing the appropriate method depends on the nature of the data
and the insights you want to convey.
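Two of the methods listed above, a frequency distribution and the correlation coefficient, can be sketched with the Python standard library. The scores are the ten values from the range example; the `hours` and `marks` lists and the `pearson_r` helper are hypothetical and serve only to show a perfect linear relationship.

```python
import math
from collections import Counter

# Frequency distribution: how often each value occurs.
scores = [4, 8, 1, 6, 6, 2, 9, 3, 6, 9]
frequencies = Counter(scores)
for value in sorted(frequencies):
    # A simple text bar stands in for a bar chart.
    print(value, "#" * frequencies[value])

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]     # hypothetical data
marks = [2, 4, 6, 8, 10]    # hypothetical data, exactly 2 * hours
print(pearson_r(hours, marks))
print(pearson_r(hours, marks[::-1]))
```

A coefficient near +1 or -1 indicates a strong linear relationship in the positive or negative direction, and a value near 0 indicates little linear relationship.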
