Statistics For Data Analysis
Dr.S.Saudia,
Assistant Professor,
CITE, M.S.University
Objectives of the Chapter.
• Sample: Part of the population from which information is collected. Information is usually not
collected from the entire population for analysis.
A statistic is a known numerical summary of a sample of the population. Eg: The
proportion of diabetic patients in a sample who often feel fatigue.
Types of Statistics[1],[4]
• Descriptive Statistics
• The Statistical methods used for summarizing and describing information in data.
• It includes construction of graphs, charts and tables, calculation of various descriptive measures:
average, measures of variation and percentiles.
Eg: For the event of tossing a die, descriptive statistics can summarize the frequencies of the outcomes.
• Inferential Statistics
Inferential statistics is done on the information obtained from Descriptive Statistics. An inference is made on
the population based on information obtained from the sample.
Eg: For the event of tossing a die, inferential statistics is verifying whether the die is fair or not.
• Parameter and Statistics
The statistic values obtained from a sample are used to infer the unknown parameters of the population.
The objective of statistical methods is to make inferences about the population from an analysis of the information
contained in sample data.
• Variable
A variable is a characteristic that varies from person to person or from object to object in a population. Eg: height,
weight.
Variables can be Quantitative or Qualitative.
Nominal Scale: the categories of the qualitative variables are on an unordered scale.
The categories are only named; there is no order among them.
It cannot be said that one category is greater/ lesser than the other.
Eg: Gender, Marital Status.
Ordinal Scale: the categories of the qualitative variables are on an ordered scale.
The difference between the categories need not be the same, or it is unknown.
Eg: Education rating, Temperature, Strength of opinion polls, customer satisfaction (from very satisfied, more satisfied,
satisfied, dissatisfied etc.), Pain Intensity etc.
Interval Scale: the values of the quantitative variables are on an ordered scale and the difference
between the interval values is known and is equal.
The value zero in an interval scale is arbitrary, which means that zero is just another measurement and does not
indicate the absence of the quantity.
Eg: Temperature in degree Celsius (0 degree Celsius does not mean that there is no temperature), Time.
Ratio Scale: the values of the quantitative variables are on an ordered scale with equal differences, and the value
zero is a true zero.
Eg: Temperature in Kelvin (0 Kelvin is an absolute zero), height, weight etc.
Decision Tree to decide on the levels/ scale of measurement [4]
height [7]
Statistical Tests and Measurements possible on different scales [4]
At the highest level/ scale of data representation, which is the numeric ratio level, all the measurements can be
done on the data-set.
There are different techniques of sampling. The choice depends mainly upon the resources available, like the
time and money needed to conduct the study, and also upon the nature of the population.
The sampling technique should select an unbiased, representative sample. Otherwise the sample will produce
sampling error.
Eg: If the study is on the African and European population, the sample should have an
equal number of representatives of Africans and Europeans to be unbiased.
• Simple Random Sampling
This sampling technique selects objects at random, based on random numbers, as
the sample of a population. The method produces an unbiased and representative sample. Every object of the
population is equally likely to be a member of the sample.
• Convenience Sampling
Eg: The samples are from a shopping mall or from someone’s friends/ relatives/ people passing by.
The latest items produced by a manufacturing industry are selected to be a sample.
It produces heavily biased (sometimes with self selection bias, when only people who are interested in the problem
participate in the survey) and unrepresentative samples.
• Systematic Sampling
It is easier to carry out than Simple Random Sampling and produces an approximately random sample.
However, if there is a pattern in the data, Systematic Sampling commonly selects a certain type of object more
often than others.
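Simple Random Sampling and Systematic Sampling can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the population of 100 numbered objects and the rule "random start, then every step-th object" for Systematic Sampling are assumptions made for the example and are not taken from the source.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct objects; every object is equally likely to be chosen."""
    rng = random.Random(seed)          # seed only makes the example repeatable
    return rng.sample(population, k)

def systematic_sample(population, k, seed=None):
    """Pick a random start, then take every step-th object (assumed rule)."""
    rng = random.Random(seed)
    step = len(population) // k        # sampling interval
    start = rng.randrange(step)        # random start inside the first interval
    return [population[start + i * step] for i in range(k)]

# Hypothetical population: objects numbered 1 to 100.
population = list(range(1, 101))
print(simple_random_sample(population, 10, seed=1))
print(systematic_sample(population, 10, seed=1))
```

If the numbered objects follow a repeating pattern whose period matches the step, the systematic sample picks the same type of object every time, which is the weakness noted above.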
• Cluster Sampling
If the clusters chosen are not sufficiently different from each other, the selected sample can be
biased or unrepresentative of the population.
• Stratified Sampling
Eg: If the study covers the whole of the population, then the different nationalities, or other
characteristics like age, occupation, culture etc., each correspond to a stratum.
Qualitative variable:
The number of observations falling in a particular class of a qualitative variable is the frequency/count of that class.
4. relative frequency distribution-a table listing all the classes and their relative frequencies. Sample
size decides the credibility of relative frequency. The relative
frequencies add up to 1.
5. cumulative frequency-sum of frequencies up to a particular class. It gives the frequency
above or below a reference level.
7. pie chart: The qualitative data is graphically represented as a pie-chart. It is a disc divided
into a number of pieces depending upon the frequencies of the classes. The
angle for a class is obtained by multiplying the relative frequency by 360 degrees.
Nominal data are represented by pie-charts.
8. bar graph: A bar graph is also a graphical representation where the classes are displayed
on the horizontal/ vertical axis and the frequencies are displayed on the
vertical/ horizontal axis. Accordingly they are called vertical bar graph and
horizontal bar graph respectively.
O O A B A O A A A O B O B O O A O O A A A A AB A B A A O O A
O O A A A O A O O AB
Cumulative frequency calculation is more significant for ordinal and quantitative data.
Exercise: Write the R code for finding the frequency, relative frequency and cumulative
frequency of the blood group data
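The exercise asks for R code; as a hedged illustration of the same calculation, the sketch below uses Python instead. The function name and the alphabetical ordering of the classes are assumptions made for the example. It also prints the pie-chart angle of each class as relative frequency times 360 degrees.

```python
from collections import Counter
from itertools import accumulate

# Blood group observations copied from the data set above (40 values).
blood_groups = (
    "O O A B A O A A A O B O B O O A O O A A A A AB A B A A O O A "
    "O O A A A O A O O AB"
).split()

def frequency_table(observations):
    """Return rows of (class, frequency, relative frequency, cumulative frequency)."""
    counts = Counter(observations)
    classes = sorted(counts)                 # alphabetical order: an arbitrary choice
    n = len(observations)
    freqs = [counts[c] for c in classes]
    cumulative = list(accumulate(freqs))     # running total of the frequencies
    return [(c, f, f / n, cf) for c, f, cf in zip(classes, freqs, cumulative)]

for cls, freq, rel, cum in frequency_table(blood_groups):
    # Pie-chart angle of a class = relative frequency x 360 degrees.
    print(f"{cls:>2}  f={freq:2d}  rel={rel:.3f}  cum={cum:2d}  angle={rel * 360:.1f}")
```

Since blood group is a nominal variable, the cumulative column depends on the chosen class order and is shown only to illustrate the calculation.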
Pie Chart and Bar graph [1]
Exercise: Write the R code to plot the pie-chart of the blood group data
Quantitative variable:
When the quantitative data are few in number, their frequencies are calculated as for qualitative variables.
If the number of data values is quite large, the data are grouped into classes before calculating the frequency
distribution.
Generally 5-15 class intervals are chosen. Percent, Cumulative Frequency, Relative Frequency etc. hold good
for quantitative data just as for qualitative data.
1. histogram distribution- histogram for a grouped data displays the frequency or the relative
frequencies of each class interval. It is a bar graph of the grouped data.
Exercise: Write the R code for finding the frequency, relative frequency and cumulative frequency
of the data and plot the histogram of the frequencies
2. Ogive: A cumulative frequency can be visualized using a curve called an Ogive. Ogives can be
plotted against the upper or the lower limits of their class intervals and accordingly they are called less than
Ogive or greater than Ogive.
For the given distribution the cumulative frequency curve/ Ogive is shown below.
Frequency Table, Frequency distribution and Ogive for the population whose age is recorded [14]
Points to Remember:
The frequency distribution of the population is called a population distribution and that of a sample is called a
sample distribution.
As the sample size increases, the sample relative frequency gets closer to that of the population and the picture
of the population becomes clearer.
Also as the sample size increases, the histogram of the frequency distribution becomes smoother/
more continuous as can be seen below.
Histograms in order for sample sizes 100, 2000 and the whole population [1]
Points to Remember:
A summary of the population can be made by looking at the shape of the distribution curve.
As shown below for the bell and U shaped distributions, the populations are very clearly different.
The distributions can also be skewed to one direction, left skewed or right skewed as shown below. These are
asymmetrical distributions.
Distributions skewed to the right and Distribution skewed to the left [1]
3. Measures of central values [1]
That single value/measure which can be representative (a typical value) of the whole population is called the
measure of central tendency. It is the middle of a population.
1. Mean
2. Median
3. Mode
Mean and median are measures of central tendency for quantitative data only, whereas mode can also be used as a
measure of central tendency for qualitative data.
For a symmetric population (normal distribution), these values are close to each other. Also if the values in the
population are almost the same, these measures will be the same.
For an asymmetrical or skewed population, these values are different from each other as shown in the figure above.
Measures of Central Tendencies :
Mode of a qualitative or discrete quantitative variable is the value of the variable with highest frequency.
The mode can be easily determined from the frequency distribution table or graph.
The mode in the frequency distribution table 6 [1] is A. This means that A is the most common blood group.
When the data is large, continuous and divided into classes, the mode is a mode class which is the class interval with
the highest frequency.
Median of a quantitative variable is that central value which divides the ordered values of the variable set into a set
less than the median and a set greater than the median. The measure demands a data set which can be ordered.
For n ordered values, the median is the value at position
$\left\lfloor \frac{n+1}{2} \right\rfloor$ or $\left\lceil \frac{n+1}{2} \right\rceil$.
When n is odd the two positions coincide; when n is even the median is taken as the average of the values at the two positions.
The median shows how many values lie below the central measure and how many lie above it. Both numbers are equal.
The measure of median is not affected by the outlier values (extreme values) in the data set.
Consider the data set below, which gives the prices of some cottages for sale in an area.
The median ($137,500) of this data set lies among the normal values ($125,000, $127,000, $135,000,
$140,000, $148,000, $150,000) in the data set and is not affected by the extreme values ($110,000 and
$380,000).
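The effect of the extreme values can be checked numerically. The following is a minimal Python sketch; it assumes that the cottage data set consists of exactly the eight prices quoted in the text, since the original table is not reproduced here.

```python
from statistics import mean, median

# Assumed data set: the eight cottage prices quoted in the text.
prices = [110_000, 125_000, 127_000, 135_000, 140_000, 148_000, 150_000, 380_000]

# Drop the smallest and the largest price (the two extreme values).
without_extremes = sorted(prices)[1:-1]

print("median, all prices:       ", median(prices))
print("median, extremes removed: ", median(without_extremes))
print("mean, all prices:         ", mean(prices))
print("mean, extremes removed:   ", mean(without_extremes))
```

Under this assumption the median stays the same when the extreme values are removed, while the mean changes noticeably.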
Thus,
• Median is the best measure for asymmetrical data / for a skewed distribution or when the distribution is not
normal.
• It is not affected by all the values in the dataset and so is reliable when there are outliers.
• Good for data in ratio and interval scale.
Mean/ Arithmetic Mean
Mean of a quantitative variable is that common central value which is the sum of all the observations of the variable divided
by the total number of observations.
When n is the number of items in the data set, the mean, $\bar{x}$, is
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
The mean value calculated is influenced by the extreme values or outliers.
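Geometric Mean of a quantitative variable is that value which is the nth root of the product of all the n observations.
The geometric mean is the average of a set of products.
It is commonly used to average rates, such as interest rates.
$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$
or
$GM = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$
where $x_1, x_2, \ldots, x_n$ are the n observations.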
Exercise
Why Geometric Mean over Arithmetic Mean [21]
Eg: To find the average interest rate. Consider that an amount of $100 is invested. If the interest is 10% in the
first year, 20% in the second year and 30% in the third year, then after three years
the amount drawn shall be
=100(1+r1)(1+r2)(1+r3)
=100(1+.10)(1+.20)(1+.30)= $171.6
The amount drawn using the arithmetic mean rate of interest (20%) is 100(1+.2)(1+.2)(1+.2)= $172.8
This amount is more than the actual amount.
So another mean rate, r, is to be determined which gives the correct amount drawn at the end:
$(1+r)^3 = (1+r_1)(1+r_2)(1+r_3)$
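Solving this relation for r gives the geometric mean of the growth factors minus 1. A minimal Python sketch of the calculation follows; the function name `mean_growth_rate` is hypothetical and chosen only for this illustration.

```python
import math

def mean_growth_rate(rates):
    """Mean rate r such that (1 + r)**n equals the product of the (1 + r_i)."""
    factors = [1 + r for r in rates]
    return math.prod(factors) ** (1 / len(factors)) - 1

rates = [0.10, 0.20, 0.30]                   # yearly interest rates from the example
r = mean_growth_rate(rates)
amount = 100 * (1 + r) ** len(rates)         # amount drawn using the mean rate r
arithmetic_amount = 100 * (1 + sum(rates) / len(rates)) ** len(rates)

print(f"mean rate r = {r:.4%}")
print(f"amount with r = {amount:.2f}, with arithmetic mean rate = {arithmetic_amount:.2f}")
```

The same function can be called with the five rates of the exercise that follows.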
If $x_1, x_2, \ldots, x_n$ have frequencies $f_1, f_2, \ldots, f_n$, then the geometric mean is
$GM = \left( x_1^{f_1} x_2^{f_2} \cdots x_n^{f_n} \right)^{1/N}$, where $N = f_1 + f_2 + \cdots + f_n$.
Exercise:
Write the R code for finding the mean compound interest if the interests for the first
five years are 10%,20%, 30%, 40%, 50%.
Harmonic Mean [21]-[22]
Harmonic Mean of a quantitative variable is the reciprocal of the arithmetic mean of the reciprocals of the observations.
$HM = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n}\frac{1}{x_i}}$
Consider a situation where caps are ordered in bulk under three price categories ($12, $16 and $15 per cap), for
order values of $240, $160 and $300 respectively.
If the average price for a cap is to be fixed, the arithmetic mean (12+16+15)/3 will not work out.
• This is because the number of observations of the xi values ($12, $16, $15) is not equal to 3. It is
240/12 for the 1st cap
160/16 for the 2nd cap
300/15 for the 3rd cap
• Also the sum of observations is not equal to $12+$16+$15. It is
12x240/12 for the 1st cap + 16x160/16 for the 2nd cap + 15x300/15 for the 3rd cap
So the sum of observations is 12x240/12 + 16x160/16 + 15x300/15 = 240 + 160 + 300 = W1+W2+W3
and the average price is (W1+W2+W3)/(W1/x1 + W2/x2 + W3/x3) = 700/50 = $14, which is the weighted harmonic mean of the prices.
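The caps example can be checked with a short calculation. The sketch below is a minimal Python illustration; it assumes that 240, 160 and 300 are the amounts W1, W2 and W3 spent in the three price categories, as the expressions 240/12, 160/16 and 300/15 suggest, and the function name is hypothetical.

```python
def weighted_harmonic_mean(prices, amounts):
    """Average price per item = total amount spent / total number of items bought."""
    quantities = [w / x for x, w in zip(prices, amounts)]   # items per category
    return sum(amounts) / sum(quantities)

prices = [12, 16, 15]        # price per cap in each category
amounts = [240, 160, 300]    # assumed amounts W1, W2, W3 spent per category

average_price = weighted_harmonic_mean(prices, amounts)
plain_mean = sum(prices) / len(prices)

print(f"weighted harmonic mean = {average_price:.2f}")
print(f"plain arithmetic mean  = {plain_mean:.2f}")
```

The plain arithmetic mean of the three prices ignores how many caps were bought at each price, which is why it differs from the weighted harmonic mean.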
The number of observations (n/4) smaller than Q1 is the same as the number lying between Q1 and Q2, or between Q2 and
Q3, or larger than Q3.
For continuous observations, one quarter of the observations are smaller than Q1, two-quarters are smaller than Q2 and
three quarters are smaller than Q3.
So Q1, Q2 and Q3 are the values corresponding to cumulative frequencies, n/4, 2n/4, 3n/4 respectively for a grouped data
set.
Quartiles [24]
Percentile
Percentiles are those values of the variable which divide the total number of observations into 100 equal parts.
The first percentile, P1, is that value of the variable which divides the bottom 1% of the values from the top 99%.
The second percentile, P2, is that value of the variable which divides the bottom 2% of the values from the top 98%.
Percentile [25]
Deciles
Deciles are those values of the variable which divide the total number of observations into 10 equal parts.
The first decile, D1, is that value of the variable which divides the bottom 10% of the values from the top 90%. It is also the
10th percentile, P10.
Similarly, the second decile, D2, is the 20th percentile, P20, and so on.
Decile [26]
Five number Summary and Box Plot
Five number summary of a variable consists of the minimum, the three quartiles and the maximum, written in increasing order.
It provides information on the center and variation of the variable.
Box plot is based on the five number summary and it gives a graphical display of the center and the variations.
Box plots can be of two types: 1. Box plot and 2. Modified Box plot.
Range: The sample range of a variable is the difference between the maximum and minimum values
of the variable in the dataset.
Range= Max-Min
Range determined for a dataset cannot decrease but can increase when more values are added to the dataset.
Interquartile range : The sample interquartile range, IQR of a variable is the difference between the
first and third quartiles of that variable.
IQR=Q3-Q1
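The five number summary, the range and the IQR can be computed together. The following is a minimal Python sketch that reuses the cottage prices quoted in the median example; the function name is hypothetical, and the "inclusive" quartile method of the standard `statistics` module is only one of several quartile conventions, so Q1 and Q3 may differ slightly from values obtained by other rules.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) of a list of numbers."""
    ordered = sorted(values)
    # "inclusive" interpolates between the observed values; other
    # conventions can give slightly different Q1 and Q3.
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

prices = [110_000, 125_000, 127_000, 135_000, 140_000, 148_000, 150_000, 380_000]
minimum, q1, q2, q3, maximum = five_number_summary(prices)

value_range = maximum - minimum    # Range = Max - Min
iqr = q3 - q1                      # IQR = Q3 - Q1

print("five number summary:", minimum, q1, q2, q3, maximum)
print("range:", value_range, " IQR:", iqr)
```

For this data the range is dominated by the single extreme price, while the IQR describes the spread of the middle half of the values.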
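Standard Deviation: the square root of the mean of the squared deviations of the observations from their mean.
$S_x = \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
Here mean/ average is used as the standard. The value is never negative.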
The mean of all squared deviations between the observations and the mean of the observations is called Sample Variance,
$S_x^2$.
$S_x^2 = \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
For normal distribution/ symmetric bell shaped population, it is empirically determined that:
68% of the values lie within $\bar{x} \pm \sigma$
95% of the values lie within $\bar{x} \pm 2\sigma$
99.7% of the values lie within $\bar{x} \pm 3\sigma$
Standard deviation fluctuates less than other measures of dispersion when moving from sample to
sample.
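The variance and standard deviation, with n as the divisor as in the formulas of this section, can be sketched in Python as below. The eight data values and the function names are made up for illustration and do not come from the source. The last function counts the fraction of values within k standard deviations of the mean, so that the 68%, 95% and 99.7% figures can be compared against any data set.

```python
import math

def variance(values):
    """Mean of the squared deviations from the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))

def fraction_within(values, k):
    """Fraction of the values lying within k standard deviations of the mean."""
    mean = sum(values) / len(values)
    sd = standard_deviation(values)
    inside = [x for x in values if abs(x - mean) <= k * sd]
    return len(inside) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical illustrative values

print("variance:", variance(data))
print("standard deviation:", standard_deviation(data))
for k in (1, 2, 3):
    print(f"within {k} sd:", fraction_within(data, k))
```

This small data set is not normally distributed, so its fractions are not expected to match the percentages quoted for the bell shaped population.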
Mean Deviation/ Mean Absolute Deviation [22]:
Mean Deviation can be defined as the average of the absolute deviations of observations from
the mean/ any other specified value of the variable.
Mean Deviation about A $= \frac{1}{n}\sum_{i=1}^{n} |x_i - A|$
Calculate the Mean Deviation of the following data about the median: 8,15,53,49,19,62,7,15,95,77
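A minimal Python sketch of this calculation is given below as one possible way of checking a hand-worked answer; the function name `mean_deviation` is hypothetical.

```python
from statistics import median

def mean_deviation(values, about):
    """Average of the absolute deviations of the values from the point `about`."""
    return sum(abs(x - about) for x in values) / len(values)

data = [8, 15, 53, 49, 19, 62, 7, 15, 95, 77]

center = median(data)                   # ten values: average of the 5th and 6th ordered values
md = mean_deviation(data, center)

print("median:", center)
print("mean deviation about the median:", md)
```

Passing the mean, or any other specified value A, as the second argument gives the mean deviation about that value instead.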
Moments: Moments about any arbitrary constant A are defined as
$\mu_1' = \frac{1}{n}\sum (x - A)$ is the 1st moment
$\mu_2' = \frac{1}{n}\sum (x - A)^2$ is the 2nd moment
$\mu_3' = \frac{1}{n}\sum (x - A)^3$ is the 3rd moment
The moments about zero are called Raw Moments
and the moments about the mean are called Central Moments.
$\mu_1' = \frac{1}{n}\sum x$ is the 1st moment about 0
$\mu_2' = \frac{1}{n}\sum x^2$ is the 2nd moment about 0
$\mu_3' = \frac{1}{n}\sum x^3$ is the 3rd moment about 0
$\mu_1 = \frac{1}{n}\sum (x - \bar{x})$ is the 1st moment about $\bar{x}$
$\mu_2 = \frac{1}{n}\sum (x - \bar{x})^2$ is the 2nd moment about $\bar{x}$
$\mu_3 = \frac{1}{n}\sum (x - \bar{x})^3$ is the 3rd moment about $\bar{x}$
$\mu_4 = \frac{1}{n}\sum (x - \bar{x})^4$ is the 4th moment about $\bar{x}$
Moments are used to describe the basic peculiarities of the data from its frequency distribution:
• measure of the central tendency is given by the 1st raw moment
• measure of dispersion is given by the 2nd moment about mean
• symmetry/ skewness of the curve is given by the 3rd moment about mean
• kurtosis is given by the 4th moment about mean.
Moments [29]
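Raw and central moments differ only in the point about which the deviations are taken, so one function can cover both. The following is a minimal Python sketch; the function names and the eight data values are hypothetical and chosen only for illustration.

```python
def moment(values, order, about=0.0):
    """Moment of the given order about the constant `about` (default 0: raw moment)."""
    return sum((x - about) ** order for x in values) / len(values)

def central_moment(values, order):
    """Moment of the given order about the mean (the 1st raw moment)."""
    return moment(values, order, about=moment(values, 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical illustrative values

print("1st raw moment (mean):         ", moment(data, 1))
print("2nd central moment (variance): ", central_moment(data, 2))
print("3rd central moment:            ", central_moment(data, 3))
print("4th central moment:            ", central_moment(data, 4))
```

The first central moment is always zero, because the deviations from the mean cancel out.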
Skewness [31] : Skewness measures the extent of asymmetry of a dataset. It also indicates the direction
of variation of the dataset from the mean.
It can be Positive Skewness or Negative Skewness depending upon whether the distribution is skewed to the right or the left
of the dataset’s mean respectively.
The skewness indicates whether the values of the dataset are spread more above or below the mean.
From the above figure, note the positions of Mean, Median and Mode. The difference between the Mean and Mode can
give the measure and direction of Skewness.
Measure of Skewness: To find the extent of asymmetry and to give the direction (positive or negative)
Pearson’s first measure:
Skewness $= \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$
Skewness is positive when Mean is larger than Median and Mode and vice-versa.
Pearson’s second measure:
Skewness $= \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$
Bowley’s Measure:
Skewness $= \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)}$
Where Q1, Q2, Q3 are the three quartiles. Q2 is the median. For a positively skewed distribution, Q3 will be farther
from Q2 than Q1 is, and vice-versa for a negatively skewed distribution.
Bowley’s formula is used to calculate skewness when the dataset is a grouped dataset similar to the one mentioned
in the exercise in slide 70 where the mode, mean and standard deviation are difficult to calculate.
Moment Measure:
Skewness $(\gamma_1) = \frac{m_3}{\sigma^3}$
Where m3 is the third moment about the mean and σ is the Standard deviation. For a symmetric distribution, for each
positive value of (xi-mean) there is a matching negative value. When the deviations are cubed, positive values retain
their positive sign and negative values retain their negative sign, so they cancel and m3 will be zero.
But for positive skews, large positive values of (xi-mean) are magnified considerably when cubed, making m3 positive,
and vice-versa for negative skews.
Thus positive, negative and zero values of γ1 correspond to positively skewed, negatively skewed and symmetrical curves.
Kurtosis [22]: Two distributions may have the same measures for central tendency, dispersion and
skewness but the concentration of the values around the mode can be different. This concentration is called Kurtosis.
It describes the shape of a dataset’s distribution, that is, the peakedness or flatness of the distribution as against the
normal distribution.
According to the peak, the distribution can be a Leptokurtic curve with a high peak, a Mesokurtic or
normal curve with a normal peak, or a Platykurtic curve with a flat peak as shown in the figure below.
Kurtosis [31]
Measure of Kurtosis:
Kurtosis $(\gamma_2) = \frac{m_4}{\sigma^4} - 3 = \beta_2 - 3$
Where m4 is the fourth moment about the mean and σ is the Standard deviation. For a symmetric distribution, for each
positive value of (xi-mean) there is a negative value.
Thus positive, negative and zero values of γ 2 correspond to leptokurtic, platykurtic and mesokurtic curves.
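The moment measures of skewness and kurtosis can be computed directly from the central moments. The sketch below is a minimal Python illustration; the function names and both data sets are hypothetical, and the standard deviation uses the divisor n as in the formulas of this section.

```python
import math

def central_moment(values, order):
    """Moment of the given order about the mean (divisor n)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)

def moment_skewness(values):
    """gamma_1 = m3 / sigma^3."""
    sigma = math.sqrt(central_moment(values, 2))
    return central_moment(values, 3) / sigma ** 3

def moment_kurtosis(values):
    """gamma_2 = m4 / sigma^4 - 3."""
    sigma = math.sqrt(central_moment(values, 2))
    return central_moment(values, 4) / sigma ** 4 - 3

skewed = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical values with a longer right tail
symmetric = [1, 2, 3, 4, 5]           # hypothetical symmetric values

print("gamma_1 (skewed):   ", moment_skewness(skewed))
print("gamma_2 (skewed):   ", moment_kurtosis(skewed))
print("gamma_1 (symmetric):", moment_skewness(symmetric))
```

A positive γ1 for the first data set reflects its longer right tail, while the symmetric data set gives γ1 equal to zero.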
Thus
A representative value of a dataset is given by the measures of Central Tendency.
The direction of the distribution, or the presence of data to the left or right of the mean,
is given by Skewness.
SALES          0-20   20-50   50-100   100-250   250-500   500-1000
Firms number   20     50      69       30        25        19
Clues for Solution.
There are three formulae for calculation of Skewness:
Pearson’s first measure which uses Mode, Mean and Standard Deviation.
Pearson’s second measure which uses Mean, Median and Standard Deviation.
And Bowley’s Measurement which uses the three quartiles.
The Mean, Median, Mode and Standard deviation involved in the first two formulae are difficult to calculate for the above
grouped dataset and so the Skewness for this dataset is calculated using Bowley’s Measurement.
Here the data items are in groups (0-20, 20-50, 50-100,……, 500-1000), so for finding the Quartiles use the
cumulative frequency values of the data items,
where n is the total number of items (here the total number of firms =213) and the data items are sales values
from 0-20, 20-50,…
SALES                  0-20   20-50   50-100   100-250   250-500   500-1000
Firms number           20     50      69       30        25        19
Cumulative frequency   20     70      139      169       194       213
Q1= value corresponding to the (n/4)th cumulative frequency = (213/4)th value = 53rd sales value (from the table, 39 approximately)
Q2= value corresponding to the (n/2)th cumulative frequency = (213/2)th value = 106th sales value (from the table, 76 approximately)
Q3= value corresponding to the (3n/4)th cumulative frequency = (3x213/4)th value = 159th sales value (from the table, 203
approximately)
Use these Q1, Q2, Q3 values in Bowley’s Measurement and find the Skewness measurement.
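The steps above can be sketched in Python as follows. This is a hedged illustration and not the source's own solution: it assumes that a quartile is located by linear interpolation inside the class that contains the required cumulative frequency, and it uses the unrounded positions n/4, n/2 and 3n/4, so its quartiles come out slightly above the approximate values 39, 76 and 203. The function name is hypothetical.

```python
class_limits = [(0, 20), (20, 50), (50, 100), (100, 250), (250, 500), (500, 1000)]
frequencies = [20, 50, 69, 30, 25, 19]

def grouped_quantile(limits, freqs, position):
    """Value at a cumulative-frequency position, interpolating linearly inside its class."""
    below = 0                                   # cumulative frequency below the current class
    for (lower, upper), f in zip(limits, freqs):
        if position <= below + f:
            return lower + (position - below) / f * (upper - lower)
        below += f
    raise ValueError("position exceeds the total frequency")

n = sum(frequencies)                            # total number of firms
q1, q2, q3 = (grouped_quantile(class_limits, frequencies, k * n / 4) for k in (1, 2, 3))

# Bowley's measure of skewness from the three quartiles.
bowley = ((q3 - q2) - (q2 - q1)) / ((q3 - q2) + (q2 - q1))

print(f"n = {n}, Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}")
print(f"Bowley's skewness = {bowley:.3f}")
```

The positive value indicates a positively skewed distribution of sales, which agrees with Q3 lying much farther from Q2 than Q1 does.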
References:
[1] ‘The Nature of Statistics’, Agresti and Finlay, Johnson and Bhattacharya, Weiss, Anderson and Sclove and Freud
[2] www.quora.com
[3] eople.revoledu.com
[4] ww.youtube.com/watch?v=LPHYPXBK_ks
[5] http://www.uth.tmc.edu/uth_org
[5] http://coolcosmos.ipac.caltech.edu
[6] https://www.socialresearchmethods.net
[7] http://www.clipartpanda.com/
[8] http://www.statisticshowto.com
[9] https://www.youtube.com/watch?v=be9e-Q-jC-0
[10] http://grocery88.ml/lowa/convenience-sampling
[11] https://faculty.elgin.edu/
[13] ducation-savvy.blogspot.com
[14] tudy.com/academy/lesson/definition
[15] https://www.youtube.com/watch?v=0ZKtsUkrgFQ
[16] http://www.lightbulbbooks.com
[17] http://www.picquery.com/mode
[18] ww.dsource.in/resource/elephant
[19] https://www.youtube.com/watch?v=PKWVAIP17pw
[20] https://www.youtube.com/watch?v=trS95t3rs8Q
[21] https://www.youtube.com/watch?v=HFuSLTQ1Izc&t=541
[22] ‘Statistical Methods’, N. G. Das, McGraw Hill Companies.
[23] ttps://www.youtube.com/watch?v=ZfHXdIFS-mQ
[24] https://onlinecourses.science.psu.edu/stat100/node/11
[25] http://www.psychometric-success.com
[26] https://stackoverflow.com
[27] https://en.wikipedia.org
[28] http://www.mathsisfun.com
[29] http://www.sigmetrix.com
[30] https://www.slideshare.net
[31] https://www.youtube.com/watch?v=1da4auXziT8
[32] http://www.mathcaptain.com