
HEALTH STATISTICS

PRINCIPLES OF SECONDARY DATA ANALYSIS


SOURCES OF SECONDARY DATA
ADVANTAGES OF SECONDARY DATA
DISADVANTAGES OF SECONDARY DATA
FACTORS TO BE CONSIDERED FOR SECONDARY DATA
SECONDARY DATA ANALYSIS
STEPS IN SECONDARY DATA ANALYSIS
EVALUATION PROCESS OF SECONDARY DATA ANALYSIS
EVALUATING THE QUALITY OF SECONDARY DATA ANALYSIS
IMPORTANCE OF CLEANING DATA
• Before analysing data it is important to clean it.
• Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset. When
combining multiple data sources, there are many opportunities for data to be
duplicated or mislabelled.
• By identifying errors or corruptions, correcting or deleting them, or manually reprocessing the data as needed, you prevent the same errors from recurring and ensure the data are correct, consistent and usable.
• Exploratory data analysis helps find outliers and inaccuracies in the data.
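As a minimal sketch of these cleaning steps (assuming a small, hypothetical dataset; the column names are illustrative only and not from the slides), the snippet below uses pandas to drop duplicated records, coerce badly formatted entries, and harmonise inconsistent labels:

```python
import pandas as pd

# Hypothetical extract combined from two sources; column names are illustrative only.
raw = pd.DataFrame({
    "patient_id": [101, 102, 102, 103, 104],
    "age":        ["34", "41", "41", "forty", None],   # mixed formats and a bad entry
    "sex":        ["F", "m", "m", "M", "F"],            # inconsistent labels
})

clean = raw.drop_duplicates().copy()                         # remove duplicated records
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")  # bad entries become NaN
clean["sex"] = clean["sex"].str.upper()                      # harmonise label formatting

print(clean)
```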
FILTER UNWANTED OUTLIERS

• Often there will be one-off observations that, at a glance, do not appear to fit within the data you are analysing.
• If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will improve the quality of the analysis you are working on.
• However, sometimes it is the appearance of an outlier that will prove a theory you
are working on.
• Remember: just because an outlier exists, doesn’t mean it is incorrect. This step is
needed to determine the validity of that number.
• If an outlier proves to be irrelevant for analysis or is a mistake, consider removing it.
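One common rule of thumb (not from the slides) for flagging candidate outliers is the 1.5 × IQR fence. The sketch below applies it to a hypothetical set of incubation periods containing one implausible value; it only flags the point, leaving it to the analyst to decide whether it is a data-entry error or a genuine observation:

```python
import statistics

# Hypothetical incubation periods (days) with one implausible entry.
values = [24, 25, 29, 29, 30, 31, 90]

q1, _, q3 = statistics.quantiles(values, n=4)   # quartiles, "exclusive" method
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = [v for v in values if v < low or v > high]
print("fences:", (low, high), "flagged for review:", flagged)   # flags 90
```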
HANDLING MISSING DATA
• You can't ignore missing data, because many algorithms will not accept missing values. There are several ways to deal with missing data; none is optimal, but each can be considered (a brief sketch follows this list).
• As a first option, you can drop observations that have missing values, but doing this loses information, so be mindful of this before you remove them.
• As a second option, you can impute missing values based on other observations; again, there is a risk of losing integrity of the data, because you may be operating from assumptions rather than actual observations.
• As a third option, you might alter the way the data is used to effectively navigate
null values.
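A brief sketch of the first two options using pandas (hypothetical column names; median imputation is just one of many possible choices):

```python
import pandas as pd

# Hypothetical dataset with gaps; column names are illustrative only.
df = pd.DataFrame({"age": [34, None, 41, 29], "weight_kg": [70, 65, None, 80]})

# Option 1: drop observations with missing values (information is lost).
dropped = df.dropna()

# Option 2: impute missing values from the other observations, here the column median
# (this rests on assumptions, so the choice should be documented).
imputed = df.fillna(df.median(numeric_only=True))

# Option 3 (altering how the data are used) depends on the downstream method and is not shown.
print(dropped, imputed, sep="\n\n")
```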
• At the end of the data cleaning process, you should be able to answer these questions as a
part of basic validation:
• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory, or bring any insight to light?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?
SECONDARY DATA ANALYSIS
TYPES OF SECONDARY DATA ANALYSIS
SDA
  DESCRIPTIVE
    Summary statistics
      Measures of central tendency (e.g. mean, median, mode)
      Measures of dispersion (e.g. range, IQR, variance)
    Exploratory statistics
      Distributions (e.g. normal distribution)
  ANALYTIC
    Hypothesis testing
      Inferences
      Causality
DESCRIPTIVE ANALYSIS

• EXPLORATORY STATISTICS, E.G. EXAMINING DISTRIBUTIONS SUCH AS THE NORMAL DISTRIBUTION


A) MEASURES OF CENTRAL TENDENCY
• Three measures are frequently used to provide a “typical value” for a given continuous variable in a specific population.
Quick definitions:
Mode
• The most frequently occurring score
Median
• The mid-point of a set of ordered scores
Mean
• The result of dividing the arithmetic sum of scores by the number of scores
MODE

• The mode of a distribution is the value that is observed most frequently in a given data set (rarely used).
- There may be no mode - when ?
- there may be more than one mode - when ?
- can be misinterpreted (is a distribution skewed or
bimodal ?).
- not very amenable to statistical tests.
CALCULATING THE MODE

To compute the mode:


• Arrange the data in sequence from low to high
• Count the number of times each value appears
• The most frequently appearing value is the mode
EXAMPLE: FINDING THE MODE
• Annual salary
2, 2, 3, 3, 3, 3, 4, 4, 7, 8
• The mode is 3

• Incubation period for 6 hepatitis-affected persons


24, 25, 29, 29, 30, 31
• Mode is 29
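Both examples can be reproduced with Python's built-in statistics module (a small sketch; multimode is the safer call when more than one mode may exist):

```python
import statistics

salaries = [2, 2, 3, 3, 3, 3, 4, 4, 7, 8]
incubation_days = [24, 25, 29, 29, 30, 31]

print(statistics.mode(salaries))               # 3
print(statistics.mode(incubation_days))        # 29
print(statistics.multimode([1, 1, 2, 2, 3]))   # [1, 2] -- a bimodal set has two modes
```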
MEDIAN

The median literally describes the middle of the data. It is defined as the value above or below which half (50%) of the observations fall.
COMPUTING THE MEDIAN
•The number of observations or scores is referred to as "n".
Arrange the scores in order from smallest to largest (ascending order)
Count the number of scores (determine n)

 if n is an odd number, then


• Median = the ((n + 1)/2)th observation

For example, consider the observations


8, 25, 7, 5, 8, 3, 10, 12, 9
Arranged in order, the observations are
3, 5, 7, 8, 8, 9, 10, 12, 25

In this case, n=9 (an odd number); therefore, the median is the (9+1)/2 = 5th observation, which is 8.

•If n is an even number, then


• Median = the average of the (n/2)th and ((n/2) + 1)th observations
COMPUTING THE MEDIAN (EVEN NUMBER OF OBSERVATIONS)
• For another example, consider the observations
11 , 7 , 10 , 9 , 15 , 13 ,

• Arranged in order, the observations are


7 , 9 , 10 , 11 , 13 , 15

• In this case, n=6 (an even number); therefore, the median is:


• The average of the (n/2)th and ((n/2) + 1)th observations
• The average of the 3rd and 4th observations
= (10+11)/2
= 10.5
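A short sketch reproducing both worked examples, first with the statistics module and then with the rank rule described above written out explicitly:

```python
import statistics

odd = [8, 25, 7, 5, 8, 3, 10, 12, 9]    # n = 9: median is the 5th ordered value
even = [11, 7, 10, 9, 15, 13]           # n = 6: median averages the 3rd and 4th values

print(statistics.median(odd))    # 8
print(statistics.median(even))   # 10.5

def median_by_rank(data):
    """Median via the (n+1)/2 rank rule for odd n, averaging the middle pair for even n."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_rank(odd), median_by_rank(even))   # 8 10.5
```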
MEDIAN

• The advantage of this measure is that it is unaffected by


extreme values !
• The disadvantage is that it is selected by its rank and does not
contain information on the other values in the distribution.
• It is also less amenable than the mean to statistical tests.
MEAN
• Most commonly used as a measure of location.
• It is calculated by adding all the observed values and dividing by the total sample size:
each observation is noted as $x_i$
the total number of observations as $n$
the summation process by sigma, $\sum$
the mean itself is expressed as $\bar{x}$, so that $\bar{x} = \frac{\sum x_i}{n}$
METHOD OF COMPUTING THE MEAN
• To compute the mean:
Count the number of scores (determine “n”)
Determine the sum of the scores by adding them
Divide the sum by “n”

• For example, consider the observations


8 , 25 , 7 , 5 , 8 , 3 , 10 , 12 , 9

• In this case, n=9 and the sum=87; therefore, the mean


= 87 / 9
= 9.67

• For another example, consider the observations


8 , 45 , 7 , 5 , 8 , 3 , 10 , 12 , 9

• In this case, n=9 and the sum=107; therefore, the mean


= 107 / 9
= 11.89
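The same two calculations in a short Python sketch:

```python
import statistics

sample_a = [8, 25, 7, 5, 8, 3, 10, 12, 9]
sample_b = [8, 45, 7, 5, 8, 3, 10, 12, 9]

print(round(sum(sample_a) / len(sample_a), 2))   # 87 / 9  = 9.67
print(round(statistics.mean(sample_b), 2))       # 107 / 9 = 11.89
```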
MEAN CONT….
• The mean has a lot of good theoretical properties and is used as the basis of many statistical tests.
• For a symmetrical distribution the mean is a good summary statistic.
• It is less useful for an asymmetric distribution.

Q. What is its limitation as a summary statistic in asymmetrical distributions?


A. It can be distorted by outliers, therefore giving a poor “typical” value.
• Imagine weight in kgs in a sample population of 5 people

50, 60, 50, 40, 120

• The mean is calculated as 64 kilos.


• Is this value of 64 kilos “typical” for the observations?
CHOOSING MEASURES OF CENTRAL TENDENCY

• Depends on the nature of the distribution

• For continuous variables in a unimodal and symmetric distribution the


mean, median and mode are identical.

• With a skewed distribution the median may be more useful

• For statistical analyses the mean is the preferred measure.


B) MEASURES OF DISPERSION
CONCEPT OF DISPERSION

• The taller the curve, the less the dispersion


• The flatter the curve, the more the dispersion
OBJECTIVES OF MEASURING DISPERSION

• To determine the reliability of an


average.
• To compare the variability of two or
more series.
• For facilitating the use of other statistical
measures.
• Basis of Statistical Quality Control.
ABSOLUTE MEASURES
COEFFICIENT OF RANGE

FORMULA: $CR = \frac{X_L - X_S}{X_L + X_S}$, where $X_L$ is the largest and $X_S$ the smallest observation.
INTERQUARTILE RANGE & QUARTILE
DEVIATION
INTERQUARTILE RANGE
• Percentiles: those values in a series of observations, arranged in ascending order of magnitude, which divide the distribution into 100 equal parts (thus the median is the 50th percentile).
• Quartiles: the values which divide a series of observations, arranged in
ascending order, into 4 equal parts. (Thus the 2nd quartile is the median).
• The interquartile range represents the central portion of the distribution and is
calculated as the difference between the third quartile and the first quartile.
This range includes about one-half of the observations in the set, leaving one
quarter of the observations on each side.
MEDIAN AND QUARTILES
MEASURE OF DATA VARIABILITY
Interquartile range
• The difference between the score representing the 75th percentile and the score
representing the 25th percentile
• Arrange observation in ascending order
• Find the position for Q1 and Q3
• Identify values and the inter-quartile range = Q3 - Q1
• Example: 29 , 31 , 24 , 29 , 30 , 25
Arrange: 24 , 25 , 29 , 29, 30 , 31

Q1 = value at position (n+1)/4 = 1.75
Q1 = 24 + 0.75 = 24.75
Q3 = value at position 3(n+1)/4 = 5.25
Q3 = 30 + 0.25 = 30.25
Q3 – Q1 = 30.25 – 24.75 = 5.5
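The same example with statistics.quantiles, whose default "exclusive" method matches the (n+1)/4 positional rule used above (a sketch):

```python
import statistics

data = [29, 31, 24, 29, 30, 25]

q1, _, q3 = statistics.quantiles(data, n=4)        # [24.75, 29.0, 30.25]
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)    # IQR = 5.5
```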
IQR

Advantages:
•Easy to calculate and understand
•Can be calculated even for open ended distributions
•Not affected by change of origin
Disadvantages:
•Not based on all observations in the data, as only the 1st and 3rd quartiles are used in the calculation
•Further mathematical treatment is not possible
•So, quartile deviations improve upon the simple range in multiple ways
STANDARD DEVIATION
• Standard deviation is the widely used measure of dispersion and
together with arithmetic mean, is most commonly used to describe
shape and scale of data distributions.
• Standard deviation is defined as the square root of the average of the squared differences between the observations and the arithmetic mean.
• It is also known as the “root mean squared deviation”.
• In mathematical notations,
STANDARD DEVIATION: MATHEMATICAL NOTATION

$$SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

$$SD = \sqrt{\frac{n \sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}}$$
VARIANCE
•The most frequent and most informative measure is the VARIANCE and its related functions
•It is another measure of dispersion
•It is the square of standard deviation
•The variance is computed in stages:

1. Calculate the mean as a measure of central location (mean)


2. Calculate the difference between each observation and the mean (DEVIATION)
$(x_i - \bar{x})$
3. Next square the differences (SQUARED DEVIATION)
$(x_i - \bar{x})^2$

Q. What is the effect of this ?


- Negative and positive deviations will not cancel each other out.
- Values further from the mean have a bigger impact.
VARIANCE COMPUTED STAGES

4. Sum up these squared deviations (SUM OF THE SQUARED DEVIATIONS)

$\sum (x_i - \bar{x})^2$

5. Divide this SUM OF THE SQUARED DEVIATIONS by the total number of observations minus 1 (n-1) to give the VARIANCE

$$\text{Variance} = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
• This is a measure of the variability of the data
• Why divide by n - 1 ?
• This is an adjustment for the fact that the mean is just an estimate of the true population mean.
• It tends to make the variance bigger.
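A sketch of these five stages in Python, using the same ten scores as the standard-deviation table that follows, checked against statistics.variance:

```python
import statistics

data = [13, 12, 13, 14, 10, 16, 15, 24, 20, 18]

mean = sum(data) / len(data)                   # stage 1: mean = 15.5
squared_dev = [(x - mean) ** 2 for x in data]  # stages 2-4: squared deviations and their sum
variance = sum(squared_dev) / (len(data) - 1)  # stage 5: divide the sum by n - 1

print(round(variance, 2))                   # 17.39
print(round(statistics.variance(data), 2))  # same result from the library
```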
STANDARD DEVIATION

Score (x)   Mean (x̄)   Deviation (x − x̄)   Squared deviation (x − x̄)²

13          15.5        -2.5                  6.25
12          15.5        -3.5                 12.25
13          15.5        -2.5                  6.25
14          15.5        -1.5                  2.25
10          15.5        -5.5                 30.25
16          15.5         0.5                  0.25
15          15.5        -0.5                  0.25
24          15.5         8.5                 72.25
20          15.5         4.5                 20.25
18          15.5         2.5                  6.25

Σx = 155, n = 10, x̄ = 15.5                  Σ(x − x̄)² = 156.5
CALCULATING STANDARD DEVIATION

$$SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{156.5}{9}} = 4.17$$

Let's use the computational formula…
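A quick sketch confirming that the definitional and computational formulas give the same result (4.17) for the scores in the table above:

```python
import math

data = [13, 12, 13, 14, 10, 16, 15, 24, 20, 18]
n = len(data)

# Definitional formula: sqrt of the sum of squared deviations over (n - 1).
mean = sum(data) / n
sd_def = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Computational formula: sqrt((n * sum(x^2) - (sum(x))^2) / (n * (n - 1))).
sd_comp = math.sqrt((n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1)))

print(round(sd_def, 2), round(sd_comp, 2))   # 4.17 4.17
```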
CHOOSING THE MEASURES OF CENTRAL LOCATION AND DISPERSION
THE COEFFICIENT OF VARIATION

• The coefficient of variation (CV) allows us to compare the variation of


two (or more) different variables.
• Explanation of the term – sample coefficient of variation: the sample
coefficient of variation is defined as the sample standard deviation
divided by the sample mean of the data set.
• Usually, the result is expressed as a percentage.
THE COEFFICIENT OF VARIATION CONT…..

$$\text{Sample } CV = \frac{s}{\bar{x}} \times 100\%$$

NOTE: The sample coefficient of variation


standardizes the variation by dividing it
by the sample mean.
THE COEFFICIENT OF VARIATION CONT…..

• The coefficient of variation has no units, since the standard deviation and the mean have the same units, which cancel each other out.
• Because of this property, we can use this measure to compare the variations for
different variables with different units.
• Example: the mean number of parking tickets issued in a neighborhood over a
four-month period was 90, and the standard deviation was 5. The average
revenue generated from the tickets was $5,400, and the standard deviation was
$775. Compare the variations of the two variables.
• Solution is on the next slide.
THE COEFFICIENT OF VARIATION CONT…..

The solution:

CV (tickets) = (5 / 90) × 100% ≈ 5.56%
CV (revenue) = (775 / 5,400) × 100% ≈ 14.35%

Since the CV is larger for the revenues, there is more variability in the recorded revenues than in the number of tickets issued.
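The comparison in this worked example amounts to two one-line calculations (a sketch):

```python
tickets_mean, tickets_sd = 90, 5
revenue_mean, revenue_sd = 5400, 775

cv_tickets = tickets_sd / tickets_mean * 100     # ≈ 5.56 %
cv_revenue = revenue_sd / revenue_mean * 100     # ≈ 14.35 %

print(round(cv_tickets, 2), round(cv_revenue, 2))
```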
ANALYTIC STATISTICS

• HYPOTHESIS TESTING
• STATISTICAL MODELS ARE EMPLOYED, E.G. FOR INFERENCE AND CAUSALITY
CROSS SECTIONAL DATA & LONGITUDINAL
DATA
• METHODS USED IN CROSS-SECTIONAL DATA ANALYSIS CANNOT ALWAYS BE USED TO ANALYSE PANEL/LONGITUDINAL DATA
• REASONS:
