
Introduction To Biostatistics

Uploaded by

omarmamluky254
Copyright
© © All Rights Reserved
Introduction to

Biostatistics
Lecture Objectives
◼ Overall: To give a basic understanding
of descriptive statistics

◼ Specific:
– understand the branches of statistics
– understand the different types of data that
can be collected
Statistics
◼ The science of collecting, monitoring,
analyzing, summarizing, and
interpreting data.
– This includes design issues as well.
Branches of Statistics
◼ Descriptive statistics
– Gives numerical and graphic procedures to
summarize a collection of data in a clear and
understandable way.
– Provides summary indices for a given data set, e.g.
arithmetic mean, median, standard deviation,
coefficient of variation, etc.
◼ Inductive (inferential) statistics
– Provides procedures to draw inferences about a
population from a sample

[Figure: a sample drawn from a population; population values are estimated from sample values]


Why do we need biostatistics?
◼ Main reason: handling variation
– Biological variation
• Attributes differ not only among individuals
but also within the same individual over time
• Examples: height, weight, blood pressure,
eye color ...
– Sample variation
• Biomedical research projects are usually
carried out on small numbers of study
subjects
Role of biostatistics in Epidemiology
◼ Epidemiology is the study of the distribution and
determinants of health-related states or events
(including disease), and the application of this study to
the control of diseases and other health problems.
◼ Essential for scientific method of investigation
– Formulate hypothesis
– Design study to objectively test hypothesis
– Collect reliable and unbiased data
– Process and evaluate data rigorously
– Interpret and draw appropriate conclusions
◼ Essential for understanding, appraisal and
critique of scientific literature.
What is Data?
Variable

◼ Any measurable characteristic that
assumes different values for different
subjects, e.g., age, height, hair colour,
gender
◼ Observation of variables on different
subjects gives rise to data
Types of data

◼ Qualitative (categorical) data
– Gender, disease severity
◼ Quantitative (measurement) data
– Age, blood pressure (BP), weight
Categorical Data

◼ The variables being studied are grouped
into categories based on some
qualitative trait.
◼ The resulting data are merely labels or
categories.
Examples: Categorical Data

◼ Hair color
– blonde, brown, red, black, etc.
◼ Opinion of students about riots
– ticked off, neutral, happy
◼ Smoking status
– smoker, non-smoker
Categorical data classified as Nominal,
Ordinal, and/or Binary

[Diagram: categorical data split into nominal data and ordinal data; each of these may be binary or not binary]


Nominal Data

◼ A type of categorical data in which objects fall
into unordered categories. Examples:
◼ Hair color
– blonde, brown, red, black, etc.
◼ Race
– Caucasian, African-American, Asian, etc.
◼ Smoking status
– smoker, non-smoker
Ordinal Data

◼ A type of categorical data in which
order is important. Examples:
◼ Education level
– None, Primary, Post primary
◼ Degree of illness
– none, mild, moderate, severe
Binary Data
◼ A type of categorical data in which there
are only two categories.
◼ Binary data can be either nominal or
ordinal. Examples:
◼ Smoking status
– smoker, non-smoker
◼ Education
– Primary, Post primary
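
The distinction between nominal, ordinal, and binary data can be sketched in a few lines of Python. This is only an illustration: the category lists are taken from the slide examples, while the helper names `is_binary` and `is_more_severe` are hypothetical and not part of the lecture.

```python
# Category lists taken from the slide examples.
HAIR_COLORS = ["blonde", "brown", "red", "black"]        # nominal: no natural order
ILLNESS_LEVELS = ["none", "mild", "moderate", "severe"]  # ordinal: least to most severe


def is_binary(categories):
    """A categorical variable is binary when it has exactly two categories."""
    return len(set(categories)) == 2


def is_more_severe(a, b, levels=ILLNESS_LEVELS):
    """Compare two ordinal values by their position in the ordered level list."""
    return levels.index(a) > levels.index(b)


print(is_binary(["smoker", "non-smoker"]))  # True
print(is_binary(HAIR_COLORS))               # False
print(is_more_severe("severe", "mild"))     # True
```

A comparison such as `is_more_severe` only makes sense for ordinal data; for nominal data such as hair color, the position in the list carries no meaning.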
Measurement Data

◼ The variables being studied are
“measured” based on some
quantitative trait.
◼ The resulting data are a set of numbers.
Measurement data classified as
Discrete or Continuous

[Diagram: measurement data split into discrete and continuous]
Discrete Measurement Data
Only certain values are possible (there
are gaps between the possible values).

Continuous Measurement Data
Theoretically, any value within an interval is possible
with a fine enough measuring device.

[Figure: discrete data shown as separate points at 0, 1, 2, ..., 7, with gaps between possible values]
[Figure: continuous data shown as an unbroken line from 0 to 1000, with theoretically no gaps between possible values]
Discrete Measurement Data
Examples
◼ Number of pregnancies
◼ Number of students late for class
◼ Number of crimes reported
◼ Number of huts in a sampled rural home
◼ CD4 counts

Generally, discrete data are counts.


Continuous Measurement Data
Examples
◼ Cholesterol level
◼ Height
◼ Body weight
◼ BP

Generally, continuous data come from
measurements.
Descriptive Statistics

A first step to summarizing
or describing raw data
What to describe?

◼ What is the “location” or “center” of the
data? (“measures of location”)

◼ How do the data vary? (“measures of
variability”)
Measures of Location
Measures of location indicate where on the
number line the data are to be found.
Common measures of location are:
◼ Mean
◼ Median
◼ Mode
Mean
◼ Another name for average.
◼ Let X1, X2, X3, …, Xn be the realised
values of a variable X, from a sample of
size n. Then the mean is
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]
That is, add up all of the data points and divide
by the number of data points.
Median

◼ Another name for the 50th percentile
(middle value).
◼ Appropriate for describing measurement
data.
◼ “Robust to outliers,” that is, not
affected much by unusual values.
Example

The systolic blood pressures of seven
middle-aged men were as follows:
151, 124, 132, 170, 146, 124 and 113.

The mean is
\[ \bar{X} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = \frac{960}{7} = 137.14 \]
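
As a quick check of the arithmetic, the same mean can be computed in Python. This is a minimal sketch using the seven readings from the example; the variable names are illustrative only.

```python
from statistics import mean

# Systolic blood pressure readings from the example.
systolic_bp = [151, 124, 132, 170, 146, 124, 113]

# Add up all of the data points and divide by the number of data points.
manual_mean = sum(systolic_bp) / len(systolic_bp)

# The standard library gives the same result.
library_mean = mean(systolic_bp)

print(round(manual_mean, 2))   # 137.14
print(round(library_mean, 2))  # 137.14
```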
Median

◼ Also known as the 50th percentile or
simply the middle value
◼ If the sample data are arranged in
increasing order, the median is
(i) the middle value if n is an odd number, or
(ii) midway between the two middle values if
n is an even number
Example 1. Median when n is odd
The reordered systolic blood pressure data
seen earlier are:

113, 124, 124, 132, 146, 151, and 170.

Median=132
Example 2. Median when n is even
Six men with high cholesterol participated in a study to
investigate the effects of diet on cholesterol level. At the
beginning of the study, their cholesterol levels (mg/dL)
were as follows:
366, 327, 274, 292, 274 and 230.
Rearrange the data in numerical order as follows:

230, 274, 274, 292, 327 and 366.

The median is halfway between the middle two readings,
i.e. (274 + 292) / 2 = 283.
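
The odd and even cases can be sketched as one small Python function. This is an illustrative implementation of the rule stated above, applied to the two example data sets; the function name `median` is chosen here for convenience.

```python
def median(values):
    """Middle value of the sorted data; midway between the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


systolic_bp = [151, 124, 132, 170, 146, 124, 113]  # n = 7 (odd)
cholesterol = [366, 327, 274, 292, 274, 230]       # n = 6 (even)

print(median(systolic_bp))  # 132
print(median(cholesterol))  # 283.0
```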
Quartiles
◼ Quartiles: values that divide the distribution of
ordered values into 4 equal-sized parts

| First 25% | Second 25% | Third 25% | Fourth 25% |
            Q1           Q2          Q3

Q1: first quartile
Q2: second quartile = median
Q3: third quartile
Mode

◼ The value that occurs most frequently.
◼ One data set can have many modes.
◼ Appropriate for all types of data, but
most useful for categorical data or
discrete data with only a few
possible values.
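
A short sketch of finding the mode in Python, using `statistics.multimode` so that data sets with more than one mode are handled. The blood pressure readings come from the earlier example; the hair color list is hypothetical data invented for illustration.

```python
from statistics import multimode

# Systolic blood pressure readings from the earlier example.
systolic_bp = [151, 124, 132, 170, 146, 124, 113]

# Hypothetical categorical responses, invented for illustration only.
hair_colors = ["brown", "black", "brown", "red", "black"]

bp_modes = multimode(systolic_bp)    # 124 occurs twice, every other value once
hair_modes = multimode(hair_colors)  # "brown" and "black" both occur twice

print(bp_modes)    # [124]
print(hair_modes)  # ['brown', 'black']
```

The second result shows a data set with two modes, and also that the mode is meaningful for categorical data, where a mean or median is not.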
The most appropriate measure
of location depends on …

the shape of the data’s
distribution.
Most appropriate measure of
location

◼ Depends on whether or not data are
“symmetric” or “skewed”.
◼ Depends on whether or not data have
one (“unimodal”) or more
(“multimodal”) modes.
Choosing Appropriate Measure of
Location
◼ If data are symmetric, the mean,
median, and mode will be approximately
the same.
◼ If data are multimodal, report the mean,
median and/or mode for each
subgroup.
◼ If data are skewed, report the median.
Mean versus Median
◼ Large sample values tend to inflate the
mean. This will happen if the
histogram of the data is right-skewed.
◼ The median is not influenced by large
sample values and is a better measure
of centrality if the distribution is
skewed.
Mean versus Median
◼ Median is less sensitive to extreme
values

      Original   With extreme value
x1    87         87
x2    95         95
x3    98         98
x4    101        101
x5    105.0      1050

Median is unchanged (98 in both columns).
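
The effect of a single extreme value can be checked numerically. This is a minimal sketch using the five values from the table; the list names are illustrative.

```python
from statistics import mean, median

original = [87, 95, 98, 101, 105]
with_extreme = [87, 95, 98, 101, 1050]  # last value replaced by an extreme one

print(mean(original), median(original))          # 97.2 98
print(mean(with_extreme), median(with_extreme))  # 286.2 98
```

The mean nearly triples, while the median stays at 98.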
Measures of Variation
◼ Summarize the dispersion of individual
values from some central value like the
mean
◼ Measures of dispersion characterise how
spread out the distribution is, i.e., how variable
the data are.
[Figure: individual values (x) scattered around the mean]
Indices of Variation
◼ Commonly used measures of
dispersion include:
– Range
– Variance & standard deviation
– Inter-quartile range (IQR)
– Coefficient of Variation (or
relative standard deviation)
Range

◼ R = largest observation - smallest observation
or, equivalently,
R = x_max - x_min
or, at times, presented as the pair
R = (x_min, x_max)
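
Both ways of reporting the range can be illustrated with the systolic blood pressure readings from the earlier example. This is a minimal sketch; the variable names are illustrative.

```python
# Systolic blood pressure readings from the earlier example.
systolic_bp = [151, 124, 132, 170, 146, 124, 113]

smallest, largest = min(systolic_bp), max(systolic_bp)

range_as_difference = largest - smallest  # R = x_max - x_min
range_as_pair = (smallest, largest)       # R = (x_min, x_max)

print(range_as_difference)  # 57
print(range_as_pair)        # (113, 170)
```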
Inter-quartile Range
◼ IQR = third quartile - first quartile
or, equivalently
IQR = Q3 - Q1
Q1 =lower quartile (has 25% of data
below and 75% above)
Q3=upper quartile (has 75% of data
below and 25% above)
IQR: Example

◼ Consider the ages of 8 patients:
18, 21, 23, 24, 24, 32, 42, 59
Q1 = 22, Q3 = 37
IQR = 37 - 22 = 15
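
The slides do not say which convention is used to compute the quartiles. The sketch below assumes the "median of each half" convention, which reproduces Q1 = 22 and Q3 = 37 for this example; other conventions (for instance the defaults of `statistics.quantiles`) give slightly different values. The function name `quartiles_by_halves` is hypothetical.

```python
from statistics import median


def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    Assumed convention: when n is odd, the middle value is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)


ages = [18, 21, 23, 24, 24, 32, 42, 59]
q1, q3 = quartiles_by_halves(ages)
iqr = q3 - q1

print(q1, q3, iqr)  # 22.0 37.0 15.0
```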
Variance
◼ Variance of a population: average of the
squared deviations from the mean
\[ \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} \]
◼ Variance of a sample: usually subtract 1
from n in the denominator
\[ \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]
where n - 1 is the effective sample size, also called the degrees of freedom.
Standard deviation
◼ Problem with variance: its unit of measurement
is awkward because the values are squared
◼ Solution: take the square root of the variance
=> standard deviation
◼ Sample standard deviation (s or sd)
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
What is a standard deviation?
◼ It is the typical (standard) difference
(deviation) of an observation from the mean
◼ Think of it as the average distance of a data
point from the mean, although this is not
strictly true
Example
Data       Deviation    Deviation²
151          13.86        192.02
124         -13.14        172.73
132          -5.14         26.45
170          32.86       1079.59
146           8.86         78.45
124         -13.14        172.73
113         -24.14        582.88
Sum = 960.0  Sum = 0.00   Sum = 2304.86
Mean = 137.14
Example (contd.)
\[ \sum_{i=1}^{7} (x_i - \bar{x})^2 = 2304.86 \]
Therefore,
\[ s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6 \]
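
The worked example can be reproduced step by step in Python. This is a minimal sketch that follows the definitions above (divide by n for the population variance, by n - 1 for the sample variance); the variable names are illustrative.

```python
from statistics import stdev, variance

# Systolic blood pressure readings from the example.
systolic_bp = [151, 124, 132, 170, 146, 124, 113]

n = len(systolic_bp)
mean_bp = sum(systolic_bp) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean_bp) ** 2 for x in systolic_bp)

population_variance = sum_sq_dev / n    # divide by n
sample_variance = sum_sq_dev / (n - 1)  # divide by n - 1 (degrees of freedom)
sample_sd = sample_variance ** 0.5      # square root of the sample variance

print(round(sum_sq_dev, 2))  # 2304.86
print(round(sample_sd, 1))   # 19.6

# The standard library functions use the same n - 1 denominator.
print(round(stdev(systolic_bp), 1))  # 19.6
```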
Standard deviation
◼ Caution must be exercised when using
standard deviation as a comparative index of
dispersion

Weights of newborn elephants (kg):
929, 853, 878, 939, 895, 972, 937, 841, 801, 826
n = 10, mean = 887.1, sd = 56.50

Weights of newborn mice (kg):
0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89
n = 10, mean = 0.68, sd = 0.255

It is incorrect to say that elephants show greater
variation in birth weight than mice merely because
their standard deviation is higher.
Coefficient of variation
◼ The coefficient of variation (CV) expresses the
standard deviation relative to its mean
\[ cv = \frac{s}{\bar{X}} \]
\[ cv_{\text{elephants}} = 0.0637 \]
\[ cv_{\text{mice}} = 0.375 \]

Mice show greater birth-weight variation
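
The comparison can be checked from the raw birth weights listed on the earlier slide. This is a minimal sketch; the function name `coefficient_of_variation` is illustrative. Note that the unrounded calculation gives about 0.378 for mice, whereas the slide value of 0.375 comes from dividing by the rounded mean 0.68.

```python
from statistics import mean, stdev

# Birth weights (kg) from the earlier slide.
elephants_kg = [929, 853, 878, 939, 895, 972, 937, 841, 801, 826]
mice_kg = [0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    return stdev(values) / mean(values)


cv_elephants = coefficient_of_variation(elephants_kg)
cv_mice = coefficient_of_variation(mice_kg)

print(round(cv_elephants, 4))  # 0.0637
print(round(cv_mice, 3))       # 0.378
```

Although the elephants have the far larger standard deviation, their CV is much smaller, which is the point of the comparison.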
Measures of Variation -
Some Comments
◼ When comparison groups have very
different means, CV is suitable because it
expresses the standard deviation
relative to its corresponding mean
◼ When different units of measurement
are involved, e.g. group 1 is measured in mm
and group 2 in g, CV is suitable
for comparison because it is unit-free
◼ In such cases, standard deviation
should not be used for comparison.
Measures of Variation -
Some Comments
◼ Range is the simplest, but is very
sensitive to outliers
◼ Variance units are the square of the
original units
◼ Interquartile range is mainly used with
skewed data (or data with outliers)
◼ Standard deviation is the most
commonly used measure of variation.
Outliers
◼ An outlier is an observation which does not
appear to belong with the other data
◼ Outliers can arise because of a
measurement or recording error or because
of equipment failure during an experiment,
etc.
◼ An outlier might be indicative of a sub-
population, e.g. an abnormally low or high
value in a medical test could indicate
presence of an illness in the patient.
Q&A

◼ Thank you for your attention!
