Statistics Notes
Statistics: a broad subject, with applications in a vast number of different fields. Generally
speaking, it is the methodology of collecting, analysing, interpreting and drawing conclusions from
information.
Variable: any characteristic that varies from one individual member of the population to
another. Examples are height, weight, age, marital status, etc.
Types of variables
Quantitative variable: yields numerical information. Examples are age, height, etc. It is further
divided into Discrete and Continuous.
Discrete variable: can take only specified values, often integers; no intermediate values
are possible. Examples are number of children, number of students taking an exam, etc.
Continuous variable: the data are not restricted to specified values; fractional
values are possible. Examples are height, weight, etc. Weight, for instance, can be measured
accurately to the tenth of a gram.
Qualitative variable: yields non-numerical information. Examples are marital status, sex, etc.
Tables – the simplest means of summarising a set of observations; they can be used to represent all
types of data.
Frequency distribution
o The number of observations that fall into a particular class of a qualitative variable
is called its frequency. A table listing all the classes and their frequencies is called a
frequency distribution.
o For Nominal and Ordinal data, a frequency distribution consists of a set of classes or
categories with the numerical count that corresponds to each one.
o To display Interval or Ratio data, the data must be broken down into a series of distinct,
non-overlapping intervals called class intervals (CI).
o If there are too many class intervals, the summary is not too much of an
improvement over raw data. If there are too few class intervals then a great deal of
information is lost.
o Usually class intervals are constructed so that all have equal width; this facilitates
comparison between the classes.
o Once the upper and lower limits for each class interval are selected, the number of
values that fall within that pair of limits is counted and the result is arranged as a
table.
Relative frequency
o This gives the proportion of values that fall into a given interval of the frequency
distribution.
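As a quick illustration, a frequency and relative frequency distribution for a small qualitative data set can be sketched in Python; the blood-group values below are hypothetical:

```python
from collections import Counter

# Hypothetical qualitative data: blood groups of 12 patients
blood_groups = ["A", "O", "B", "O", "AB", "A", "O", "A", "B", "O", "A", "O"]

freq = Counter(blood_groups)                         # frequency distribution
total = sum(freq.values())
rel_freq = {k: v / total for k, v in freq.items()}   # relative frequency (proportion)

for group in sorted(freq):
    print(f"{group:>2}: frequency={freq[group]}, relative={rel_freq[group]:.2f}")
```

The relative frequencies sum to 1, which is why they are useful for comparing samples of different sizes.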
The scale gives a certain structure to the variable and also defines its meaning.
There are four types of scales: Nominal, Ordinal, Interval and Ratio.
Nominal:
o Simplest type of data.
o Values fall in unordered categories or classes.
o Nominal refers to the fact that the categories are merely names.
o Used to represent Qualitative data.
o Ex include gender- male & female, blood groups- A, B, AB & O
o If the data has only two distinct values, it is called Binary/Dichotomous.
Ordinal:
o If nominal categories can additionally be put into an order, the data are called Ordinal.
o Used to represent Qualitative data.
o Ex include depression as mild, moderate & severe.
o Here a natural order exists among the groups, but the difference between the
groups is not necessarily equal.
Interval:
o Here the data can be placed in a meaningful order, and there are equal differences
between the groups.
o Ratios of the measurements cannot be taken; that is, there is no absolute zero.
o Used to compare Quantitative data.
o Ex include temperature measured in centigrade, and time of day (even though 0000 hrs
exists, it does not mean "no time").
Ratio:
o Here there are comparable differences between the values, as well as an
absolute zero.
o Ratio of the measurements also can be done.
o Used to measure Quantitative data.
o Ex include temperature measured in Kelvin.
Sampling
What do you understand by sample size in a research study? What are the statistical
methods used to determine the sample size? (5+5)
Introduction- Planning a study with a sample size calculation is an important aspect. It is not
possible to study the whole population in any study. Thus a set of participants who
represent the population is taken, and the study is conducted on them. This set is the
sample; the results obtained on the sample are extrapolated to the whole population.
Definitions-
Sample- a subset of a population on which statistical studies are made in order to
draw conclusions about the population.
Every individual in the population should have an equal chance of being included in the
sample.
The choice of one participant should not affect the chance of another's selection.
Problems associated with not estimating sample size - Choosing a sample size arbitrarily gives
rise to problems such as:
1. Ethical problems- too many patients are given inferior treatment, which is not
ethically correct.
2. Scientific problems- a negative study requires repetition of the research to firmly
prove the point.
3. Financial problems- extra costs are involved in both too-small and too-large studies.
Level of significance ("p" value):- Prior to the start of the study an acceptable value is
set. Usually it is set at p < 0.05, meaning we are ready to accept that 5% of positive
results arise by chance; in other words, a 5% false-positive rate is accepted.
Power of the study (1 − β, where β is the type II error):- A β error occurs when we fail to
detect a difference when there actually is a difference; this is a false negative. Power
must be decided before the study starts. Most studies accept a power of 80% (β = 20%),
meaning that one in five real differences will be missed. For some important studies power
is set at 90%, to reduce the possibility of false negatives to 10%.
Expected effect size:- the difference between the value of the variable in the control
group and that in the test group. It can be estimated from previously reported or
preclinical studies.
Underlying event rate in the population:- Estimated from the previous studies.
Standard deviation in the population:- the measure of dispersion or variability in
the data. It is easy to understand that we require a smaller sample if the population
is more homogeneous and therefore has a smaller variance and standard deviation.
The mean difference should be at least 1.96 (approx. 2) SEMs distant from 0 to reach
statistical significance.
To calculate the sample size, we need to know the statistical power, the mean difference,
and the variance in the data estimated as SD or SEM.
The relationship between these elements can be expressed with the power index,
(Zα + Zβ)², as: n = (SD / mean difference)² × (Zα + Zβ)².
Zα marks a place on the Z line: if α is defined as 5% (two-sided), the area under the
curve (AUC) in each tail beyond this place is 2.5%, so it must be 1.96 SEMs distant
from 0. So Zα = 1.96, or approx. 2.0.
Zβ: with β defined as 20%, the AUC to the right of this place is 20% of the total AUC.
This place is approx. 0.84 SEMs distant from 0, so Zβ ≈ 0.84.
So the power index can be calculated as (Zα + Zβ)² = (1.96 + 0.84)² ≈ 7.8. So for a
continuous variable with α = 5% and power = 1 − β = 80%, the required sample size is
n = (SD / mean difference)² × 7.8, from the formula above.
For parallel group studies with 2 groups, larger sample sizes are required. Each group
produces its own mean and standard deviation (SD), and the SDs are pooled.
The equations for continuous data and for proportions are very similar; for parallel
group studies comparing two proportions, the SDs are again pooled.
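A minimal Python sketch of this calculation, assuming a two-sided α, equal group sizes, and the continuous-outcome formula above (the function name and example numbers are illustrative, not from any real study):

```python
import math
from statistics import NormalDist

def sample_size_continuous(sd, mean_diff, alpha=0.05, power=0.80, groups=2):
    # Z values from the standard normal: Z_alpha ~ 1.96 (two-sided 5%),
    # Z_beta ~ 0.84 (power 80%), so the power index is ~7.8 as in the notes.
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    power_index = (z_alpha + z_beta) ** 2
    # n per group = groups * (SD / mean difference)^2 * power index
    n = groups * (sd / mean_diff) ** 2 * power_index
    return math.ceil(n)

# e.g. detecting a mean difference of 5 with SD 10, alpha 5%, power 80%:
n_per_group = sample_size_continuous(sd=10, mean_diff=5)
```

With these hypothetical numbers the formula gives about 63 participants per group; dedicated software (or the formulas in the cited references) should be used for a real study.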
References-
1. Kadam P, Bhalerao S. Sample size calculation. Int J Ayurveda Res. 2010 Jan-Mar;
1(1):55-57.
2. Cleophas TJ, Zwinderman AH. Statistics Applied to Clinical Studies. 5th ed.
Springer.
In any distribution, the majority of the observations pile up, or cluster, in a particular
region. This is referred to as the central tendency of the distribution.
It is thus a statistical measure that determines a single score defining the centre of the
distribution.
It makes a large amount of information easily understandable.
There are three types of central tendency- Mean, Median & Mode
Mean and Median can be applied only to Quantitative data, whereas Mode can be used with
either Quantitative or Qualitative data.
Mean-
Most frequently used measure of central tendency.
It is the sum of all the observations divided by the total number of observations.
It is a stable average based on all observations.
Calculated for Quantitative data measured on an interval/ratio scale.
It is affected by extreme values (outliers), because it is extremely sensitive to
unusual values.
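The sensitivity of the mean to outliers, and the robustness of the median, can be seen with a small hypothetical data set:

```python
from statistics import mean, median

ages = [24, 25, 26, 27, 28]        # hypothetical, roughly symmetric ages
print(mean(ages), median(ages))    # both are 26

ages_with_outlier = ages + [90]    # add one extreme value (outlier)
print(mean(ages_with_outlier))     # mean jumps to ~36.7
print(median(ages_with_outlier))   # median barely moves: 26.5
```

One extreme value drags the mean far from the bulk of the data, while the median stays near the centre; this is why the median is preferred for skewed data.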
Reliability
Definition: Reliability refers to the ability of a measurement instrument to produce the same
result on repeated measurement.
Types of reliability:
1. Scorer/ Inter-rater reliability:
It refers to the ability of a measurement instrument to produce the same result
when administered by two different raters.
It is the probability that 2 raters (i) will give the same score to a given answer,
(ii) rate a given behaviour in the same way, and (iii) add up the scores properly.
Scorer reliability should be near perfect.
2. Test-Retest reliability:
It assesses the ability of a measurement to arrive at the same result for the
same subject on repeated administrations.
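Test-retest reliability is commonly quantified by correlating scores from the two administrations; a high correlation indicates stable measurement. A minimal sketch, with entirely hypothetical anxiety scores for 6 subjects:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    # Pearson correlation coefficient, used here as a simple
    # test-retest reliability index (1.0 = perfectly stable scores)
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical scores on two administrations, one week apart
test1 = [10, 14, 18, 22, 26, 30]
test2 = [11, 13, 19, 21, 27, 29]
r = pearson_r(test1, test2)
print(f"test-retest reliability r = {r:.3f}")
```

Here r is close to 1, suggesting the instrument gives nearly the same ranking of subjects on repeat administration.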
Validity
Definition: the degree to which a test measures what it is supposed to measure is known as
validity.
Types of validity:
1. Face validity:
It refers to whether the test seems sensible to the person completing it, i.e.
does it appear to measure what it is meant to measure.
2. Content validity:
It refers to the degree to which the test measures all the aspects of the item
that is being assessed.
For example, test for depression should have questions asking depressive
symptoms.
3. Concurrent validity:
It refers to the degree to which a test agrees with an established measure of the
same construct administered at the same point in time, i.e. high scorers on the
new test should also score high on the established measure.
Normal distribution
Also called the Bell-shaped curve or Gaussian distribution, after the well-known
mathematician Carl Friedrich Gauss.
It is the most common and widely used continuous distribution
Bell shaped curve can be obtained by compiling data into a frequency table and graphing it
in a histogram.
The normal distribution is easy to work with mathematically. In many practical cases, the
methods developed using normal theory work well even when the distribution is only
approximately normal.
Standard normal distribution (Z distribution), is used to find probabilities and percentiles for
regular normal distributions. It serves as a standard by which all other normal distributions
are measured.
It is a normal distribution with mean “0” and standard deviation of “1”.
Properties of the standard normal curve
o Its shape is symmetric.
o The area under the curve is greatest in the middle, where there is a hump, and it
thins out towards the tails.
o A normal distribution has mean denoted by µ (mu) and standard deviation denoted by
σ (sigma); for the standard normal, µ = 0 and σ = 1.
o The total area under the curve is equal to 1.
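These properties can be checked numerically with Python's built-in `statistics.NormalDist` (a small sketch, not a derivation):

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)     # the standard normal (Z) distribution

half = z.cdf(0)                   # symmetry: half the area lies below the mean
central_95 = z.cdf(1.96) - z.cdf(-1.96)   # ~0.95 between -1.96 and +1.96
left_tail = z.cdf(-1.0)           # by symmetry, equals the area beyond +1
right_tail = 1 - z.cdf(1.0)
print(half, central_95)
```

The familiar "1.96" of the 5% significance level is just the Z value that leaves 2.5% of the area in each tail.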
Types of error
The aim of a study is to check whether the data agree with certain predictions. These
predictions are called hypotheses.
Hypotheses arise from the theory that drives the research. They are formulated before
collecting the data.
Hypothesis testing helps to find out whether the variation between two sample distributions
can be explained by random chance alone. Before concluding that two distributions vary in a
meaningful way, precautions must be taken to ensure that the differences are not just due
to random chance.
There are two types of hypothesis- Null (H0) and Alternative hypothesis (H1).
Parametric tests
A parametric test is one that makes assumptions about the parameters (defining properties)
of the population distribution from which one’s data are drawn.
These tests assume that data is normally distributed and rely on group means.
Most well known elementary statistical methods are parametric.
These tests assume more about a given population. When the assumptions are correct, they
produce more accurate and precise estimates.
When the assumptions are not correct, these tests have a greater chance of giving
misleading results, and for this reason they are not robust statistical methods.
Parametric formulae are often simpler to write down and faster to compute, but this
simplicity is no substitute for robustness.
Examples of Parametric tests are t-tests and ANOVA.
Advantages over Non parametric tests
1. Statistical power- parametric tests have more statistical power than non-parametric
tests, and are thus more likely to detect a significant effect when one really exists.
2. Parametric tests perform well when the spread of each group is different- even
though non-parametric tests do not assume that the data follow a normal
distribution, they have other assumptions, e.g. that the data for all groups have the
same spread.
3. Parametric tests can perform well even with skewed and non-normal distributions,
provided the data satisfy some guidelines about sample size.
Student t-test
Developed by WS Gosset, a chemist working for the Guinness brewery in Dublin, Ireland
("Student" was his pen name).
It can be used to determine if two sets of data are significantly different from each other.
A t-test is any statistical hypothesis test in which the test statistic follows a Student's t
distribution if the null hypothesis is supported.
The t-distribution is similar to the standard normal distribution, but it is shorter and
flatter than the standard normal (Z) distribution. As the sample size (degrees of
freedom) increases, the t-distribution approaches the standard normal distribution.
The degrees of freedom are a simple function of the sample size, namely (n − 1).
A particular advantage of the t-test is that it does not require any knowledge of the
population standard deviation. It can therefore be used to test hypotheses about a
completely unknown population, when the only available information about the population
comes from the sample.
All that is required for a hypothesis test with the t-test is a sample and a reasonable
hypothesis about the population mean.
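As a sketch, the independent (two-sample) t statistic with pooled variance can be computed by hand; the data below are hypothetical and the equal-variance assumption is made for simplicity:

```python
import math
from statistics import mean, variance

def independent_t(x, y):
    # Pooled-variance two-sample t statistic (equal variances assumed),
    # with degrees of freedom df = n1 + n2 - 2
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

drug = [5.1, 4.8, 5.6, 5.2, 4.9]      # hypothetical outcome scores
placebo = [4.2, 4.5, 4.0, 4.4, 4.1]
t, df = independent_t(drug, placebo)
print(f"t = {t:.2f}, df = {df}")
```

The resulting t value is then compared against the t-distribution with df degrees of freedom to obtain the p value; in practice a statistics package would report both.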
Types of Student t-test: the one-sample t-test, the paired (dependent samples) t-test, and
the independent (unpaired) two-sample t-test.
ANOVA
Types of ANOVA depend on the number of treatments and the way they are applied to
the subjects in the experiment:
1. One-way ANOVA is used to test for differences among three or more independent groups.
E.g. Group A is given vodka, Group B is given gin, and Group C is given a placebo. All
groups are then tested with a memory task.
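The one-way ANOVA idea can be sketched numerically: the F statistic compares between-group variability to within-group variability (the memory-task scores below are hypothetical):

```python
from statistics import mean

def one_way_anova_F(*groups):
    # F = (between-group mean square) / (within-group mean square)
    k = len(groups)                         # number of groups
    N = sum(len(g) for g in groups)         # total observations
    grand = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

vodka   = [6, 5, 7, 6]    # hypothetical memory-task scores
gin     = [5, 4, 6, 5]
placebo = [8, 9, 8, 7]
F = one_way_anova_F(vodka, gin, placebo)
print(f"F = {F:.2f}")
```

A large F indicates that the group means differ by more than within-group noise would explain; the p value comes from the F distribution with (k − 1, N − k) degrees of freedom.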
Assumptions of ANOVA include independence of the observations, a normally distributed
outcome within each group, and homogeneity of variances across the groups.
Non parametric tests
Many of the tests used for analysis of data presume that the data follow a normal
distribution. When the data do not meet this presumption, Non Parametric tests are used.
These are tests that make no presumption about the distribution of the data.
They work with the Median, which is a much more robust statistic because it is not
affected by outliers.
Advantages of non parametric tests
1. When the area of study is better represented by the median, i.e. the median better
represents the centre of the distribution.
2. When the sample size is small. With a small sample it is not possible to ascertain
the distribution of the data, so parametric tests may lack sufficient power to provide
meaningful results.
3. In the presence of outliers. Parametric tests assess continuous data, and their results
can be significantly affected by outliers.
The use of statistics begins even before the study is started. It starts with designing the
study: what type of study (retrospective/prospective, case-control/cohort, cross-
sectional/follow-up), how to collect the sample (simple/stratified random, convenience), etc.
Sample size calculation is most important before the study is actually started. There are
various methods by which it can be calculated, and various software packages are available
to calculate it.
Once the sample size is obtained, the data are collected and have to be transferred onto an
MS Excel sheet or an SPSS data sheet.
An SPSS data sheet helps in analysing the data easily; even an MS Excel data sheet can be
converted into an SPSS data sheet without much difficulty using the SPSS software.
The data thus obtained are analysed using descriptive and inferential statistics.