Introduction to Statistics_Note
WHAT IS STATISTICS?
Sir Ronald Fisher: Father of modern statistics.
Father of Indian Statistics: Prof. Prasanta Chandra Mahalanobis.
The words ‘Statistics’ and ‘Statistical’ are derived from the Latin word ‘Status’, meaning
political state.
Statistics is a branch of mathematics.
Statistics is defined as a method of collecting, organizing, summarizing, analyzing
and interpreting numerical data.
WHAT IS BIOSTATISTICS?
Father of Biostatistics: Sir Francis Galton.
Biostatistics is the branch of statistics applied to biological or medical sciences.
Biostatistics comprises the statistical methods used in the health sciences, such as
biology, medicine, nursing and public health.
APPLICATION OF BIOSTATISTICS
In medicine
In community medicine & public health
In Physiology & Anatomy
In Pharmacology
In genomics, population genetics and statistical genetics, to link variation in genotype
with variation in phenotype
Biological sequence analysis
Ecology, ecological forecasting
BRANCHES OF STATISTICS
► Statistics can be divided into two branches
1. Descriptive Statistics
Descriptive statistics are used to organize and summarize the data in the form of tables,
graphs and numbers.
2. Inferential Statistics
Statistical methods used to draw conclusions about the whole population from a sample are
called Inferential Statistics. They include methods such as estimation and testing of hypotheses.
(The aim is to reach decisions about a large body of data by examining only a small part of it.
This is done using various tests of hypotheses, measures of the magnitude of association, etc.)
DATA
Data is a collection of information gathered from the population.
OR, a collection of numerical observations of known facts is called data.
Data collection
Data collection is the process of acquiring information from different sources about
the topic under research. It is carried out by the researcher or by his/her team.
SOURCES OF DATA
PRIMARY DATA
Primary data are first-hand data, collected by the researcher for the first time, and
are original in nature.
OR, Primary data refers to the data collected by the researcher, for the very first time,
from different sources, with a particular problem, question or specific purpose in mind.
It is useful for current studies as well as for future studies.
The sources of primary data are primary units such as basic experimental units,
individuals, households.
SECONDARY DATA
Secondary data are second-hand data which have been collected by someone else
and which are readily available from other sources.
OR, Secondary Data is the data collected by any person, organization or agency in the
past through surveys, experiments or study, for some other purpose, but used by the
researcher to deal with the problem at hand.
Secondary data involve less cost, time and effort.
They are usually found in journals, periodicals, research publications, official records, etc.
Comparison of primary and secondary data:
Basis | Primary Data | Secondary Data
Nature | First-hand data, collected by the researcher for the first time and original in nature | Second-hand data, collected by someone else and readily available from other sources
Nature of data | Real-time data | Past data
Form | Pure and raw | Refined
Facts and figures | Freshly collected for the project | Already collected and recorded
Process | Time consuming | Quick and easy
Data | Related to the objective of the researcher | Adjusted or used according to the need
Cost | Expensive | Economical
Information | First-hand | Second-hand
Accuracy & Reliability | More | Comparatively less
Source | Survey, Experiment, Interview, Observation, Questionnaire | Books, Journals, Newspapers, Internal records, Government Publications, Websites, etc.
VARIABLES
A variable is any characteristic, number, or quantity that can be measured or
counted.
Examples: Age, Sex, Caste, Blood pressure, weight, height, etc.
Types of variables:
Variables
    Qualitative variables
    Quantitative variables
        Discrete variables
        Continuous variables
a) Discrete variables assume exact values only and can be obtained by counting. Or a
quantitative variable that can assume a countable number of values.
Example: Total number of students in a class
b) Continuous variables assume infinite values within a specified interval and can be
obtained by measurement. Or a quantitative variable that can assume an uncountable
number of values.
Example: Height of the students.
SCALES OF MEASUREMENT
There are four principal scales used to measure data.
1. Nominal Scale
2. Ordinal Scale
3. Ratio Scale
4. Interval Scale
1. Nominal Scale
Nominal scales are used for labelling variables, without any quantitative value.
“Nominal” scales could simply be called “labels”.
Simply, the data are alphabetic or numerical in name only and do not include any
notion of measurement.
(A nominal scale usually deals with non-numeric variables or with numbers that do not
have any quantitative value.)
2. Ordinal Scale
The ordinal scale defines data that is placed in a specific order. While each value is
ranked, there is no information about the size of the difference between one category and
the next.
These values cannot be meaningfully added or subtracted.
(A qualitative variable that incorporates an ordered position, or ranking. Involves data
that may be arranged in some order.)
3. Ratio Scale
Data classified as the ratio of two numbers.
Quantitative classification.
Zero point of scale is absolute (data can be added, subtracted, multiplied, and divided).
4. Interval Scale
Data classified by ranking.
Quantitative classification.
Zero point of scale is arbitrary (differences are meaningful).
(It is defined as a quantitative measurement scale in which the difference between two
values is meaningful. In other words, the variables are measured in an exact manner,
not in a relative way, and the zero point is arbitrary.)
Example: Haemoglobin classes: 8-10, 10-12, 12-14
Temperature on the Fahrenheit or Celsius scale (on the Celsius scale, the difference
between 20 °C and 30 °C is the same as the difference between 50 °C and 60 °C).
PRESENTATION OF DATA
Data can be represented in countless ways.
The presentation of data means exhibition of the data in such a clear and attractive
manner that these are easily understood and analysed.
The three main forms of presentation of data are:
1. Textual presentation
2. Tabular form
3. Diagrammatic and Graphical presentation
1. TEXTUAL PRESENTATION
Example: Of the 150 respondents interviewed, the following complaints were noted: 27
for lack of books in the library, 25 for a dirty playground, 20 for lack of laboratory
equipment, 17 for poorly maintained university buildings.
2. TABULAR FORM
A method of presenting data using a statistical table.
Tables convey information that has been converted into words or numbers in
rows and columns.
It is a more effective device for presenting data.
1. Stem and leaf plots
2. Frequency distribution table
3. Contingency table
Format of a Table
CONTINGENCY TABLE
A contingency table, also known as a cross-classification table, describes the relationships
between two or more categorical variables.
A table cross-classifying two variables is called a 2-way contingency table and forms a
rectangular table with rows for the R categories of the X variable and columns for the C
categories of a Y variable.
A contingency table having R rows and C columns is called an R x C table.
Example:
Alcohol consumption | Smoke: Yes | Smoke: No
Low                 | 10         | 80
High                | 50         | 40
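The 2 x 2 example above can be explored with a short Python sketch. It stores the four counts from the table and derives the row totals, column totals and the share of smokers at each alcohol consumption level. The variable names are illustrative choices, not part of the notes.

```python
# Counts taken from the example table: rows are alcohol consumption levels,
# columns are smoking status.
table = {
    "Low": {"Yes": 10, "No": 80},
    "High": {"Yes": 50, "No": 40},
}

# Row totals: number of people at each alcohol consumption level.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column totals: number of smokers and non-smokers.
col_totals = {
    col: sum(cells[col] for cells in table.values())
    for col in ("Yes", "No")
}

grand_total = sum(row_totals.values())

# Proportion of smokers within each alcohol consumption level.
smoker_share = {row: table[row]["Yes"] / row_totals[row] for row in table}

print("Row totals:", row_totals)
print("Column totals:", col_totals)
print("Grand total:", grand_total)
for row, share in smoker_share.items():
    print(f"Smokers among {row} consumers: {share:.1%}")
```

Comparing the smoker share across rows is one simple way to see whether the two categorical variables appear to be related.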
HISTOGRAM
Used for quantitative, continuous variables.
It is used to present variables which have no gaps.
It consists of a series of blocks. The class intervals are given along the horizontal axis and the
frequency along the vertical axis.
Differences between a bar diagram and a histogram:
In a histogram no space is left in between two rectangles, but in a bar diagram some space
must be left between two consecutive bars.
We can have a bar diagram both for discrete and continuous variables, but the histogram is
drawn only for a continuous variable.
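As a rough illustration of how a histogram is built, the sketch below counts how many values fall in each class interval and prints a text version of the blocks. The heights, the starting value and the class width are hypothetical values chosen only for this example.

```python
# Hypothetical heights in cm (illustrative data, not from the notes).
heights = [150.2, 152.5, 155.0, 158.3, 160.1, 161.7, 163.4, 165.0, 167.8, 171.2]

# Assumed class intervals: 150-155, 155-160, 160-165, 165-170, 170-175.
lower, width, n_classes = 150, 5, 5

# Count how many values fall in each interval [start, start + width).
frequencies = [0] * n_classes
for h in heights:
    index = int((h - lower) // width)
    frequencies[index] += 1

# Text histogram: consecutive classes share a boundary, so there are no gaps.
for i, f in enumerate(frequencies):
    start = lower + i * width
    print(f"{start}-{start + width}: {'#' * f} ({f})")
```

Each value belongs to exactly one class, so the frequencies add up to the number of observations.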
PIE CHARTS
The “pie chart” is also known as a “circle chart”.
It is a very common way of presenting data.
The value of each category is divided by the total of all values and then multiplied by 360;
each category is then allocated the resulting angle to show its proportion of the whole.
It is often necessary to indicate percentages in the segments, as it is not always
easy to compare the areas of segments visually.
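The angle calculation can be sketched in Python using the complaint counts from the textual presentation example earlier in these notes. The category labels are shortened here for readability, and the total is taken as the sum of the four listed counts, which is an assumption made for this illustration.

```python
# Complaint counts from the textual presentation example in these notes.
complaints = {
    "Lack of books in the library": 27,
    "Dirty playground": 25,
    "Lack of laboratory equipment": 20,
    "Poorly maintained buildings": 17,
}

total = sum(complaints.values())

# Angle of each segment: (value / total) * 360 degrees.
angles = {name: count / total * 360 for name, count in complaints.items()}

# Percentage of each segment, often written inside the slice.
percentages = {name: count / total * 100 for name, count in complaints.items()}

for name in complaints:
    print(f"{name}: {angles[name]:.1f} degrees ({percentages[name]:.1f}%)")
```

The angles always add up to 360 degrees and the percentages to 100, which is a quick check on the arithmetic.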
LINE DIAGRAM
It is the simplest type of diagram. (line graph, line chart)
It is a chart that shows a line joining several points or a line that shows the relation
between the points.
A line graph is used for showing trends over a particular period of time.
The variable is taken on the X-axis and the frequency of the observations on the Y-axis.
Used to illustrate the relationship between continuous quantities.
Used to compare two or more groups.
SCATTER PLOTS
STEM AND LEAF PLOT
• A Stem and Leaf Plot is a special table where each data value is split into a "stem"
(the first digit or digits) and a "leaf" (usually the last digit).
• For example, "32" is split into "3" (stem) and "2" (leaf).
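A stem and leaf plot can be produced with a few lines of Python. The sketch below assumes two-digit values, so the stem is the tens digit and the leaf is the units digit; the scores are hypothetical.

```python
from collections import defaultdict

# Hypothetical two-digit scores (illustrative data, not from the notes).
scores = [32, 35, 41, 47, 47, 50, 58, 63, 32, 44]

# Split each value into stem (tens digit) and leaf (units digit).
stems = defaultdict(list)
for value in sorted(scores):
    stem, leaf = divmod(value, 10)  # e.g. 32 -> stem 3, leaf 2
    stems[stem].append(leaf)

# Print one row per stem, with its leaves in ascending order.
for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Sorting the values first keeps the leaves on each row in ascending order, so the original data can be read back from the plot.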
SUMMARIZING DATA
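Arithmetic mean of ungrouped data:
$$\bar{x} = \frac{\sum x}{n}$$
Here,
Σ represents the addition of values
x represents each value in the data set
x̄ represents the mean of the data set
n represents the number of data values
Arithmetic mean of grouped data:
$$\bar{x} = \frac{\sum f x}{\sum f}$$
where f is the frequency of each value x.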
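Both forms of the arithmetic mean can be checked with a small Python sketch. The data values and frequencies are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
# Ungrouped data (hypothetical values): mean = sum of values / number of values.
values = [4, 8, 6, 5, 7]
mean_ungrouped = sum(values) / len(values)  # 30 / 5

# Grouped data (hypothetical): each value x occurs with frequency f.
x = [10, 20, 30]
f = [2, 5, 3]

# Grouped mean = sum of (f * x) / sum of f.
sum_fx = sum(fi * xi for fi, xi in zip(f, x))  # 20 + 100 + 90 = 210
mean_grouped = sum_fx / sum(f)                 # 210 / 10

print("Mean of ungrouped data:", mean_ungrouped)
print("Mean of grouped data:", mean_grouped)
```

The grouped formula gives the same answer as writing out every observation (10 twice, 20 five times, 30 three times) and using the ungrouped formula.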
MEDIAN
Median is defined as the middle value of a set of observations when the values are
arranged in ascending or descending order.
OR The median is the value that is in the middle when the data points are sorted from
smallest to largest.
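If the number of observations n is odd:
$$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ observation}$$
If the number of observations n is even:
$$\text{Median} = \frac{\left(\frac{n}{2}\right)^{\text{th}} \text{ observation} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{ observation}}{2}$$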
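The odd and even cases can be sketched as a small Python function. The function name and the sample data are illustrative; the result is compared with `statistics.median` from the Python standard library as a cross-check.

```python
import statistics


def median(values):
    """Return the middle value of the data after sorting it in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th observation (index shifted by 1, as lists start at 0).
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th observations.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_data = [7, 3, 9, 5, 1]   # sorted: 1, 3, 5, 7, 9
even_data = [7, 3, 9, 5]     # sorted: 3, 5, 7, 9

print("Median (odd n):", median(odd_data))
print("Median (even n):", median(even_data))
print("Library check:", statistics.median(odd_data), statistics.median(even_data))
```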
Dispersion
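The measures of dispersion help to interpret the variability of data. There are two types of
measures of dispersion: absolute measures (range, variance and standard deviation) and
relative measures (the coefficient of variation).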
Measures of Dispersion
1. Range
2. Variance
3. Standard deviation
4. The coefficient of variation
RANGE
The range is the most straightforward measure of spread. It is the difference between the
largest observed value (the maximum) and the smallest observed value (the minimum).
VARIANCE
It is based on the squared distances between the values of the individual cases and the mean.
To calculate the squared distance between a value and the mean, just subtract the mean from
the value and then square the difference.
$$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Here,
Σ represents the addition of values
x represents each value in the data set
𝐱̅ represents the mean of the data set
n represents the number of data values
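The variance calculation can be followed step by step in Python. The data values are hypothetical; the divisor n − 1 matches the formula in these notes, and `statistics.variance` from the standard library is used only as a cross-check.

```python
import statistics

# Hypothetical data values (illustrative only).
values = [4, 8, 6, 5, 7]
n = len(values)

# Step 1: the mean of the data set.
mean = sum(values) / n

# Step 2: squared distance of each value from the mean.
squared_distances = [(x - mean) ** 2 for x in values]

# Step 3: add the squared distances and divide by n - 1.
variance = sum(squared_distances) / (n - 1)

print("Mean:", mean)
print("Squared distances:", squared_distances)
print("Variance:", variance)
print("Library check:", statistics.variance(values))
```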
STANDARD DEVIATION
Take the square root of the variance and obtain what’s known as the standard deviation.
The Greek letter σ (“sigma”) is used as the symbol for the population SD, and the symbol ‘s’
(small letter ‘s’) is used to represent the SD of a sample.
$$\text{S.D.} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Here,
Σ represents the addition of values
x represents each value in the data set
𝐱̅ represents the mean of the data set
n represents the number of data values
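A short Python sketch shows the standard deviation as the square root of the variance. It uses hypothetical data and the sample formula (divisor n − 1) given above. The population SD, which divides by n instead, is shown with `statistics.pstdev` purely for comparison.

```python
import math
import statistics

# Hypothetical data values (illustrative only).
values = [4, 8, 6, 5, 7]
n = len(values)
mean = sum(values) / n

# Sample variance with divisor n - 1, then its square root.
variance = sum((x - mean) ** 2 for x in values) / (n - 1)
sample_sd = math.sqrt(variance)

# Population SD (divisor n), from the standard library, for comparison.
population_sd = statistics.pstdev(values)

print(f"Sample SD (s): {sample_sd:.4f}")
print(f"Library sample SD: {statistics.stdev(values):.4f}")
print(f"Population SD (sigma): {population_sd:.4f}")
```

For the same data the sample SD is slightly larger than the population SD, because it divides by n − 1 rather than n.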
COEFFICIENT OF VARIATION
The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the
standard deviation to the mean (average)
$$\text{Coefficient of Variation} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$$
In symbols:
$$\text{CV} = \frac{\sigma}{\bar{x}} \times 100$$
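Because the CV is a relative measure, it can compare the variability of data sets measured in different units. The sketch below uses hypothetical heights and weights; the function name is an illustrative choice, and it assumes the sample standard deviation (`statistics.stdev`, divisor n − 1) in the numerator.

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV as a percentage: (standard deviation / mean) * 100."""
    # Assumption: the sample SD (divisor n - 1) is used for the numerator.
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical measurements in different units (illustrative only).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(f"CV of heights: {cv_heights:.1f}%")
print(f"CV of weights: {cv_weights:.1f}%")
```

Both data sets have the same standard deviation, but the weights have the smaller mean, so their CV is larger: relative to the mean, the weights vary more than the heights.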