Unit-2 Ids

Uploaded by

nshreya09
CHAPTER II

VOCABULARY OF STATISTICS
Basic Concepts

Those new to statistics sometimes find its terminology difficult, so at this stage it is important to understand some new words and concepts.
This chapter sets out to provide meaning to some of the basic terms which are essential in carrying out statistical analysis.
SOME OF THE STATISTICAL EXPRESSIONS
Data: Data refers to any group of measurements that helps in providing information.
Quantitative Data: Data that possess numerical properties are known as quantitative data.
Qualitative Data: Qualitative data reflect non-numeric features or qualities of experimental units. Ex: colour, gender, good, high, low.
Statistics: Statistics is the use of data to help decision-makers reach better decisions.
Variable: A variable is a characteristic that may take on different values at different times, places or situations. Ex: income, wages, population, no. of SHGs, political parties, voters.
CONSTANTS, VARIABLES, CASES, VALUES
DISCRETE AND CONTINUOUS VARIABLES
VALUES
CATEGORIZING AND CODING THE VARIABLES
NOMINAL SCALE
ORDINAL SCALE
INTERVAL SCALE
GROUPED AND UNGROUPED DATA
DESCRIPTIVE STATISTICS
INDUCTIVE STATISTICS
MULTIVARIATE STATISTICS
CONCEPTS OF DISTRIBUTIONS
FREQUENCY DISTRIBUTION
PROBABILITY DISTRIBUTION
SHAPES OF DISTRIBUTIONS
Constants
- In mathematics and statistics, a constant is a number that is fixed and known, unlike a variable, which changes with the context.
- A symbol which has a fixed numerical value is called a constant. For example, 2, 5, 0, -3 and -7 are constants.
- A constant is a specific number or a symbol that is assigned a fixed value. For example, in the equation below, "y" and "x" are variables, while the numbers 2 and 3 are constants.
y = 2x - 3
Variables, Cases, Values

- In any study the researcher is concerned with a particular population or universe.
- Population refers to a specific group of people, institutions, occurrences or observations about which the researcher wishes to make descriptive or analytical statements.
- When resources are limited, the researcher will draw sample observations, either randomly or according to some agreed strategy, as the basis for investigation.
Variables

- We must define the characteristics of the population or sample units to understand the sample or universe better. Each characteristic of a population is termed a variable because these are attributes which vary between cases.
- A variable is a characteristic that may take on different values for different cases at different times, places or situations. Ex: incomes of different individuals, votes obtained by different political parties, population of different small towns, no. of SHGs in each district etc.

Example:
Cases: Individuals such as X, Y, Z
Variables: Gender is a variable since it varies among individuals, Male = 1, Female = 2

Some variables can have a larger number of categories.
Example: Number of children in a family
Cases: Family X, Y, Z
Variables: The number of children in each family is the variable. It may vary from zero to one child, two children and so on.
Discrete and Continuous Variables

- A variable may be either continuous or discrete. A continuous variable is capable of manifesting every conceivable fractional value within the range of possibilities, such as the height or weight of persons (Ex: 55.6, 60.4, 72.8 kg).
- On the other hand, a discrete variable is one which can vary only by ‘finite’ jumps and cannot manifest every conceivable fractional value.
- For some variables the values cannot logically be subdivided. For example, the number of children in a family, or the size of the family, can only take whole-number values such as 1, 2 or 3.
Values
These are the possible outcomes for a single variable. They differ from case to case. Values can be numbers or named categories. For example, the variable Gender has two values, "male" and "female". Some people (cases) are men, and some are women.
Example:
Cases: Individuals such as X, Y, Z
Variables: Gender is a variable since it varies among individuals, Male = 1, Female = 2.
Values: The values for the variable gender are 1 and 2, which are assigned or coded for our convenience.
In the case of ‘incomes’, the values are actual numbers:
For Case X, Income is the Variable and Rs. 25,000/- is the Value.
For Case Y, Income is the Variable and Rs. 34,000/- is the Value.
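The relationship between cases, variables and values can be sketched as a small table in Python. This is only an illustration: the case labels, gender codes and incomes follow the example above, while the gender assigned to each case and the helper name `values_of` are assumptions made for the sketch.

```python
# Each case is one row, each variable is one column, each cell holds a value.
# Gender codes and incomes follow the slide; which gender belongs to which
# case is hypothetical.
GENDER_CODES = {"Male": 1, "Female": 2}

cases = {
    "X": {"gender": GENDER_CODES["Female"], "income": 25000},
    "Y": {"gender": GENDER_CODES["Male"], "income": 34000},
}


def values_of(variable, data):
    """Return the value each case takes on the given variable."""
    return {case: row[variable] for case, row in data.items()}


print(values_of("income", cases))  # {'X': 25000, 'Y': 34000}
print(values_of("gender", cases))  # {'X': 2, 'Y': 1}
```

Reading across a row gives everything known about one case; reading down a column gives the values of one variable.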
TYPES OF VARIABLES
Categorizing and coding the Variables

An important stage of the research process is the allocation of a numerical value to each category of a variable.
This is called coding, for example non-literates = 0 and literates = 1.
The process of coding helps the researcher to categorize the population or sample observations.

Categorical variables:
Categorical variables have values that describe a 'quality' or 'characteristic' of a data unit, like 'what type' or 'which category'.

There are four levels of measurement, or scales, for measuring variables.

a) Nominal scale: Variables measured on the nominal scale are essentially qualitative rather than quantitative in form.

The values of such variables are categories, not true numbers, and cannot be ordered in any mathematically meaningful way.
- A nominal variable with only two possible values is referred to as a dichotomous variable. For example, if we ask a person whether they own a mobile phone, we may categorize mobile phone ownership as either "Yes" or "No".
- A variable can also have more than two categories, such as religious belief. These are called polytomous variables.

Hindu      1
Muslim     2
Christian  3
Buddhist   4

- On a nominal scale each value of the variable represents a category; the codes imply no particular order or relationship between the values.
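A short sketch may help show what nominal codes can and cannot be used for. The codes come from the table above; the eight survey responses are hypothetical and exist only for illustration.

```python
from collections import Counter

# Codes taken from the table above; the responses are hypothetical.
RELIGION_CODES = {1: "Hindu", 2: "Muslim", 3: "Christian", 4: "Buddhist"}
responses = [1, 3, 2, 1, 4, 1, 2, 3]

# Counting how often each category occurs is valid for nominal data.
counts = Counter(RELIGION_CODES[code] for code in responses)
most_common_label, most_common_count = counts.most_common(1)[0]
print(most_common_label, most_common_count)  # Hindu 3

# The arithmetic mean of the codes can be computed, but it has no meaning:
# relabelling the categories would change it without changing the data.
meaningless_mean = sum(responses) / len(responses)
```

Frequencies and the modal category are meaningful here, whereas sums and averages of the codes are not.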
b) Ordinal scale:
The nominal scale of measurement permits only the classification of observations into different categories, whereas the ordinal scale of measurement also permits the ordering of those categories into ranks.
We can distinguish between the values in terms of degree, but cannot measure the size of the difference between them.

Example: A group of workers' opinions about the work environment.

Very poor     0
Poor          1
Satisfactory  2
Good          3
Very good     4
c) Interval scale: The interval scale implies both an ordering of categories and a measure of the distance between them. The differences between points on the scale are measurable and exactly equal.

Example: Number of absences each employee had in an organization in a month.

No absences  0
One day      1
Two days     2
Three days   3 and so on.

- The numbers of days here are ordered categories measured in a standard unit, so we can say exactly that three days is more than one day but less than six days. Because a count of days also has a true zero, we can further say that four days' absence is twice as many as two days; ratio statements of this kind strictly belong to the ratio scale.
d) Ratio scale:
A ratio scale is a quantitative scale where there is a true zero and equal intervals between neighboring points. Unlike on an interval scale, a zero on a ratio scale means there is a total absence of the variable you are measuring.

Ex: Length, area, and population are measured on ratio scales.

The term "ratio scale" is also used in map reading, in a different sense: it is one of the most common ways to depict scale on maps. It tells the map reader that one unit on the map is equal to a certain number of units in the real world. For example, 1:2500 means that 1 cm on the map represents 2500 cm on the ground.
A few more examples:
- Income, height, weight, unemployment rate, and crime rate are examples of ratio data. As an analyst, you can say that a crime rate of 10% is twice that of 5%.
- Another example of ratio data: a weight of 90 kg is twice 45 kg.
- No negative value: On a ratio scale there cannot be any negative value. Zero is the starting point of a ratio scale, which means a numerical value less than zero cannot exist. For example, a person cannot have negative height.
- Age is typically considered to be measured on a ratio scale. This is because age has a true zero point, which means that a value of zero represents the absence of age.
- In addition, it is possible to perform mathematical operations such as addition, subtraction, multiplication, and division on age values.
- The most common examples of ratio-scale variables are height, money, age, weight etc.
DIFFERENCE BETWEEN INTERVAL VS RATIO SCALE
- The difference between interval and ratio scales comes from the meaning of zero. Interval scales have no true zero, so zero is just another point on the scale and values can fall below it. For example, you can measure temperatures below 0 degrees Celsius, such as -10 degrees.
- Ratio variables, on the other hand, never fall below zero.
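The practical consequence of a missing true zero can be illustrated numerically. The sketch below is not from the slides: the kelvin conversion and the kilogram-to-pound factor are added here as assumptions to show that ratios of Celsius readings depend on the unit, while ratios of weights do not.

```python
# Temperatures in degrees Celsius (interval scale: the zero point is arbitrary).
warm_c, cool_c = 20.0, 10.0

# Differences are meaningful on an interval scale.
difference_c = warm_c - cool_c  # 10.0 degrees


def celsius_to_kelvin(celsius):
    """Convert to kelvin, a temperature scale with a true zero."""
    return celsius + 273.15


# Ratios are not meaningful in Celsius: the "2x" disappears in kelvin.
ratio_c = warm_c / cool_c  # 2.0
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)  # about 1.035

# Weight is a ratio variable, so "twice as heavy" survives a change of unit.
KG_TO_LB = 2.20462  # approximate conversion factor
ratio_kg = 90 / 45
ratio_lb = (90 * KG_TO_LB) / (45 * KG_TO_LB)

print(ratio_c, round(ratio_k, 3), ratio_kg, round(ratio_lb, 3))
```

The Celsius ratio of 2 does not survive the conversion, which is why statements such as "twice as hot" are avoided for interval data.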
Attributes:
An attribute refers to the quality of a characteristic. The theory of attributes deals with qualitative characteristics, which are studied quantitatively by counting the cases that possess them.
Attributes therefore need a slightly different kind of statistical treatment from variables.
For example, eye color is an attribute of a person.
Attributes refer to the characteristics of the item under study, like the habit of smoking or drinking. So ‘smoking’ and ‘drinking’ are both examples of attributes.
In statistics, classifying data on the basis of attributes or characteristics is known as qualitative classification of data. Examples of attributes are region, caste etc.
Grouped and Ungrouped Data

Ungrouped Data: Data obtained in their original form are called raw data or ungrouped data.
Example: The marks obtained by 25 students in a class in a certain examination are given below:

25, 8, 37, 16, 45, 40, 29, 12, 42, 25, 14, 16, 16, 20, 10, 36, 33, 24, 25, 35, 11, 30, 45, 48.

This is ungrouped data, in its original form without any ordering or grouping.
Grouped Data

To put the data in a more condensed form, we make groups of suitable size and state the frequency of each group. Such a table is called a grouped frequency distribution table. Here we aggregate or group the data into ordered categories.

Employees' age   No. of cases
16-20 years      47
21-30 years      95
31-40 years      67
41-50 years      71
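As a rough sketch of how raw data become a grouped frequency table, the marks from the ungrouped-data example can be tallied into class intervals. The list as reproduced in the slides contains 24 values; the class width of 10 and the helper name `class_label` are assumptions chosen for illustration.

```python
from collections import Counter

# Marks as listed in the ungrouped-data example (24 values appear in the list).
marks = [25, 8, 37, 16, 45, 40, 29, 12, 42, 25, 14, 16,
         16, 20, 10, 36, 33, 24, 25, 35, 11, 30, 45, 48]

CLASS_WIDTH = 10  # assumed class width; the slides do not prescribe one


def class_label(mark, width=CLASS_WIDTH):
    """Return the class interval (for example '20-29') that a mark falls into."""
    lower = (mark // width) * width
    return f"{lower}-{lower + width - 1}"


frequency = Counter(class_label(m) for m in marks)

# Print the table in ascending order of the class lower limit.
for label in sorted(frequency, key=lambda s: int(s.split("-")[0])):
    print(f"{label:>6}  {frequency[label]}")
```

With this class width the table has five rows with frequencies 1, 7, 6, 5 and 5. A different width would give a coarser or finer table from the same raw data.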
Descriptive statistics

Descriptive statistics provide a concise summary of data. Using descriptive statistics, we can summarize data in a meaningful way, either numerically (for example in the form of tables) or graphically.

Example: Marks obtained by students in a class.

Statistic            Value
Mean                 65
Standard deviation   7
Range                14
N (sample size)      50
Descriptive statistics summarize and organize the characteristics of a data set. A data set is a collection of responses or observations from a sample or an entire population.

In quantitative research, after collecting data, the first step of statistical analysis is to describe characteristics of the responses, such as the average of one variable (e.g., age), or the relation between two variables (e.g., age and creativity).
Apart from tables and graphs, there are typically two general types of descriptive statistics that are used to describe data.

a) Measures of central tendency: To describe the characteristic of an entire mass of data with a single value, we use descriptive statistics such as the mean, median and mode.

b) Measures of spread: To understand the scatter, spread or variation in the data, we use measures of spread (or measures of variation) such as the range, quartiles, absolute deviation, variance and standard deviation.
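The measures named in (a) and (b) can be computed with Python's standard `statistics` module. The sketch below reuses the 24 marks listed in the ungrouped-data example; treating them as a sample (so that variance divides by n - 1) is an assumption made here.

```python
import statistics

# Marks from the ungrouped-data example (24 values as listed in the slides).
marks = [25, 8, 37, 16, 45, 40, 29, 12, 42, 25, 14, 16,
         16, 20, 10, 36, 33, 24, 25, 35, 11, 30, 45, 48]

# Measures of central tendency
mean = statistics.mean(marks)        # 26.75
median = statistics.median(marks)    # 25.0
modes = statistics.multimode(marks)  # [25, 16]: both occur three times

# Measures of spread
data_range = max(marks) - min(marks)           # 40
variance = statistics.variance(marks)          # sample variance (divides by n - 1)
std_dev = statistics.stdev(marks)              # sample standard deviation
q1, q2, q3 = statistics.quantiles(marks, n=4)  # quartiles; q2 equals the median

print(mean, median, modes, data_range)
print(round(variance, 2), round(std_dev, 2), q1, q2, q3)
```

If the marks were regarded as a whole population instead, `statistics.pvariance` and `statistics.pstdev` would be the corresponding calls.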
Inductive Statistics

- The branch of statistics dealing with generalizations, predictions, estimations and arriving at conclusions based on data from a sample is called inductive statistics.
- When we do this we are inducing or inferring the characteristics of the population from the characteristics of the sample.
- The purpose of inductive statistics is to assist the researcher in assessing how representative a sample is of the population. Inductive statistics are also commonly called inferential statistics.
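One common inferential step is to estimate a population mean from a sample mean and attach a margin of uncertainty to it. The sketch below is a simplified illustration, not a prescribed method: the ten incomes are hypothetical, and it uses the normal approximation, whereas for a sample this small a t-distribution would usually be preferred.

```python
import math
import statistics

# Hypothetical sample of 10 monthly incomes (in Rs. thousands).
sample = [25, 34, 28, 31, 40, 22, 36, 29, 33, 27]

n = len(sample)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# A 95% confidence level corresponds to alpha = 0.05 (two-sided).
alpha = 0.05
z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error
print(f"mean = {sample_mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval narrows as the sample size grows, because the standard error shrinks in proportion to the square root of n.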
Example: a significance level of alpha = 0.05
In inductive statistics we discuss the following:

- Why we use a sample
- Various sampling procedures, such as random and non-random sampling methods
- Random sampling error, bias
- Estimating the population mean from the sample mean, normal distribution, standard error, confidence levels, testing of hypothesis etc.
- Concepts of distributions
BIVARIATE STATISTICS

Bivariate analysis is a statistical method for examining how two different variables are related. It aims to determine whether there is a statistical link between the two variables and, if so, how strong that link is and in which direction it runs.

It is one of the simplest forms of statistical analysis, used to find out if there is a relationship between two sets of values. It usually involves the variables X and Y.
Bivariate analysis is the analysis of exactly two variables.

Multivariate analysis is the analysis of more than two variables.
The data for a bivariate analysis can be stored in a two-column data table. For example, you might want to find out the relationship between caloric intake and weight.
TYPES OF BIVARIATE ANALYSIS
Scatter plots:
These give you a visual idea of the pattern that your variables follow.
Correlation Analysis:
Correlation analysis, also known as bivariate correlation, describes the relationship between two variables or two sets of data.
This relationship can be perfect positive, strong positive, weak positive, no correlation (zero), weak negative, strong negative, or perfect negative.
Regression Analysis:
Regression analysis is a statistical method that models the relationship between two variables in bivariate analysis.
Usually expressed in a graph, the method tests the relationship between a dependent variable and an independent variable.
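Correlation and simple regression can both be computed from the same sums of squares and cross-products. The following is a minimal sketch using Pearson's coefficient and an ordinary least-squares line; the calorie and weight figures are hypothetical, and the function names `pearson_r` and `least_squares_line` are chosen here for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: daily caloric intake (kcal) and body weight (kg).
calories = [1800, 2000, 2200, 2500, 2800, 3000]
weight = [55, 60, 62, 70, 76, 80]

r = pearson_r(calories, weight)
slope, intercept = least_squares_line(calories, weight)
print(f"r = {r:.3f}; weight = {slope:.4f} * calories + {intercept:.2f}")
```

Here the independent variable is caloric intake and the dependent variable is weight. A value of r near +1 indicates a strong positive relationship, and the sign of the slope always matches the sign of r.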
Multivariate Statistics

- Multivariate statistics is a subdivision of statistics concerned with the simultaneous observation and analysis of more than two variables.

Example: Four of the most common multivariate techniques are multiple regression analysis, factor analysis, path analysis and multivariate analysis of variance (MANOVA), an extension of ANOVA.
Concepts of Distributions

Frequency Distribution
- A frequency distribution is a representation, in either graphical or tabular format, that displays the number of observations within a given interval. Ex: bar diagram
- A frequency distribution refers to data classified on the basis of some variable that can be measured, such as population (students, employees, farmers), wages, age etc.
- Frequency tables are often used to create histograms and frequency polygons. The following are two examples, one of a discrete and one of a continuous frequency distribution.

Ex: Discrete Variable

No. of Children   No. of Families
0                 10
1                 40
2                 100
3                 15

Ex: Continuous Variable

Weight (kg)   No. of Persons
40-50         20
50-60         45
60-70         40
70-80         25
Probability Distribution

- A variable whose value is determined by the outcome of a random experiment is called a random variable. A random variable is also known as a chance variable.
- For a discrete random variable, a probability distribution contains the probability of each possible outcome. The sum of all probabilities is always 1.0.
- Examples of discrete random variables include the number of children in a family, the Friday night attendance at a cinema, and the number of defective light bulbs in a box.

Examples of Probability Distributions: Binomial distribution and Poisson distribution (discrete), Normal distribution (continuous probability distribution).
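A discrete probability distribution can be estimated from a frequency table by converting counts into relative frequencies. The sketch below applies this to the "No. of Children" frequencies given in the Frequency Distribution section; treating relative frequencies as probabilities is an assumption made for illustration.

```python
from fractions import Fraction

# Frequencies from the "No. of Children" table in the Frequency Distribution section.
families_by_children = {0: 10, 1: 40, 2: 100, 3: 15}

total = sum(families_by_children.values())  # 165 families

# Relative frequency of each outcome, used here as an estimate of its probability.
probability = {k: Fraction(count, total) for k, count in families_by_children.items()}

# The probabilities of all possible outcomes sum to exactly 1.
total_probability = sum(probability.values())

# Expected (mean) number of children under this distribution.
expected_children = sum(k * p for k, p in probability.items())

for k, p in probability.items():
    print(f"P(children = {k}) = {p} ~ {float(p):.3f}")
print(f"Sum = {total_probability}, expected value = {float(expected_children):.3f}")
```

Exact fractions are used so that the check on the sum of the probabilities is not affected by floating-point rounding.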
Shapes of Distributions

- Distributions have different shapes. A distribution with the longer tail extending in the positive direction is said to have a positive skew. It is also described as "skewed to the right".
- Some distributions have a negative skew. Since the tail of the distribution extends to the left, such a distribution is skewed to the left.
- Skewness tells us the direction of outliers. In a positive skew, the tail of the distribution curve is longer on the right side. This means the outliers lie further out towards the right, while values on the left stay closer to the mean.
- We get skewed distributions because data are disproportionately distributed. Specifically, the majority of the data is clustered in one area, and there are one or more outliers away from the majority of the data. Outliers are data points that are unlike most of the rest of the data.
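The direction of skew can be checked numerically. The sketch below uses the moment coefficient of skewness, which is one of several possible measures and is chosen here as an assumption; both data sets are hypothetical.

```python
import statistics


def skewness(data):
    """Moment coefficient of skewness: mean cubed deviation over population SD cubed."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    n = len(data)
    return sum((x - mean) ** 3 for x in data) / (n * sd ** 3)


# Hypothetical data: most values cluster at the low end, with one large outlier.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
left_skewed = [-x for x in right_skewed]  # mirror image, tail on the left

print(skewness(right_skewed) > 0)  # True: positive skew
print(skewness(left_skewed) < 0)   # True: negative skew

# In the right-skewed data the outlier pulls the mean above the median.
print(statistics.fmean(right_skewed), statistics.median(right_skewed))  # 4.75 3.0
```

The comparison of mean and median gives a quick informal check: a mean well above the median suggests a tail to the right.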
Thank You
CHAPTER-III
TYPES AND COLLECTION OF DATA
Secondary and Primary Data
Concepts of Cross-section, Time Series, Panel data (balanced and unbalanced)
Tools of Collecting Primary Data
Questionnaires and Schedules
Editing and Coding


COLLECTION OF DATA

Collection of data constitutes the first step in a statistical investigation. Utmost care must be exercised while collecting data because data constitute the foundation on which the superstructure of statistical analysis is built.

Data may be obtained either from a primary source or from a secondary source. A primary source is one that itself collects the data, whereas a secondary source is one that makes available data which were collected by some other agency.
PRIMARY AND SECONDARY DATA

Depending upon the source, statistical data are classified under two categories.

(i) Primary data
(ii) Secondary data
PRIMARY DATA

Primary data are obtained by a study specifically designed to fulfill the needs of the problem at hand.

Such data are original in character and are generated in a large number of surveys conducted mostly by government and also by some individuals, institutions and research bodies.

- For example, data obtained in a population census by the office of the census commissioner are primary data.
SECONDARY DATA

Data which are not originally collected by the user but rather obtained from published or unpublished sources are known as secondary data.

For example, for the census commissioner census data are primary, whereas for all others who use such data, they are secondary.
DIFFERENCES BETWEEN PRIMARY AND SECONDARY DATA
The difference between primary and secondary data is only one of degree. Data which are primary in the hands of one become secondary in the hands of another.
Differences:

Data are primary for the individual, agency or institution collecting the data, whereas for the rest of the world the data are secondary.

A few examples would clarify the distinction between primary and secondary data. Suppose an investigator wants data about the spending habits of the students of a university.
If the investigator collects the data personally or through agents, by interviewing the students or circulating a questionnaire, then the data constitute primary data for the investigator.

On the other hand, if the students' union has already made a similar survey and the investigator obtains the data from the students' union office, such data are secondary data for the investigator.

Similarly, statistics collected by various departments of the government, such as the Labour Bureau, the Central Statistical Organisation (C.S.O) and the Bureau of Economics and Statistics, are primary for the respective departments, whereas for all others they constitute secondary data.
SOURCES OF SECONDARY DATA
In most studies the investigator finds it impracticable to collect firsthand information on all related issues and as such makes use of data collected by others.

There is a vast amount of published information from which statistical studies may be made.

The sources of secondary data can broadly be classified under two heads.
(i) Published sources
(ii) Unpublished sources
PUBLISHED SOURCES
1) Reports and official publications of various bodies, viz.

- International bodies such as the World Bank, International Labour Organisation (ILO), United Nations (UN).
- Central and State Government reports such as Statistical Abstracts.
- Reports of the Committees and Commissions appointed by government, such as the Narasimham Committee on the banking sector, the Seventh Pay Commission etc.

2) Semi-official publications of various local bodies such as Municipal Corporations and District Boards.

3) Publications of prestigious journals by universities and institutes such as Indian Journal of Agricultural Economics, Indian Economic Review, Reserve Bank of India Bulletin etc.

4) Annual Reports of R.B.I, NABARD, Stock Exchanges, Corporations etc.

5) Publications brought out by various autonomous research institutes such as Institute of Economic Growth, Delhi, National Council of Applied Economic Research (NCAER), Delhi, Centre for Economic and

It should be noted that the publications mentioned above vary with regard to the periodicity of publication.

Some are published at regular intervals (yearly, monthly, weekly etc.), whereas others are ad hoc publications, i.e. with no regularity of publication.
UNPUBLISHED SOURCES

Not all statistical material is published. There are various sources of unpublished data, such as records maintained by various government and private offices.

Studies made by research institutes and scholars also fall under this category.
CROSS-SECTIONAL DATA

Definition: Cross-sectional data is information that is gathered at one point in time to reflect social conditions.

Cross-sectional data, or a cross section of a population, in statistics is a type of data collected by observing individuals, firms, countries, or regions at one point of time, or without regard to differences in time.

Analysis of cross-sectional data usually consists of comparing the differences among the subjects (individuals, firms, countries, or regions).

Example: Number of habitations in a region in 1996.

For example, if we want to measure current obesity levels in a population, we could draw a sample of 1,000 people randomly from that population. This is also known as a cross section of that population.

We then measure their weight and height, and calculate what percentage of that sample is categorized as obese.

This cross-sectional sample provides us with a snapshot of that population at that point of time. Note that we cannot tell from one cross-sectional sample whether obesity is increasing or decreasing; we can only describe the current proportion.
TIME SERIES DATA

Time series data differ from cross-sectional data in that the units of observation are observed at various points of time.

A time series is a collection of observations made sequentially through time. The interval between observations can be any time interval (hours within days, days, weeks, months, years, etc).
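The contrast between the two kinds of data shows up in how they are laid out. The sketch below is illustrative only: the region names, counts, dates and temperatures are all hypothetical, with 1996 borrowed from the cross-sectional example above.

```python
# Cross-sectional data: many units observed at one point in time.
# Region names and counts are hypothetical.
habitations_1996 = {"Region A": 120, "Region B": 95, "Region C": 143}

# Time series data: one unit observed sequentially at equal intervals.
# Hypothetical temperature readings (degrees Celsius) for a single station.
station_readings = [
    ("2024-01-01", 28.5),
    ("2024-01-02", 29.1),
    ("2024-01-03", 27.8),
    ("2024-01-04", 30.2),
]

# A cross section supports comparisons between units...
largest_region = max(habitations_1996, key=habitations_1996.get)

# ...while a time series supports statements about change over time.
changes = [
    round(later[1] - earlier[1], 1)
    for earlier, later in zip(station_readings, station_readings[1:])
]

print(largest_region)  # Region C
print(changes)         # [0.6, -1.3, 2.4]
```

The order of the observations matters in the time series but not in the cross section, which is why the first is stored as a sequence and the second as a mapping.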
Some areas of applications:

Time series can occur in a wide range of fields, from economics, sociology, meteorology and geography to financial investment, etc.

Some examples of time series are:

- Malaria incidence or deaths over calendar years, Covid-19 cases
- Daily maximum temperatures
- Hourly records of babies born at a maternity hospital

Can you suggest other examples?


Thank You
