RM Data Analysis

This document discusses data analysis concepts including variables, data sets, qualitative and quantitative variables, and descriptive statistics. It defines key terms like variable, data set, qualitative and quantitative variables. It also discusses different types of variables such as continuous and discrete variables. Finally, it provides examples of how to display and present data through methods like frequency distributions, charts, graphs, and summary statistics.

Data analysis

What do we analyze?
• Variable – characteristic that varies
• Data – information on variables (values)
• Data set – lists variables, cases, values
• Qualitative variable – discrete values, categories.
– Frequencies, percentages, proportions
• Quantitative variable – range of numerical values
– Mean, median, range, standard deviation, etc.
Concepts
• A variable is any measured characteristic or attribute
that differs for different subjects. For example, if the
weight of 30 subjects were measured, then weight
would be a variable.
Quantitative and Qualitative
• Quantitative variables are measured on
an ordinal, interval, or ratio scale
• Qualitative variables are measured on a nominal scale.
Qualitative variables are sometimes called "categorical
variables.”
• If five-year-old subjects were asked to name their
favorite color, then the variable would be qualitative.
If the time it took them to respond were measured,
then the variable would be quantitative.
Concepts
• Continuous and Discrete
  – Some variables (such as reaction time) are measured on a continuous scale. There is an infinite number of possible values these variables can take on.
  – Other variables can only take on a limited number of values. For example, if a dependent variable were a subject's rating on a five-point scale where only the values 1, 2, 3, 4, and 5 were allowed, then only five possible values could occur. Such variables are called "discrete" variables.
Data Analysis (1)
• Review research aims/objectives
  – This helps to focus the analysis
• Interpret results with care
  – Consider that some people may distort answers to affect the outcome
• Look for facts and patterns
• Don't ignore negative results
• Take time when analysing results

Data Analysis (2)
Quantitative Questions
• Collect data and input it into a spreadsheet
  – Excel is the simplest
• Display data using charts and graphs
• Use graphs and charts that are clear and easy to understand
• Run correlations to assess potential relationships
• Run regressions to investigate possible cause-and-effect relationships
• Interpret the data and explain any patterns and trends
  – You may have to break down the analysis to find patterns

Qualitative Questions
• Open questions are harder to interpret as they give unique and wide-ranging answers
  – There is no single common way of analysing them
• Organise comments into similar categories
• Attempt to identify patterns, associations and relationships
• To analyse opinions on a spreadsheet it is best to use rating questions
Creating a data set
• May involve coding and data entry
• Coding = assigning numerical value to
each value of a variable
– Gender: 1= male, 2 = female
– Year in school: 1= primary, 2= O level, etc.
– May need codes for missing data (no
response, not applicable)
– Large data sets come with codebooks
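
As a rough sketch of what coding can look like in practice, the snippet below maps text answers to the numeric codes from the slide using pandas. The column names, answer strings and the missing-data code 9 are illustrative assumptions, not part of any data set referenced here.

```python
# Minimal sketch: coding categorical survey answers as numbers with pandas.
# Column names, answer strings and the missing-data code are illustrative.
import pandas as pd

responses = pd.DataFrame({
    "gender": ["male", "female", "female", None, "male"],
    "school": ["primary", "O level", "primary", "O level", None],
})

gender_codes = {"male": 1, "female": 2}      # 1 = male, 2 = female (as on the slide)
school_codes = {"primary": 1, "O level": 2}  # 1 = primary, 2 = O level, etc.

coded = pd.DataFrame({
    "gender": responses["gender"].map(gender_codes),
    "school": responses["school"].map(school_codes),
})

# Use a distinct code (here 9) for missing data such as "no response".
coded = coded.fillna(9).astype(int)
print(coded)
```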
Descriptive Statistics
Descriptive statistics can be used to summarize and describe a single variable (UNIvariate)
• Frequencies (counts) & Percentages
  – Use with categorical (nominal) data
  – Levels, types, groupings, yes/no
• Means & Standard Deviations
  – Use with continuous (interval/ratio) data
  – Height, weight, scores on a test
Displaying and Presenting Data
• Frequency distribution – list of all possible
values of a variable and the # of times
each occurs
– May require grouping into categories
– May include percentages, cumulative
frequencies, cumulative percentages
Displaying and Presenting Data
• Ungrouped frequency distribution
– Usually qualitative variables
• Grouped frequency distribution
– Values are combined (grouped) into
categories
– Use for quantitative variables
– Many separate values
Frequencies
• Frequency distribution is a table that displays the
frequency of various outcomes in a sample. Each entry
in the table contains the frequency or count of the
occurrences of values within a particular group or
interval, and in this way, the table summarizes the
distribution of values in the sample.

• Assume that in your survey, people were asked how many cars were registered to their households. The results were recorded as follows:

1 2 1 0 3 4 0 1 1 1 2 2 3 2 3 2 1 4 0 0
• Steps to be followed to present this data in a frequency
distribution table.
– Divide the results (x) into intervals, and then count the number
of results in each interval. In this case, the intervals would be the
number of households with no car (0), one car (1), two cars (2)
and so forth.
– Make a table with separate columns for the interval numbers
(the number of cars per household), the tallied results, and the
frequency of results in each interval. Label these columns
Number of cars, Tally and Frequency.
– Read the list of data from left to right and place a tally mark in
the appropriate row. For example, the first result is a 1, so place
a tally mark in the row beside where 1 appears in the interval
column (Number of cars). The next result is a 2, so place a tally
mark in the row beside the 2, and so on. When you reach your
fifth tally mark, draw a tally line through the preceding four
marks to make your final frequency calculations easier to read.
– Add up the number of tally marks in each row and record them
in the final column entitled Frequency
Frequency table for the distribution of cars per household

Number of cars (x)   Tally     Frequency (f)
0                    ||||      4
1                    |||| |    6
2                    ||||      5
3                    |||       3
4                    ||        2
Total                          20

[Bar chart and pie charts showing the distribution of cars per household]

By looking at this frequency distribution table quickly, we can see that out of the 20 households surveyed, 4 households had no cars and 6 households had 1 car.
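
A minimal Python sketch that reproduces this frequency table (with percentages and cumulative percentages) from the raw car counts, using only the standard library:

```python
# Sketch: building the cars-per-household frequency table from the survey data.
from collections import Counter

cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

freq = Counter(cars)
total = len(cars)

print("Cars  Frequency  Percent  Cumulative %")
cumulative = 0
for value in sorted(freq):
    count = freq[value]
    cumulative += count
    print(f"{value:>4}  {count:>9}  {100*count/total:>6.1f}  {100*cumulative/total:>11.1f}")
```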
Continuous → Categorical

Distribution of firm sizes (firm distribution elements):

                          Tanzania   Uganda   Kenya
Small (5-19 employees)    55%        63%      31%
Medium (20-99 employees)  32%        29%      39%
Large (100+ employees)    13%        8%       30%
Young (<20 yrs)           59%        67%      45%
Mid (21-50 yrs)           38%        31%      46%
Old (>51 yrs)             3%         2%       9%

Even though firm size and age are continuous data, they are being treated as "nominal" because they are broken down into groups or categories.

Tip: It is usually better to collect continuous data and then break it down into categories for data analysis, as opposed to collecting data that fits into preconceived categories.
Ordinal Level Data
Frequencies and percentages can be
computed for ordinal data
– Examples: Likert Scales (Strongly Disagree to
Strongly Agree); A level/Vocational/Univ
Graduate/Post-graduate

[Bar chart: frequency of responses on a Likert scale, from Strongly Agree through Agree and Disagree to Strongly Disagree]
Summary statistics
• Percent = relative frequencies;
standardized units.
• Cumulative frequency or percent =
frequency at or below a given category (at
least ordinal data required)
Visual Presentation of Data
• Bar graph (column chart, histogram): best
with fewer categories
• Pie chart: good for displaying percentages;
easily understood by general audience
• Line graph: good for numerical variables
with many values or for trend data
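
As an illustrative sketch (assuming matplotlib is available), the following draws a bar chart and a pie chart of the cars-per-household frequencies from the earlier example:

```python
# Sketch: a bar chart and pie chart of the cars-per-household frequencies.
import matplotlib.pyplot as plt

values = [0, 1, 2, 3, 4]          # number of cars
frequencies = [4, 6, 5, 3, 2]     # households in each category

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(9, 4))

ax_bar.bar([str(v) for v in values], frequencies)
ax_bar.set_xlabel("Number of cars")
ax_bar.set_ylabel("Frequency")
ax_bar.set_title("Distribution of cars per household")

ax_pie.pie(frequencies, labels=[str(v) for v in values], autopct="%1.0f%%")
ax_pie.set_title("Distribution of cars per household")

plt.tight_layout()
plt.show()
```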
Bivariate histogram
Combining different units
[Chart: "Characteristics of firms and response to competition in Tanzania, Uganda and Kenya" – for each country, the proportion of firms introducing a new product or innovating (left axis), broken down by exporters, firm size (small, medium, large) and firm age (young, mid, old), plotted alongside the number of firms (right axis).]
Unusual features
• Outliers
• Gaps
Graphing data – scatter diagram
Patterns of Data in Scatterplots

A scatterplot is a graphical way to display the relationship between two quantitative sample variables.
• Slope – the direction of change in variable Y with respect to an increase in the value of variable X.
• Strength – the degree of spread of the scatter in the plot.
Summary statistics:
central tendency
• “Where is the center of the distribution?”
– Mode = category with highest frequency
– Median = middle category or score
– Mean = average score
Interval/Ratio Distributions
The distribution of interval/ratio data
often forms a “bell shaped” curve.
– Many phenomena in life are normally
distributed (age, height, weight, IQ).
Interval & Ratio Data
Measures of central tendency and measures of dispersion are often computed with interval/ratio data.

• Measures of Central Tendency (the "middle point")
  – Mean, median, mode
  – If your frequency distribution shows outliers, you might want to use the median instead of the mean

• Measures of Dispersion (how "spread out" the data are)
  – Variance, standard deviation, standard error of the mean
  – Describe how "spread out" a distribution of scores is
  – High values of the variance and standard deviation may mean that scores are "all over the place" and do not necessarily fall close to the mean

In research, means are usually presented along with standard deviations or standard errors.
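
A brief Python sketch of these summary statistics using the standard-library statistics module; the scores are made-up illustrative values, not data from these slides:

```python
# Sketch: central tendency and dispersion for a small illustrative sample.
import statistics as stats

scores = [4, 7, 7, 8, 10, 12, 14]   # illustrative data

mean = stats.mean(scores)
median = stats.median(scores)
mode = stats.mode(scores)
variance = stats.variance(scores)    # sample variance (divides by n - 1)
sd = stats.stdev(scores)             # sample standard deviation
se = sd / len(scores) ** 0.5         # standard error of the mean

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={variance:.2f} sd={sd:.2f} se={se:.2f}")
```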
Frequency distribution of random errors

❖ As the number of measurements increases, the distribution becomes more stable.
  – The larger the effect, the fewer data you need to identify it.
❖ Many measurements of continuous variables show a bell-shaped curve of values; this is known as a Gaussian (normal) distribution.
Central value: The Median
• Middlemost or most central item in the set of
ordered numbers; it separates the distribution
into two equal halves
• If odd n, middle value of sequence
– if X = [1,2,4,6,9,10,12,14,17]
– then 9 is the median
• If even n, average of 2 middle values
– if X = [1,2,4,6,9,10,11,12,14,17]
– then 9.5 is the median; i.e., (9+10)/2
• Median is not affected by extreme values
Central value: The Mode
• The mode is the most frequently occurring
number in a distribution
– if X = [1,2,4,7,7,7,8,10,12,14,17]
– then 7 is the mode
• Easy to see in a simple frequency distribution
• Possible to have no modes or more than one
mode
– bimodal and multimodal
• Modes do not have to have exactly equal frequencies
– major mode, minor mode
• Mode is not affected by extreme values
When to Use What
• Mean is a great measure. But, there are
times when its usage is inappropriate or
impossible.
– Nominal data: Mode
– The distribution is bimodal: Mode
– You have ordinal data: Median or mode
– There are a few extreme scores: Median
Mean, Median, Mode

[Three distribution curves: in a negatively skewed distribution the mean lies below the median, which lies below the mode; in a symmetric (not skewed) distribution the mean, median and mode coincide; in a positively skewed distribution the mode lies below the median, which lies below the mean.]
Symmetrical vs. Skewed Frequency
Distributions
• Symmetrical distribution
– Approximately equal numbers of
observations above and below the middle
• Skewed distribution
– One side is more spread out than the other,
like a tail
– Direction of the skew
• Positive or negative (right or left)
• Side with the fewer scores
• Side that looks like a tail
Example: Information collected on the average height of respondents in two surveys is as follows:

Measure   Survey 1   Survey 2
Mean      150        145
Median    141        152
S.D.      30         30

Can we conclude that the two distributions are similar in their variation?
Solution:
A look at the information available reveals that both surveys have an equal dispersion of 30 cm. However, to establish whether the two distributions are similar or not, a more comprehensive analysis is required, i.e. we need to work out a measure of skewness.

SKP = (Mean − Mode) / Standard deviation, where Mode = 3 × Median − 2 × Mean

Survey 1: Mode = 3(141) − 2(150) = 423 − 300 = 123
SKP = (150 − 123)/30 = 27/30 = 0.9
Survey 2: Mode = 3(152) − 2(145) = 456 − 290 = 166
SKP = (145 − 166)/30 = −21/30 = −0.7

So although the two surveys have the same dispersion, they are skewed in opposite directions: Survey 1 is positively skewed and Survey 2 is negatively skewed.
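
A small sketch that reproduces the calculation above in Python, using the slide's empirical relation Mode = 3 × Median − 2 × Mean:

```python
# Sketch: Karl Pearson's coefficient of skewness for the two surveys.
def pearson_skew(mean, median, sd):
    mode = 3 * median - 2 * mean      # empirical relation used on the slide
    return (mean - mode) / sd

print(round(pearson_skew(150, 141, 30), 2))   # Survey 1 ->  0.9
print(round(pearson_skew(145, 152, 30), 2))   # Survey 2 -> -0.7
```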
Variability
• Dispersion
  – How tightly clustered or how variable the values are in a data set.
• Example
  – Data set 1: [0, 25, 50, 75, 100]
  – Data set 2: [48, 49, 50, 51, 52]
  – Both have a mean of 50, but data set 1 clearly has greater variability than data set 2.
Summary Statistics:
Variability
• “Where are the ends of the distribution?
How are cases distributed around the
middle?”
– Range = difference between highest and
lowest scores
– Standard deviation = measure of variability;
involves deviations of scores from mean; most
scores fall within one standard deviation
above or below mean.
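
A short sketch comparing the range and standard deviation of the two data sets from the earlier example (standard library only):

```python
# Sketch: same mean, very different spread (the two data sets from the slide).
import statistics as stats

set1 = [0, 25, 50, 75, 100]
set2 = [48, 49, 50, 51, 52]

for name, data in [("set1", set1), ("set2", set2)]:
    print(name,
          "mean =", stats.mean(data),
          "range =", max(data) - min(data),
          "sd =", round(stats.pstdev(data), 2))   # population standard deviation
```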
Variability
• Coefficient of Variation
  – The standard deviation is an absolute measure of dispersion. When a comparison has to be made between two series, the relative measure of dispersion, known as the coefficient of variation, is used.
• CV = (σ / Xmean) × 100
Coeff. of variation
Given the following data, which project is more risky?

Year                               1    2    3    4    5
Project X (cash profit in USh m)   10   15   25   30   55
Project Y (cash profit in USh m)   5    20   40   40   30

In order to identify the riskier project, we have to identify which of these projects is less consistent in yielding profits. Hence we work out the coefficient of variation.

Project X:
X        x = Xi − Xmean   x²
10       −17              289
15       −12              144
25       −2               4
30       3                9
55       28               784
ΣX = 135                  Σx² = 1230

Project Y:
Y        y = Yi − Ymean   y²
5        −22              484
20       −7               49
40       13               169
40       13               169
30       3                9
ΣY = 135                  Σy² = 880
Coeff of variation
Project X
Here Xmean = ΣX/N = 135/5 = 27 and σx = √(Σx²/N)
⇒ σx = √(1230/5) = 15.68
⇒ CVx = (σx/Xmean) × 100 = (15.68/27) × 100 ≈ 58.1

Project Y
Here Ymean = ΣY/N = 135/5 = 27 and σy = √(Σy²/N)
⇒ σy = √(880/5) = 13.27
⇒ CVy = (σy/Ymean) × 100 = (13.27/27) × 100 ≈ 49.1

Since the coefficient of variation is higher for Project X than for Project Y, Project X is the more risky project even though the average profits are the same.
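
The same comparison can be scripted; the sketch below uses the population standard deviation (dividing by N), as the slide does:

```python
# Sketch: coefficient of variation for the two projects.
import statistics as stats

project_x = [10, 15, 25, 30, 55]
project_y = [5, 20, 40, 40, 30]

def cv(values):
    # CV = (sigma / mean) * 100, using the population standard deviation
    return stats.pstdev(values) / stats.mean(values) * 100

print(round(cv(project_x), 2))   # ~58.1 (the slide rounds sigma to 15.68 first)
print(round(cv(project_y), 2))   # ~49.1
```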
Dispersion: The Range
• The Range is one measure of dispersion
– The range is the difference between the maximum
and minimum values in a set
• Example
  – Data set 1: [1, 25, 50, 75, 100]; Range = 100 − 1 = 99
  – Data set 2: [48, 49, 50, 51, 52]; Range = 52 − 48 = 4
  – The range ignores how data are distributed and only takes the extreme scores into account

• RANGE = Xlargest − Xsmallest


Quartiles
• Quartiles split the ordered data into four quarters of 25% each, at the cut points Q1, Q2 (the median) and Q3
• Interquartile Range = Q3 − Q1
• The interquartile range is not affected by extreme values
Box plot
• The box plot is a standardized way to display the distribution of data based on the following five-number summary:
  – Minimum
  – First quartile
  – Median
  – Third quartile
  – Maximum
• In a box plot diagram, the central rectangle spans the first quartile to the third quartile (the interquartile range, IQR).
• A line inside the rectangle shows the median, and "whiskers" above and below the box show the locations of the minimum and maximum values.
• Such a box plot displays the full range of variation from min to max, the likely range of variation (the IQR), and the median.
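
A minimal sketch of the five-number summary and IQR with numpy; the sample values are illustrative, and the commented lines show how the corresponding box plot could be drawn with matplotlib:

```python
# Sketch: five-number summary and IQR for an illustrative sample.
import numpy as np

data = np.array([1, 2, 4, 6, 9, 10, 12, 14, 17])   # illustrative values

minimum = data.min()
q1, median, q3 = np.percentile(data, [25, 50, 75])
maximum = data.max()
iqr = q3 - q1

print(minimum, q1, median, q3, maximum, iqr)

# import matplotlib.pyplot as plt
# plt.boxplot(data); plt.show()   # draws the corresponding box plot
```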
Box plot

D1       D2
0.22     −5.13
−0.87    −2.19
−2.39    −2.43
−1.79    −3.83
0.37     0.5
−1.54    −3.25
1.28     4.32
−0.31    1.63
−0.74    5.18
1.72     −0.43
0.38     7.11
−0.17    4.87
−0.62    −3.1
−1.1     −5.81
0.3      3.76
0.15     6.31
2.3      2.58
0.19     0.07
−0.5     5.76
−0.09    3.5

• Here both data sets are uniformly balanced around zero, so the mean of each is around zero. In the first data set the variation ranges approximately from −2.5 to 2.5, whereas in the second data set it ranges approximately from −6 to 6.
Comparing plots
• Groups of a population can be compared using box and whisker plots. The overall visible spread and the difference between medians are used to draw a conclusion about whether or not there tends to be a difference between the two groups.
Comparing plots
• P=(DBM/OVS)×100
• Where −
– P = percentage difference
– DBM = Difference Between Medians.
– OVS = Overall Visible Spread.
• Rules
– For a sample size of 30 if this percentage is greater than
33% there tends to be a difference between two groups.
– For a sample size of 100 if this percentage is greater
than 20% there tends to be a difference between two
groups.
– For a sample size of 1000 if this percentage is greater
than 10% there tends to be a difference between two
groups.
Example: Describe the difference between the following sets of data.

Sr. No.   Name     Set A   Set B
1         Max      12      15
2         UQ       10      13
3         Median   7       10
4         LQ       6       9
5         Min      5       6

Solution
• OVS = 13 − 6 = 7
• DBM = 10 − 7 = 3
• P = (DBM/OVS) × 100 = (3/7) × 100 = 42.86
• As the percentage is over 33%, there is a difference between Set A and Set B. It is likely that Set B is greater than Set A.
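
A small sketch of the same percentage-difference calculation for Set A and Set B:

```python
# Sketch: difference between medians relative to overall visible spread,
# using the five-number summaries of Set A and Set B from the example.
set_a = {"min": 5, "lq": 6, "median": 7, "uq": 10, "max": 12}
set_b = {"min": 6, "lq": 9, "median": 10, "uq": 13, "max": 15}

ovs = set_b["uq"] - set_a["lq"]            # overall visible spread = 13 - 6 = 7
dbm = set_b["median"] - set_a["median"]    # difference between medians = 3
p = dbm / ovs * 100

print(round(p, 2))   # 42.86 -> above the 33% threshold for a sample of ~30
```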
INFERENTIAL STATISTICS
Inferential statistics can be used to test theories, determine associations between variables, and determine whether findings are significant and whether or not we can generalize from our sample to the entire population.
Correlation
• When to use it?
  – When you want to know about the association or relationship between two continuous variables
  – Ex) food intake and weight; drug dosage and blood pressure; air temperature and metabolic rate, etc.
• What does it tell you?
  – Whether a linear relationship exists between two variables, and how strong that relationship is
• What do the results look like?
  – The correlation coefficient = Pearson's r
  – Ranges from −1 to +1
Correlation
Guide for interpreting the strength of correlations:
§ 0 – 0.25 = little or no relationship
§ 0.25 – 0.50 = fair degree of relationship
§ 0.50 – 0.75 = moderate degree of relationship
§ 0.75 – 1.0 = strong relationship
§ 1.0 = perfect correlation
Correlation
• How do you interpret it?
  – If r is positive, high values of one variable are associated with high values of the other variable (both go in the SAME direction – ↑↑ or ↓↓)
    • Ex) competitiveness tends to rise with productivity, so the two variables are positively correlated
  – If r is negative, low values of one variable are associated with high values of the other variable (opposite directions – ↑↓ or ↓↑)
    • Ex) heart rate tends to be lower in persons who exercise frequently, so the two variables correlate negatively
  – A correlation of 0 indicates NO linear relationship
• How do you report it?
  – "Competitiveness was positively correlated with productivity (r = .75, p < .05)."

Tip: Correlation does NOT equal causation! Just because two variables are highly correlated, this does NOT mean that one CAUSES the other!
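
A hedged sketch of how Pearson's r is typically computed and reported in Python with scipy; the paired values below are invented purely for illustration:

```python
# Sketch: Pearson's r and its p-value with scipy (illustrative paired data).
from scipy import stats

productivity    = [12, 15, 18, 22, 25, 30, 33, 36]
competitiveness = [40, 42, 47, 55, 53, 60, 66, 70]

r, p = stats.pearsonr(productivity, competitiveness)

# Values to plug into a report such as "r = .xx, p < .05".
print(f"r = {r:.2f}, p = {p:.3f}")
```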
Regression
• Once the degree of relationship between variables
has been established using correlation analysis, it
is natural to delve into the nature of relationship.
• Regression analysis helps in determining the
cause and effect relationship between variables.
• It is then possible to predict the value of one variable (the dependent variable) from the values of the independent variables, using either a graphical method or an algebraic method.
Regression - graphical
• It involves drawing a scatter diagram with
independent variable on X-axis and dependent
variable on Y-axis. After that a line is drawn in such a
manner that it passes through most of the distribution,
with remaining points distributed almost evenly on
either side of the line.
• A regression line is known as the line of best fit that
summarizes the general movement of data. It shows
the best mean values of one variable corresponding to
mean values of the other. The regression line is based
on the criteria that it is a straight line that minimizes
the sum of squared deviations between the predicted
and observed values of the dependent variable.
Purpose of Regression Analysis
• Regression Analysis is Used Primarily to
Model Causality and Provide Prediction
– Predict the values of a dependent (response)
variable based on values of at least one
independent (explanatory) variable
– Explain the effect of the independent
variables on the dependent variable
Types of Regression Models
[Scatterplot panels illustrating a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship.]
Simple Linear Regression Model
• The relationship between the variables is described by a linear function
• A change in one variable causes the other variable to change
• A dependency of one variable on the other
Simple Linear Regression Model (continued)

The population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other:

Yi = β0 + β1·Xi + εi

where Yi is the dependent (response) variable, Xi is the independent (explanatory) variable, β0 is the population Y-intercept, β1 is the population slope coefficient, and εi is the random error. The population regression line gives the conditional mean µY|X.
Linear Regression Equation

The sample regression line provides an estimate of the population regression line as well as a predicted value of Y:

Yi = b0 + b1·Xi + ei

where b0 is the sample Y-intercept, b1 is the sample slope coefficient and ei is the residual. The fitted regression line (simple regression equation) gives the predicted value:

Ŷ = b0 + b1·X
Linear Regression Equation (continued)
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

Σi=1..n (Yi − Ŷi)² = Σi=1..n ei²

• b0 provides an estimate of β0
• b1 provides an estimate of β1
Regression example

A researcher has found that there is a correlation between the weight tendencies of fathers and sons. He is now interested in developing a regression equation to establish the nature and strength of the relationship between the two variables from the given data:

Weight of father (kg) – X:  69  63  66  64  67  64  70  66  68  67  65  71
Weight of son (kg) – Y:     70  65  68  65  69  66  68  65  71  67  64  72
Summary output – Excel

Regression Statistics
Multiple R           0.822
R Square             0.676        Y = 10.22 + 0.86X
Adjusted R Square    0.644
Standard Error       1.559
Observations         12

ANOVA
              df    SS       MS       F        Significance F
Regression    1     50.694   50.694   20.857   0.001
Residual      10    24.306   2.431
Total         11    75

               Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept      10.218         12.551           0.814    0.435     −17.746     38.183
X Variable 1   0.859          0.188            4.567    0.001     0.440       1.278
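
A sketch showing how this Excel output could be reproduced in Python with statsmodels, using the father/son weights from the example (assuming statsmodels and numpy are installed):

```python
# Sketch: ordinary least squares on the father/son weights with statsmodels,
# which should recover the slide's coefficients (intercept ~10.22, slope ~0.86).
import numpy as np
import statsmodels.api as sm

father = np.array([69, 63, 66, 64, 67, 64, 70, 66, 68, 67, 65, 71])
son    = np.array([70, 65, 68, 65, 69, 66, 68, 65, 71, 67, 64, 72])

X = sm.add_constant(father)          # adds the intercept column
model = sm.OLS(son, X).fit()

print(model.params)                  # intercept and slope
print(model.rsquared, model.rsquared_adj)
print(model.summary())               # full table: coefficients, t-stats, p-values
```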
Interpretation
• The t-test is a small-sample test. It was developed by William Gosset in 1908, who published it under the pen name "Student"; it is therefore known as Student's t-test. To apply the t-test, the value of the t-statistic is computed. In our example the t-value is 4.57, which is greater than 2, so the coefficient of 0.86 is statistically significant.
• R-squared measures the proportion of the variation in the dependent variable (Y) explained by the independent variable(s) (X) in a linear regression model. Here R² = 0.676, implying that about 68% of the variation in Y is explained by X.
• Adjusted R-squared adjusts the statistic for the number of independent variables in the model. If you add more and more useless variables to a model, adjusted R-squared will decrease; if you add more useful variables, it will increase.
• Adjusted R² will always be less than or equal to R². The adjustment is only needed when working with samples; in other words, it isn't necessary when you have data for an entire population.
Testing of hypotheses
Definition of p-value:
p-value = the probability of observing a value more extreme than the value actually observed, if the null hypothesis is true.

The smaller the p-value, the more unlikely the null hypothesis seems as an explanation for the data.

Interpretation for the example
• If p < 0.05, reject H0 and conclude that there is a relationship
• If p > 0.05, do not reject H0
Testing of hypotheses
• Null hypothesis H0 – the weight of the father has no influence on the son's weight
• Alternative hypothesis HA – the weight of the father influences the son's weight
• Statistical methods are used to test hypotheses
• The null hypothesis is the basis for the statistical test
Testing of hypotheses
Type I and Type II errors – example:

Decision     No relationship   Relationship
Accept H0    OK                Type II error
Reject H0    Type I error      OK


Summary
• Descriptive statistics can be used with nominal, ordinal, interval and ratio data
• Frequencies and percentages describe categorical data, and means and standard deviations describe continuous variables
• Inferential statistics can be used to determine associations between variables and predict the likelihood of outcomes or events
• Inferential statistics tell us if our findings are significant and if we can infer from our sample to the larger population
Contact
[email protected]
Trade Policy Expert
Trade Policy Training Centre in Africa
(trapca),
ESAMI
www.trapca.org
Thank you
