RM Data Analysis
What do we analyze?
• Variable – characteristic that varies
• Data – information on variables (values)
• Data set – lists variables, cases, values
• Qualitative variable – discrete values, categories.
– Frequencies, percentages, proportions
• Quantitative variable – range of numerical values
– Mean, median, range, standard deviation, etc.
Concepts
• A variable is any measured characteristic or attribute
that differs for different subjects. For example, if the
weight of 30 subjects were measured, then weight
would be a variable.
Quantitative and Qualitative
• Quantitative variables are measured on
an ordinal, interval, or ratio scale
• Qualitative variables are measured on a nominal scale.
Qualitative variables are sometimes called "categorical
variables.”
• If five-year old subjects were asked to name their
favorite color, then the variable would be qualitative.
If the time it took them to respond were measured,
then the variable would be quantitative.
Concepts
• Continuous and Discrete
Some variables (such as reaction time) are
measured on a continuous scale. There is an
infinite number of possible values these variables
can take on.
Qualitative Questions
• Open questions are harder to interpret as they give unique and wide-ranging
answers
– There is no common way of analysing them
• Organise comments into similar categories
• Attempt to identify patterns, associations and relationships
• To analyse opinions on a spreadsheet it is best to use rating questions
Creating a data set
• May involve coding and data entry
• Coding = assigning numerical value to
each value of a variable
– Gender: 1= male, 2 = female
– Year in school: 1= primary, 2= O level, etc.
– May need codes for missing data (no
response, not applicable)
– Large data sets come with codebooks
Descriptive Statistics
Descriptive statistics can be used to summarize
and describe a single variable (UNIvariate)
• Frequencies (counts) & Percentages
– Use with categorical (nominal) data
• Levels, types, groupings, yes/no
Example – number of cars per household for 20 households:
1 2 1 0 3 4 0 1 1 1 2 2 3 2 3 2 1 4 0 0
• Steps to be followed to present this data in a frequency
distribution table.
– Divide the results (x) into intervals, and then count the number
of results in each interval. In this case, the intervals would be the
number of households with no car (0), one car (1), two cars (2)
and so forth.
– Make a table with separate columns for the interval numbers
(the number of cars per household), the tallied results, and the
frequency of results in each interval. Label these columns
Number of cars, Tally and Frequency.
– Read the list of data from left to right and place a tally mark in
the appropriate row. For example, the first result is a 1, so place
a tally mark in the row beside where 1 appears in the interval
column (Number of cars). The next result is a 2, so place a tally
mark in the row beside the 2, and so on. When you reach your
fifth tally mark, draw a tally line through the preceding four
marks to make your final frequency calculations easier to read.
– Add up the number of tally marks in each row and record them
in the final column entitled Frequency
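The tallying steps above can be sketched in a few lines of Python using the standard library's Counter; this assumes the digit run in the example reads as individual observations (the first two values being 1 and 2, as the walkthrough states):

```python
from collections import Counter

# Number of cars per household, taken from the example above
# (reading the digit run as individual observations)
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

# Tally how many households fall in each interval (0 cars, 1 car, ...)
freq = Counter(cars)

# Print the frequency distribution table in ascending order of the interval
print("Number of cars  Frequency")
for x in sorted(freq):
    print(f"{x:>14}  {freq[x]:>9}")
```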
Frequency table for the distribution of cars per household

Number of cars (x)   Frequency (f)
0                    4
1                    6
2                    5
3                    3
4                    2
Total                20

[Bar chart: frequency against number of cars (0–4); pie chart: percentage share of each category]
[Bar chart: number of responses (0–60) for Strongly Agree, Agree, Disagree and Strongly Disagree]
Summary statistics
• Percent = relative frequencies;
standardized units.
• Cumulative frequency or percent =
frequency at or below a given category (at
least ordinal data required)
Visual Presentation of Data
• Bar graph (column chart, histogram): best
with fewer categories
• Pie chart: good for displaying percentages;
easily understood by general audience
• Line graph: good for numerical variables
with many values or for trend data
Bivariate histogram
Combining different units
Characteristics of firms and response to competition in Tanzania, Uganda and Kenya
[Combined chart: proportion of firms (left axis, 0–70%) that introduced a new product or innovated in Tanzania, Uganda and Kenya, broken down by exporter status, size (small 5–19, medium 20–99, large 100+ employees) and age (young <20 yrs, mid 21–50 yrs, old >51 yrs), with the number of firms shown on the right axis (0–500)]
Unusual features
• Outliers
• Gaps
Graphing data – scatter diagram
Patterns of Data in Scatterplots
Can we conclude that the two distributions are similar in their variation?
Solution:
A look at the information available reveals that both surveys have equal
dispersion of 30 cm. However, to establish whether the two distributions are
similar, a more comprehensive analysis is required, i.e. we need to work out
a measure of skewness.
SKP = (Mean − Mode)/Standard Deviation
Mode = 3·Median − 2·Mean
College A: Mode = 3(141) − 2(150) = 423 − 300 = 123
SKP = (150 − 123)/30 = 27/30 = 0.9
College B: Mode = 3(152) − 2(145) = 456 − 290 = 166
SKP = (145 − 166)/30 = −21/30 = −0.7
College A is positively skewed while College B is negatively skewed, so
despite the equal dispersion the two distributions are not similar.
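The same calculation can be scripted. A minimal sketch of Karl Pearson's skewness coefficient, using the empirical mode relation above:

```python
def pearson_skewness(mean, median, sd):
    """Karl Pearson's skewness: SKP = (Mean - Mode) / SD,
    using the empirical relation Mode = 3*Median - 2*Mean."""
    mode = 3 * median - 2 * mean
    return (mean - mode) / sd

# College A: mean 150 cm, median 141 cm, SD 30 cm
print(pearson_skewness(150, 141, 30))  # 0.9 (positively skewed)
# College B: mean 145 cm, median 152 cm, SD 30 cm
print(pearson_skewness(145, 152, 30))  # -0.7 (negatively skewed)
```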
Variability
• Dispersion
– How tightly clustered
or how variable the
values are in a data set.
• Example
– Data set 1:
[0,25,50,75,100]
– Data set 2:
[48,49,50,51,52]
– Both have a mean of
50, but data set 1
clearly has greater
variability than data set
2.
Summary Statistics:
Variability
• “Where are the ends of the distribution?
How are cases distributed around the
middle?”
– Range = difference between highest and
lowest scores
– Standard deviation = measure of variability;
involves deviations of scores from the mean; in
a roughly normal distribution, about two-thirds
of scores fall within one standard deviation
above or below the mean.
Variability
• Coefficient of Variation
– Standard deviation is an absolute measure of
dispersion. When a comparison has to be made
between two series, the relative measure of
dispersion, known as the coefficient of variation,
is used.
• CV=(σ/Xmean)×100
Coeff. of variation
Given the following data, which project is more
risky?
Year                              1    2    3    4    5
Project X (cash profit in USh m)  10   15   25   30   55
Project Y (cash profit in USh m)  5    20   40   40   30
Project Y
Here Ȳ = ∑Y/N = 135/5 = 27 and σy = √(∑y²/N), where y = Y − Ȳ
⇒ σy = √(880/5) = 13.27
⇒ CVy = (σy/Ȳ) × 100 = (13.27/27) × 100 = 49.1
Project X
Similarly X̄ = 135/5 = 27 and σx = √(1230/5) = 15.68
⇒ CVx = (σx/X̄) × 100 = (15.68/27) × 100 = 58.1
Since CVx > CVy, Project X is the more risky project.
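The coefficient of variation for both projects can be checked with a short script (population standard deviation, as used above):

```python
from math import sqrt

def coeff_of_variation(values):
    """CV = (population standard deviation / mean) * 100."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sd / mean * 100

x = [10, 15, 25, 30, 55]   # Project X cash profits (USh m)
y = [5, 20, 40, 40, 30]    # Project Y cash profits (USh m)

print(f"CV of X = {coeff_of_variation(x):.1f}%")  # ~58.1%: more variable, riskier
print(f"CV of Y = {coeff_of_variation(y):.1f}%")  # ~49.1%
```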
Interquartile range
• The quartiles Q1, Q2 (median) and Q3 divide the ordered data into four equal parts
• Interquartile Range = Q3 − Q1
• Not affected by extreme values
Box plot
• The box plot is a standardized way to
display the distribution of data based on
following five number summary.
– Minimum
– First Quartile
– Median
– Third Quartile
– Maximum
• In a box plot diagram, the central rectangle
spans the first quartile to the third quartile
(the interquartile range, IQR).
• A line inside the rectangle shows the
median and "whiskers" above and below
the box show the locations of the
minimum and maximum values.
• Such a box plot displays the full range of
variation (min to max), the likely range of
variation (the IQR), and the median.
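The five-number summary a box plot displays can be computed with the standard library; a minimal sketch with a small hypothetical sample:

```python
from statistics import quantiles

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum - the five numbers a box plot shows."""
    q1, med, q3 = quantiles(data, n=4)  # quartiles (default exclusive method)
    return min(data), q1, med, q3, max(data)

# A small hypothetical sample
sample = [5, 6, 7, 9, 10, 12, 15, 16, 18, 20]
lo, q1, med, q3, hi = five_number_summary(sample)
print(f"min={lo}, Q1={q1}, median={med}, Q3={q3}, max={hi}")
print(f"IQR = {q3 - q1}")
```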
Box plot
D1: 0.22, -0.87, -2.39, -1.79, 0.37, -1.54, 1.28, -0.31, -0.74, 1.72,
    0.38, -0.17, -0.62, -1.10, 0.30, 0.15, 2.30, 0.19, -0.50, -0.09
D2: -5.13, -2.19, -2.43, -3.83, 0.50, -3.25, 4.32, 1.63, 5.18, -0.43,
    7.11, 4.87, -3.10, -5.81, 3.76, 6.31, 2.58, 0.07, 5.76, 3.50

• Here both datasets are balanced around zero, so each mean is close to
zero. In the first data set the variation ranges approximately from
-2.5 to 2.5, whereas in the second data set it ranges approximately
from -6 to 6.
Comparing plots
• Groups within a population can be compared using
box and whisker plots. The overall visible spread
and the difference between medians are used to
conclude whether there tends to be a difference
between two groups.
Comparing plots
• P=(DBM/OVS)×100
• Where −
– P = percentage difference
– DBM = Difference Between Medians.
– OVS = Overall Visible Spread.
• Rules
– For a sample size of 30 if this percentage is greater than
33% there tends to be a difference between two groups.
– For a sample size of 100 if this percentage is greater
than 20% there tends to be a difference between two
groups.
– For a sample size of 1000 if this percentage is greater
than 10% there tends to be a difference between two
groups.
Solution
Describe the difference between the following sets of data.

Sr. No.  Measure  Set A  Set B
1        Max      12     15
2        UQ       10     13
3        Median   7      10
4        LQ       6      9
5        Min      5      6

• OVS = 13 − 6 = 7
• DBM = 10 − 7 = 3
• P = (DBM/OVS) × 100 = (3/7) × 100 = 42.86%
• As the percentage is over 33%, there is a difference between
Set A and Set B. It is likely that Set B is greater than Set A.
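The worked example can be verified in a few lines; OVS here is taken as the highest UQ minus the lowest LQ, matching the calculation above:

```python
# Five-number summaries from the example
set_a = {"min": 5, "lq": 6, "median": 7, "uq": 10, "max": 12}
set_b = {"min": 6, "lq": 9, "median": 10, "uq": 13, "max": 15}

# Overall visible spread: from the lowest LQ to the highest UQ
ovs = max(set_a["uq"], set_b["uq"]) - min(set_a["lq"], set_b["lq"])
# Difference between medians
dbm = abs(set_b["median"] - set_a["median"])

p = dbm / ovs * 100
print(f"P = {p:.2f}%")  # over 33% suggests a difference (for n = 30)
```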
INFERENTIAL STATISTICS
Inferential statistics can be used to test theories,
determine associations between variables, assess
whether findings are significant, and determine
whether or not we can generalize from our sample
to the entire population
Correlation
• When to use it?
– When you want to know about the association or relationship
between two continuous variables
• Ex) food intake and weight; drug dosage and blood pressure; air
temperature and metabolic rate, etc.
– The correlation coefficient r ranges from −1 (perfect negative association) to +1
(perfect positive association)
– If r is positive, high values of one variable are associated with high values of the
other variable (same direction - ↑↑ OR ↓↓)
– If r is negative, low values of one variable are associated with high values of the
other variable (opposite direction - ↑↓ OR ↓↑)
• Ex) Heart rate tends to be lower in persons who exercise frequently; the two
variables correlate negatively
Tip: Correlation does NOT equal causation!!! Just because two variables are highly correlated, this does NOT mean
that one CAUSES the other!!!
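A minimal sketch of computing Pearson's r, using hypothetical exercise/heart-rate data to illustrate a negative correlation:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: weekly exercise hours vs resting heart rate
exercise = [0, 2, 4, 6, 8, 10]
heart_rate = [80, 76, 73, 70, 66, 63]
r = pearson_r(exercise, heart_rate)
print(r)  # close to -1: strong negative correlation
```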
Regression
• Once the degree of relationship between variables
has been established using correlation analysis, it
is natural to delve into the nature of relationship.
• Regression analysis helps in determining the
cause and effect relationship between variables.
• It then becomes possible to predict the value of
one variable (called the dependent variable) from
the values of the independent variables, using
either a graphical method or an algebraic method.
Regression - graphical
• It involves drawing a scatter diagram with
independent variable on X-axis and dependent
variable on Y-axis. After that a line is drawn in such a
manner that it passes through most of the distribution,
with remaining points distributed almost evenly on
either side of the line.
• A regression line is known as the line of best fit that
summarizes the general movement of data. It shows
the best mean values of one variable corresponding to
mean values of the other. The regression line is based
on the criteria that it is a straight line that minimizes
the sum of squared deviations between the predicted
and observed values of the dependent variable.
Purpose of Regression Analysis
• Regression Analysis is Used Primarily to
Model Causality and Provide Prediction
– Predict the values of a dependent (response)
variable based on values of at least one
independent (explanatory) variable
– Explain the effect of the independent
variables on the dependent variable
Types of Regression Models
[Scatter plots: positive linear relationship; relationship NOT linear]

Population regression line (conditional mean):
Yi = β0 + β1Xi + εi, with μY|X = β0 + β1Xi
where Y is the dependent (response) variable and
X is the independent (explanatory) variable.
Linear Regression Equation
Sample regression line provides an estimate of the
population regression line as well as a predicted
value of Y
Sample regression line:
Yi = b0 + b1Xi + ei
where b0 = sample Y-intercept, b1 = sample slope
coefficient, ei = residual

Simple regression equation
(fitted regression line, predicted value):
Ŷi = b0 + b1Xi
Linear Regression Equation
(continued)
• b0 and b1 are obtained by finding the
values of b0 and b1 that minimize the
sum of the squared residuals
∑(i=1 to n) (Yi − Ŷi)² = ∑(i=1 to n) ei²

• b0 provides an estimate of β0
• b1 provides an estimate of β1
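The least-squares estimates have closed-form solutions: b1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)² and b0 = Ȳ − b1·X̄. A minimal sketch with hypothetical data:

```python
def least_squares(x, y):
    """Closed-form OLS estimates b0, b1 that minimize
    the sum of squared residuals."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical data lying exactly on Y = 2 + 3X
x = [1, 2, 3, 4, 5]
y = [5, 8, 11, 14, 17]
b0, b1 = least_squares(x, y)
print(b0, b1)  # 2.0 3.0 - residuals are all zero for a perfect fit
```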
Regression example
ANOVA
              df    SS       MS       F        Significance F
Regression    1     50.694   50.694   20.857   0.001
Residual      10    24.306   2.431
Total         11    75

              Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept     10.218        12.551          0.814   0.435    -17.746    38.183
X Variable 1  0.859         0.188           4.567   0.001    0.440      1.278
Interpretation
• The t-test is a small-sample test. It was developed by William
Gosset in 1908, who published it under the pen name "Student";
hence it is known as Student's t-test. To apply the t-test, the
value of the t-statistic is computed. In our example the t-value
is 4.57, which is greater than the rule-of-thumb value of 2, so
the coefficient of 0.86 is statistically significant
• R-squared measures the proportion of the variation in your
dependent variable (Y) explained by your independent
variables (X) in a linear regression model. Here R² =
SS(Regression)/SS(Total) = 50.694/75 ≈ 0.68, implying 68% of
the variation in Y is explained by X.
• Adjusted R-squared adjusts the statistic based on the number
of independent variables in the model. If you add more and
more useless variables to a model, adjusted r-squared will
decrease. If you add more useful variables, adjusted r-squared
will increase.
• Adjusted R² will always be less than or equal to R². You only
need adjusted R² when working with samples; in other words, it
isn't necessary when you have data from an entire population.
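The R-squared reported for the regression example above can be reproduced directly from the ANOVA sums of squares:

```python
# Sums of squares from the ANOVA table in the regression example
ss_regression = 50.694
ss_total = 75.0

# R-squared = explained variation / total variation
r_squared = ss_regression / ss_total
print(f"R-squared = {r_squared:.2f}")  # 0.68
```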
Testing of hypotheses
Definition of p-value.
p-value = probability of observing a value more
extreme than the value actually observed, if the null
hypothesis is true