Lecture 8 Data Analysis
Data Analysis
Agenda
Hypotheses Testing
• Examples
Class Work
Editing data (1)
• Responses to open-ended questions need to be edited
• Information noted down in a hurry by the interviewer,
observer, or researcher must be clearly deciphered
(lack of clarity causes confusion)
• Edit the data on the very same day that they are
collected
– Before details are forgotten
– So that respondents can still be contacted for further
information or clarification
• Edits should be identifiable through the use of a
different colour pencil or ink, so that the original
information is still available in case of doubts
later.
Editing data (2)
• Data have to be checked for incompleteness
and inconsistencies
– Some inconsistencies can be corrected logically (e.g.
marital status versus number of years married)
– Check with the respondent before making corrections if
possible (otherwise bias may affect the goodness of data)
• Some cases may not be simple to edit, and
omissions could go unnoticed and remain
unrectified (this affects the goodness of data)
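The consistency check described above can be sketched in a few lines of Python. This is only an illustration: the records and the field names (married, years_married) are made-up assumptions, and flagged cases are listed for follow-up rather than corrected automatically.

```python
# Hypothetical survey records; field names and values are illustrative only.
records = [
    {"id": 1, "married": True, "years_married": 5, "age": 30},
    {"id": 2, "married": False, "years_married": 3, "age": 41},    # inconsistent
    {"id": 3, "married": True, "years_married": None, "age": 28},  # incomplete
    {"id": 4, "married": False, "years_married": 0, "age": 35},
]


def find_problems(record):
    """Return a list of readable problems found in one record."""
    problems = []
    # Incompleteness: any field left blank (None).
    for field, value in record.items():
        if value is None:
            problems.append(f"missing value for '{field}'")
    # Inconsistency: not married, yet a positive number of years married.
    years = record.get("years_married")
    if record.get("married") is False and years not in (None, 0):
        problems.append("not married but years_married > 0")
    return problems


# Keep only the records that need a follow-up with the respondent.
flagged = {r["id"]: find_problems(r) for r in records if find_problems(r)}
for record_id, problems in flagged.items():
    print(record_id, problems)
```

Flagging instead of silently fixing keeps the original information available, in line with the editing advice on the previous slide.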
Handling missing data
• Reasons for blank responses: the respondent
– Could not understand the question
– Did not know the answer
– Was not willing to answer
– Was simply indifferent to the need to respond to the
entire questionnaire
• If 25% or more of the questionnaire is unanswered, it is
better to discard it than to use it for data analysis
• It is important to mention the number of returned but unused
responses due to excessive missing data in the final
report submitted to the sponsor
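A minimal sketch of the 25% rule, assuming blank answers are stored as None. The respondent IDs, the data, and the constant name MAX_BLANK_SHARE are hypothetical. The sketch treats a questionnaire with exactly 25% blanks as unusable, which is one possible reading of the rule.

```python
# Each list is one returned questionnaire; None marks a blank response.
responses = {
    "R01": [4, 5, 3, 4, 2, 5, 4, 3],
    "R02": [4, None, 3, 4, 2, 5, 4, 3],        # 1 of 8 blank (12.5%)
    "R03": [None, None, 3, 4, None, 5, 4, 3],  # 3 of 8 blank (37.5%)
    "R04": [None, 5, None, 4, 2, 5, 4, 3],     # 2 of 8 blank (25%)
}
MAX_BLANK_SHARE = 0.25  # assumed cut-off, taken from the 25% rule above


def blank_share(answers):
    """Fraction of items left unanswered in one questionnaire."""
    return sum(a is None for a in answers) / len(answers)


# Keep questionnaires whose share of blanks is below the cut-off.
usable = {rid: a for rid, a in responses.items()
          if blank_share(a) < MAX_BLANK_SHARE}
unused = sorted(set(responses) - set(usable))

# These counts are what the final report to the sponsor should mention.
print(f"Returned: {len(responses)}, usable: {len(usable)}, "
      f"unused due to missing data: {len(unused)} {unused}")
```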
Blank responses (minor scale)
• For an interval-scaled item with a midpoint (neutral
value), assign the midpoint of the scale
as the response to that particular item
• Allow the computer to ignore the blank responses
when analyzing (this reduces the sample size)
• Assign to the item the mean value of the responses
of all others who responded to that particular item
• Assign the mean of this particular respondent's responses
to all other questions measuring this variable
• Assign a random number within the range of the scale
• Use linear interpolation from adjacent points
• Use the linear trend method in SPSS
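Several of the options above can be compared side by side on a toy data set. The sketch below assumes a 5-point scale and a single blank response stored as None. The data are invented, and the interpolation simply averages the two neighbouring cases on the same item.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 5  # assumed 5-point interval scale

# Rows are respondents, columns are items measuring one variable.
data = [
    [4, 5, 4],
    [2, None, 3],  # the blank response to be handled
    [5, 4, 5],
    [3, 2, 1],
]
row, col = 1, 1  # position of the blank response

# Option: midpoint (neutral value) of the scale.
midpoint = (SCALE_MIN + SCALE_MAX) / 2

# Option: mean of all other respondents on the same item.
item_mean = mean(r[col] for r in data if r[col] is not None)

# Option: mean of this respondent's answers to the other items.
respondent_mean = mean(v for v in data[row] if v is not None)

# Option: linear interpolation between the adjacent cases on this item.
interpolated = (data[row - 1][col] + data[row + 1][col]) / 2

print(midpoint, round(item_mean, 2), respondent_mean, interpolated)
```

The four candidate values differ noticeably here, which is why the method chosen should be reported along with the results.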
Coding
• It is more convenient to use scanner sheets to input data than to key
them in. When scanning is not possible, it is better to
use a coding sheet first to transcribe data from the questionnaire and
then key in the data (this avoids flipping through large questionnaires)
• Coding is efficient when some thought is given to it at the time of
designing the questionnaire
• Human errors can occur while coding. At least 10% of the coded
questionnaires should therefore be checked for coding accuracy
• Rating scales
– Dichotomous scale
– Category scale
– Likert scale
– Numerical scale
– Semantic differential scale (via meaning)
– Itemized rating scale
– Fixed or constant sum rating scale
– Stapel scale
– Graphic rating scale
– Consensus scale
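For the coding-accuracy check mentioned above (at least 10% of the coded questionnaires), one way to choose which questionnaires to re-check is a simple random sample. The IDs, the total of 200, and the fixed seed are assumptions made for illustration.

```python
import math
import random

# Hypothetical IDs for 200 coded questionnaires.
questionnaire_ids = [f"Q{n:03d}" for n in range(1, 201)]
CHECK_PERCENT = 10  # the "at least 10%" rule

# Round up so that the sample never falls below 10%.
sample_size = math.ceil(len(questionnaire_ids) * CHECK_PERCENT / 100)

rng = random.Random(42)  # fixed seed so the selection can be reproduced
to_check = sorted(rng.sample(questionnaire_ids, sample_size))

print(f"Re-check {sample_size} questionnaires, starting with {to_check[:5]}")
```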
Categorization
Entering data
Data analysis
Objectives:
– Getting feel for the data
– Testing the goodness of data
– Testing the hypothesis developed for research
Getting feel for data
– Better to obtain
• Frequency distributions for the demographic variables
• Central tendency measures (mean, median, mode) and
dispersion measures (standard deviation, range and variance,
absolute deviation) on the other dependent and independent
variables
• An intercorrelation matrix of the respective variables,
irrespective of whether or not the hypotheses are directly
related to the analysis
– If an item shows little variation, the researcher might suspect that
the question was not properly worded (respondents did not understand
the intent of the question), unless the lack of variation can be
explained otherwise
– Graphical charts may make the data much easier to grasp
• Histograms, bar charts
– It is good to know how the dependent and independent variables in
the study are related to each other (inter-correlation matrix)
– If the correlation between two variables happens to be high
(> 0.75), check whether they really are two different concepts
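The inter-correlation check can be sketched as follows. The variable names and scores are invented, and the Pearson coefficient is computed by hand so that the example needs nothing beyond the standard library.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of six respondents on three variables.
variables = {
    "satisfaction": [3, 4, 2, 5, 4, 3],
    "morale": [3, 5, 2, 5, 4, 2],
    "tenure": [2, 7, 6, 4, 9, 1],
}
THRESHOLD = 0.75  # cut-off from the slide

# Upper triangle of the inter-correlation matrix, one pair at a time.
flagged = []
for a, b in combinations(variables, 2):
    r = pearson(variables[a], variables[b])
    print(f"{a} vs {b}: r = {r:.2f}")
    if abs(r) > THRESHOLD:
        flagged.append((a, b))

print("Check whether these are really two different concepts:", flagged)
```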
Statistics
• Descriptive statistics
– Statistics that describe the phenomenon of
interest
• Inferential statistics
– Statistical results that let us draw inferences
from a sample to population
– Can be categorized into two:
• Parametric
– Based on the assumption that the population from which
the sample is drawn is normally distributed and that data
are collected on an interval or ratio scale
• Non-parametric
– No explicit assumption about the distribution
– Used when data are collected on a nominal or ordinal
scale
Descriptive statistics (1)
• Frequency
– Simply refers to the number of times various subcategories of
a certain phenomenon occur (percentage and
cumulative percentage can be calculated)
– The information can be represented graphically as a histogram
or bar chart
– E.g.: a marketing manager wants to know how many units
of each brand of coffee are sold
– Frequencies are obtained on a nominally scaled
variable (grouped into non-overlapping subcategories)
– In management research, frequencies are generally obtained for
nominal variables such as gender and educational level
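As a small illustration of the coffee example, the frequency table with percentages and cumulative percentages could be built like this. The brand labels and the ten sales records are made up.

```python
from collections import Counter

# Hypothetical sales records: one entry per unit sold, labelled by brand.
units_sold = ["Brand A", "Brand B", "Brand A", "Brand C", "Brand A",
              "Brand B", "Brand A", "Brand C", "Brand B", "Brand A"]

counts = Counter(units_sold)
total = sum(counts.values())

# Build rows of (brand, frequency, percentage, cumulative percentage),
# ordered from the most to the least frequent brand.
table = []
cumulative = 0.0
for brand, frequency in counts.most_common():
    percentage = 100 * frequency / total
    cumulative += percentage
    table.append((brand, frequency, percentage, cumulative))

for brand, frequency, percentage, cumulative in table:
    print(f"{brand:8} {frequency:3d} {percentage:6.1f}% {cumulative:6.1f}%")
```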
Descriptive statistics (2)
Central tendency
• Mean:
– The mean is the sum of the data points divided by the number of
data points.
– The mean is the value most commonly referred to as the
average.
– We will use the term average as a synonym for the mean and the
term typical value to refer generically to measures of location.
• Median
– The median is the value of the point which has half the data smaller
than that point and half the data larger than that point.
• Mode
– The mode is the most frequently occurring phenomenon.
– It is not necessarily unique.
– The mode is typically used in a qualitative fashion.
– For grouped data, it is the midpoint of the class interval of the
histogram with the highest peak.
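The three measures can be computed with Python's standard statistics module. The data below are invented and chosen so that the mode is not unique, as noted above.

```python
from statistics import mean, median, multimode

# Illustrative data: seven observations with one large value.
data = [2, 3, 3, 4, 5, 5, 9]

average = mean(data)     # sum of the points divided by their number
middle = median(data)    # half the data below, half above
modes = multimode(data)  # every value tied for most frequent

print(f"mean = {average:.2f}, median = {middle}, mode(s) = {modes}")
```

Here the single large value pulls the mean above the median, and two values share the highest frequency.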
A few of the more common alternative measures of central tendency:
Descriptive statistics (3)
Dispersion (i)
• Range:
– The range is the largest value minus the smallest value in a data
set.
– Note that this measure is based only on the lowest and highest
extreme values in the sample.
– The spread near the center of the data is not captured at all.
• Variance
– The variance is roughly the arithmetic average of the squared
distance from the mean.
– Squaring the distance from the mean has the effect of giving
greater weight to values that are further from the mean.
– Although the variance is intended to be an overall measure of
spread, it can be greatly affected by the tail behavior.
• Standard deviation
– Measure of dispersion for interval and ratio scaled data
– Offers an index of the spread of a distribution or the variability in
the data
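A short sketch of these dispersion measures on invented data. The word "roughly" in the variance definition reflects that the sample variance divides by n - 1 rather than n, so both versions are shown.

```python
from statistics import pstdev, pvariance, stdev, variance

# Illustrative data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)  # uses only the two extreme values

population_variance = pvariance(data)  # average squared distance (divide by n)
sample_variance = variance(data)       # divide by n - 1 instead
population_sd = pstdev(data)           # square root of the population variance
sample_sd = stdev(data)                # square root of the sample variance

print(value_range, population_variance, round(sample_variance, 3),
      population_sd, round(sample_sd, 3))
```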
Descriptive statistics (4)
Dispersion (ii)
• Other measures
– Percentiles, deciles, and quartiles become meaningful when the
median is the measure of central tendency
– The median divides the total realm of observations into two equal
halves, the quartiles divide it into four equal parts, the deciles into
10 and the percentiles into 100.
– Inter-quartile range: the spread of the middle 50% of observations,
i.e. excluding the bottom and top 25% (third quartile minus first
quartile)
– Box-and-whisker plot
• A graphical device that portrays central tendencies, percentiles and
variability
• A box is drawn extending from the first to the third quartile and lines
are drawn from either side of the box to the extreme scores
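The numbers behind a box-and-whisker plot can be sketched with the standard library. The scores are illustrative, and the "inclusive" quartile method is only one of several conventions; statistical packages may give slightly different cut points.

```python
from statistics import median, quantiles

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative data

# Quartiles split the ordered observations into four equal parts.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of observations

# The box runs from q1 to q3; the whiskers extend to the extreme scores.
five_number_summary = {
    "min": min(scores),
    "q1": q1,
    "median": q2,
    "q3": q3,
    "max": max(scores),
}
print(five_number_summary, "IQR =", iqr)
```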
Inferential statistics (1)
• We may be interested in inferring, through analysis of
the data:
– The relationship between two variables
• E.g. between advertisement and sales
– Differences in a variable among different
subgroups
• E.g. whether women or men buy more of the product
– How several independent variables might
explain the variance in a dependent variable
• E.g. how investments in the stock market are influenced
by the level of unemployment, perceptions of the
economy, disposable incomes, and dividend
expectations
Inferential statistics (2)
• Correlation
– To know how one variable is related to the others
– A Pearson correlation matrix indicates the direction, strength and
significance of the bivariate relationship
– The correlation is derived by assessing the variations in one
variable as another variable varies (scatter diagram helps to
identify the variation)
– Perfect positive correlation is +1 and perfect negative correlation
is -1.
– For example, a correlation of r = 0.56 also indicates that the
variables explain the variance in one another to the extent
of 31.4% (0.56² ≈ 0.314)
– Strength of relationship between two variables can be generated
for variables measured on interval or ratio scale
– Non-parametric tests are also available to assess the relationship
between variables not measured on an interval or ratio scale,
e.g. to examine the relationship between two ordinal variables:
• Spearman’s rank correlation
• Kendall’s rank correlation
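To make the correlation ideas concrete, the sketch below computes a Pearson coefficient, the variance explained by r = 0.56, and a Spearman rank correlation obtained by applying the Pearson formula to ranks. The advertising and sales figures are invented, and the helper names (pearson, ranks, spearman) are this example's own.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


advertising = [10, 20, 30, 40, 50]  # hypothetical spend
sales = [12, 25, 31, 38, 70]        # hypothetical sales

r = pearson(advertising, sales)
rho = spearman(advertising, sales)  # sales rise with every step in spend
explained = 0.56 ** 2               # share of variance explained by r = 0.56

print(f"r = {r:.3f}, rho = {rho:.3f}, r squared for 0.56 = {explained:.4f}")
```

Because sales increase at every step of advertising, the rank correlation is perfect even though the Pearson coefficient is below 1.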
Hypothesis testing
Inferential statistics (5)
• Significant mean differences: ANOVA
– The t-test compares only two groups, whereas ANOVA helps to examine
significant mean differences among more than two groups on an
interval or ratio scaled dependent variable
• E.g. examine the significant mean differences in the amount of sales by
those who are sent to training schools, those who are given OJT, and
those who have been tutored by the sales manager
– The F-distribution is a probability distribution of the ratio of two
sample variances; the family of distributions changes with changes in
sample size
– Tests can be used to detect exactly where the mean differences lie
• Duncan multiple range test
• Tukey's test
• Student-Newman-Keuls test
– Kruskal-Wallis one-way analysis of variance is the non-parametric
test used when the dependent variable is on an ordinal scale and the
independent variable is nominally scaled
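A hand computation of the one-way ANOVA F statistic for the three training groups may help show where F comes from. The sales figures are invented. The sketch stops at the F value, which would then be compared with an F table or evaluated in a statistical package.

```python
# Hypothetical sales per salesperson in each training group.
groups = {
    "training_school": [5, 6, 7],
    "on_the_job": [7, 8, 9],
    "tutored": [9, 10, 11],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = sum(all_values) / len(all_values)
k = len(groups)      # number of groups
n = len(all_values)  # total number of observations

# Between-groups variation: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())

# Within-groups variation: how far observations are from their group mean.
ss_within = sum((v - sum(g) / len(g)) ** 2
                for g in groups.values() for v in g)

ms_between = ss_between / (k - 1)  # mean square with k - 1 degrees of freedom
ms_within = ss_within / (n - k)    # mean square with n - k degrees of freedom
f_stat = ms_between / ms_within

print(f"F({k - 1}, {n - k}) = {f_stat:.2f}")
```

A large F means the group means differ by more than the spread inside the groups would suggest.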
Inferential statistics (6)
• Multiple regression analysis (1):
– Although the correlation coefficient r indicates the strength of the
relationship between two variables, it gives us no idea of how much of the
variance in the dependent or criterion variable will be explained
when several variables are theorized to simultaneously influence it
– The independent variables may be correlated with the dependent
variable in varying degrees, but they might also be inter-correlated
• E.g. task difficulty is likely to be related to supervisory support, pay
might be correlated to task difficulty, and all three (task difficulty,
supervisory support and pay) might influence the organizational culture
– When the variables are jointly regressed against the dependent
variable in an effort to explain the variance in it, the individual
correlations collapse into what is called a multiple r or multiple
correlation
– The square of multiple r, R-square as it is commonly known, is the
amount of variance explained in the dependent variable by the
predictors (multiple regression analysis)
Inferential statistics (7)
• Multiple regression analysis (2):
– When the R-square value, the F statistic, and its significance level
are known, it is possible to interpret the results. For example:
• R-squared value: 0.63
• F value: 25.56
• Significance level: p < 0.001
– It is possible to say that 63% of the variance has been
significantly explained by the set of predictors. There is less than
a 0.1% chance of this not holding true
– Multiple regression analysis is done to examine the simultaneous
effects of several independent variables on a dependent variable
that is interval scaled.
– If we want to know which among the set of predictors is most
important in explaining the variance, do stepwise multiple
regression
– Multiple regression analysis is also done to trace the sequential
antecedents that cause the dependent variable through what is
known as path analysis.
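To show how R-square and the F statistic arise, here is a minimal ordinary least squares sketch with two predictors, written with the standard library only. The tiny data set is artificial and was constructed so that the answer is easy to verify by hand. The helper names solve and fit are this example's own, not part of any package. F is computed as (R²/k) / ((1 - R²)/(n - k - 1)) for k predictors and n observations.

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(predictors, y):
    """Least squares with an intercept; returns (coefficients, R2, F)."""
    n, k = len(y), len(predictors)
    columns = [[1.0] * n] + [[float(v) for v in p] for p in predictors]
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
           for ci in columns]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    coef = solve(xtx, xty)
    fitted = [sum(c * col[i] for c, col in zip(coef, columns))
              for i in range(n)]
    y_mean = sum(y) / n
    ss_total = sum((v - y_mean) ** 2 for v in y)
    ss_resid = sum((v - f) ** 2 for v, f in zip(y, fitted))
    r2 = 1 - ss_resid / ss_total                  # variance explained
    f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))  # overall F statistic
    return coef, r2, f_stat


# Artificial data: y = 2 + 3*x1 + 1*x2 plus a small residual.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y = [-1.5, -0.5, 3.5, 6.5]

coef, r2, f_stat = fit([x1, x2], y)
print([round(c, 3) for c in coef], round(r2, 4), round(f_stat, 2))
```

With only four observations this is purely a demonstration of the arithmetic. A real analysis would use far more data and a statistical package that also reports the significance level.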
Selection of statistical techniques
Use of non parametric tests
Thank you