Chapter Summary - SRM - Triad 2
Triad 2:
G Nandhan (H010-19)
Mary Shannan (H022-19)
Vivek Singh Rana (H067-19)
Chapter 12 – Analysis of quantitative data
12.1 Quantitative data
Raw quantitative data that have not been processed or analysed convey very little meaning to
most people. To turn these data into useful information, they need to be processed.
Quantitative data refer to all numerical primary and secondary data and can help the
researcher to answer research questions and meet objectives.
Types of data
Quantitative data can be divided into two groups: categorical data and numerical data.
Categorical data are those whose values cannot be measured numerically but can be classified
into sets/categories according to the characteristics that describe or identify the variable or
they could be placed in rank order. There are two types of categorical data:
• Descriptive/nominal data – with these data one can simply count the number of
occurrences in each category of a variable. When a variable is divided into two categories
(female/male, for example) then the data are known as dichotomous data.
• Ranked/ordinal data – these are a more precise form of categorical data, since each
data value can be placed in a definite rank order. An example of ranked data may be
answers to rating or scale questions.
Alternatively, numerical data are those whose values are numerically measured or counted as
quantities (Berman 2008). Numerical data are therefore more precise than categorical ones
because one can assign each data value a position on a numerical scale. Numerical data can
be subdivided in two ways: into interval and ratio data, or into continuous and discrete
data. With interval data one can state the difference (interval) between any two data values
of a certain variable, whereas with ratio data one can also calculate the relative difference
(ratio) between any two data values. Continuous data are those whose values can take any
value (provided they are measured accurately enough), while discrete data can only take
certain values and so can be measured precisely (often as whole numbers/integers).
After determining the types of data that are to be collected, the researcher can start to
enter the data into computer data-processing software (e.g. SPSS or Excel). To do this the
data need to be coded using numerical codes, which enables the researcher to enter the data
quickly and with fewer errors. Once this is done, the data should be checked for errors.
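The coding step above can be sketched in Python. This is a minimal illustration, not the chapter's procedure; the category labels, codes and responses are invented for the example.

```python
# Minimal sketch: coding categorical questionnaire answers numerically
# before entry into analysis software. The codebook and responses are
# hypothetical examples.
codebook = {"male": 1, "female": 2}            # a dichotomous variable
responses = ["female", "male", "female", "female"]

# numeric codes can be entered faster and with fewer errors than labels
coded = [codebook[r] for r in responses]
print(coded)
```

A real codebook would also reserve a code for missing answers so that errors and blanks can be checked after entry.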
Exploring variables
The easiest way of summarising the data is by using tables. However, tables give no visual
emphasis to the highest or lowest values, so diagrams may be a better option for
summarising the data. One way to present data is a bar chart, where the height or length of
each bar represents the frequency of occurrence. Bar charts are similar to histograms,
another way of presenting data, in which the area of each bar represents the frequency of
occurrence and the continuous nature of the data is emphasised by the absence of gaps
between bars. Finally, a pictogram, also similar to a bar chart, shows a series of pictures
chosen to represent the data. Other kinds of data presentation also exist.
Shapes of diagrams
If a diagram shows a bunching to the left and a long tail to the right (figure 12.3 on page 291)
then the data are ‘positively skewed’. If this is the other way around then the data are
‘negatively skewed’. When the data are distributed equally on either side of the highest
frequency they are symmetrically distributed (not skewed).
A bell-shaped curve is called a normal distribution. With the indicator ‘kurtosis’ one
can compare a diagram’s pointedness or flatness with that of the normal distribution. When a
distribution is flatter than normal, it is called platykurtic and the kurtosis value is negative.
When the distribution is more peaked, it is leptokurtic and the kurtosis value is positive.
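The shape measures above can be sketched directly from their moment definitions. This is an illustrative Python sketch with invented data, using simple population-moment formulas rather than the corrected estimators statistical software reports.

```python
# Sketch: sample skewness and excess kurtosis from first principles,
# illustrating positive skew and a platykurtic shape. Data are invented.
from statistics import mean

def skewness(xs):
    n, m = len(xs), mean(xs)
    s2 = sum((x - m) ** 2 for x in xs) / n          # population variance
    return sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5

def excess_kurtosis(xs):
    n, m = len(xs), mean(xs)
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / n / s2 ** 2 - 3

right_tailed = [1, 1, 2, 2, 3, 4, 9]   # bunched to the left, long right tail
flat = [1, 2, 3, 4, 5]                 # flatter than a normal curve

print(skewness(right_tailed) > 0)      # positively skewed
print(excess_kurtosis(flat) < 0)       # platykurtic (negative kurtosis)
```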
Comparing variables
Contingency tables, or cross-tabulations, are approaches one could use to examine the
interdependence between variables. Another is Tukey’s exploratory data analysis
approach, which is a good way to understand the data using
diagrams. Descriptive statistics, on the other hand, enable one to describe the variables
numerically. They describe a variable’s central tendency and its dispersion.
Measures of central tendency give a general impression of values that could be
seen as common, middling or average. These measures are:
1. Mean – the average value, calculated by summing all values and dividing by their number
2. Median – the middle value when all values are arranged in rank order
3. Mode – the most frequently occurring value
The dispersion (how data are distributed around the central tendency) could be described by:
4. Inter-quartile range – the difference within the middle 50 per cent of values
5. Standard deviation – the extent to which values differ from the mean
6. Range – the difference between the lowest and the highest values
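The three dispersion measures above can be sketched in Python with the standard library; the data values are invented for the example.

```python
# Sketch of the dispersion measures: range, inter-quartile range and
# sample standard deviation, on invented data.
from statistics import stdev, quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(data) - min(data)     # range: highest minus lowest
q1, q2, q3 = quantiles(data, n=4)       # quartile cut points
iqr = q3 - q1                           # middle 50 per cent of values
sd = stdev(data)                        # sample standard deviation

print(value_range, iqr, round(sd, 2))
```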
There are two general groups of statistical significance tests: the non-parametric tests (used
when the data are not normally distributed) and the parametric tests (these are used with
numerical data).
If the latter assumption (for the chi-square test, that expected cell frequencies are not too
low) is not met, the accepted solution is to combine rows and columns where this produces
meaningful data.
An alternative statistic used to measure the association between two variables is Phi. This
statistic measures the association on a scale between –1 (perfect negative association),
through 0 (no association) to 1 (perfect association).
Ranked data – Sometimes it is necessary to see whether the distribution of an observed set
of values for each category of a variable differs from a specified distribution.
Numerical data – If a numerical variable can be divided into two distinct groups using a
descriptive variable you can assess the likelihood of these groups being different using an
independent groups t-test.
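The independent-groups t statistic can be sketched from its pooled-variance formula. The group values below are invented, and the p-value step (comparing t with the t distribution) is left to statistical software.

```python
# Sketch: independent-groups t statistic with pooled variance, comparing
# a numerical variable split into two groups. Data are invented.
from statistics import mean, variance
from math import sqrt

def independent_t(a, b):
    na, nb = len(a), len(b)
    # pooled estimate of the common variance of the two groups
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))

group_a = [12, 14, 11, 15, 13]
group_b = [9, 10, 8, 11, 10]
t = independent_t(group_a, group_b)
print(round(t, 2))
```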
The following assumptions need to be met before using one-way ANOVA. More detailed
discussion is available in Hays (1994) and Dancey and Reidy (2008).
• Each data value is independent and does not relate to any of the other data values. This
means that you should not use one-way ANOVA where data values are related in some way,
such as the same case being tested repeatedly.
• The data for each group are normally distributed. This assumption is not particularly
important provided that the number of cases in each group is large (30 or more).
• The data for each group have the same variance (standard deviation squared). However,
provided that the number of cases in the largest group is not more than 1.5 times that of the
smallest group, this appears to have very little effect on the test results.
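The one-way ANOVA F statistic compares between-group variation with within-group variation; the sketch below computes it from sums of squares on three invented independent groups, leaving the p-value to software.

```python
# Sketch of the one-way ANOVA F statistic for independent groups.
# Group values are invented; a large F suggests the group means differ.
from statistics import mean

def one_way_f(groups):
    grand = mean(x for g in groups for x in g)
    n = sum(len(g) for g in groups)
    k = len(groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f = one_way_f([[3, 4, 5], [6, 7, 8], [9, 10, 11]])
print(round(f, 2))
```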
The correlation coefficient quantifies the strength of a linear relationship between two
ranked or numerical variables on a scale from –1 to +1. A value of +1 means a perfect
positive correlation: the two variables are precisely related such that when one increases,
the other increases as well. A value of –1 indicates a perfect negative correlation: the two
variables are precisely related, but when one increases the other decreases.
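The correlation coefficient can be sketched from Pearson's formula; the two variables below are invented and chosen to be perfectly proportional, so the coefficient comes out at +1.

```python
# Sketch of Pearson's correlation coefficient for two numerical
# variables (invented data; values rise together, so r is +1).
from statistics import mean
from math import sqrt

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

hours = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]        # perfectly proportional to hours
r = pearson_r(hours, sales)
print(r)
```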
The regression equation takes the form:
AoSi = a + (b1 * MEi) + (b2 * NSSi)
where:
AoS is the Amount of Sales
ME is the Marketing Expenditure
NSS is the Number of Sales Staff
a is the regression constant
b1 and b2 are the beta coefficients
This equation can be translated as stating:
Amount of Salesi = a + (b1 * Marketing Expenditurei) + (b2 * Number of Sales Staffi)
When calculating a regression equation you need to ensure the following assumptions are
met:
• The extent to which the data values for the dependent and independent variables have
equal variances (this term was explained earlier in Section 12.4), also known as
homoscedasticity. Again, analysis software usually contains statistical tests for equal
variance.
• The data for the independent variables and dependent variable are normally
distributed.
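As a simplified illustration of fitting the constant a and a beta coefficient by ordinary least squares, the sketch below uses a single predictor (Marketing Expenditure only); the chapter's equation has a second predictor, which software such as SPSS would handle. The data are invented.

```python
# Sketch: ordinary least squares with one predictor, estimating
# sales = a + b1 * marketing. Data are invented for illustration.
from statistics import mean

marketing = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

mx, my = mean(marketing), mean(sales)
b1 = (sum((x - mx) * (y - my) for x, y in zip(marketing, sales))
      / sum((x - mx) ** 2 for x in marketing))
a = my - b1 * mx            # regression constant

print(a, b1)                # fitted equation: sales = a + b1 * marketing
```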
Examining trends
When examining longitudinal data the first thing we recommend you do is to draw a line
graph to obtain a visual representation of the trend (Figure 12.7). Subsequent to this,
statistical analyses can be undertaken. Three of the more common uses of such analyses are:
• to examine the trend or relative change for a single variable over time;
• to compare trends or the relative change for variables measured in different units or of
different magnitudes;
• to determine the long-term trend and forecast future values for a variable.
index number of case = (data value for case / data value for base period) * 100
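The index-number formula above is a one-liner in Python; the base period and data values below are invented.

```python
# Sketch of the index-number formula with an invented base period.
base = 250                        # data value for the base period
values = [250, 275, 300]          # data values for successive periods

index = [v / base * 100 for v in values]
print(index)                      # base period indexes to 100
```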
To compare trends
To answer some other research question(s) and to meet the associated objectives you may
need to compare trends between two or more variables measured in different units or at
different magnitudes.
Once the trend has been established, it is possible to forecast future values by continuing the
trend forward for time periods for which data have not been collected. This involves
calculating the long-term trend – that is, the amount by which values are changing
each time period after variations have been smoothed out. Once again, this is relatively
straightforward to calculate using analysis software. Forecasting can also be undertaken using
other statistical methods, including regression analysis. If you are using regression for your
time series analysis, the Durbin-Watson statistic can be used to discover whether the value
of your dependent variable at time t is related to its value at the previous time period.
This relationship, commonly referred to as autocorrelation or serial correlation, is
important because it means that the results of your regression analysis are less likely to be
reliable.
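The Durbin-Watson statistic can be sketched directly from its definition on a series of regression residuals; the residual series below are invented. Values near 2 suggest no autocorrelation, while values near 0 or 4 suggest positive or negative serial correlation.

```python
# Sketch of the Durbin-Watson statistic on invented residual series.
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

alternating = [1, -1, 1, -1, 1, -1]   # sign flips: negative autocorrelation
trending = [1, 1, 1, -1, -1, -1]      # long runs: positive autocorrelation

print(round(durbin_watson(alternating), 2),
      round(durbin_watson(trending), 2))
```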
Chapter 13 – Quantitative Data Analysis
Quantitative data analysis covers the basic techniques for analyzing quantitative data.
Usually this analysis is done using computer software. This chapter deals with the various
aspects and methods of quantitative data analysis rather than the step-by-step mechanics of
carrying them out.
Quantitative data analysis does not start only after the collection of data: suitable steps
and methods of analysis must be planned before the actual collection of data. This helps us
design questionnaires and methods that are feasible to analyze. Lack of proper planning
will hinder the analysis process and may also lead to improper results.
One of the main problems with quantitative data analysis is missing data. When analyzing a
questionnaire, for example, some of the fields may be left blank. This may be because the
participants did not want to provide that data, or because the question is designed in such
a manner that it is redundant for some respondents to answer it. Hence suitable planning
must take place before the actual collection of the data.
Types of variables:
There are four types of quantitative variables that will be generated during the course of the
research. They are as follows:
1. Interval/ratio variable
These are the variables where the distances between the categories are identical across the
range of categories and can be rank ordered. Interval/ratio variables are regarded as the
highest level of measurement because they permit a wider variety of statistical analysis to
be conducted on them than with any other type.
2. Ordinal variable
These are the variables whose categories can be rank ordered similar to the Interval/ratio
variables but the distances between the categories are not equal across the range.
3. Nominal variable
These variables comprise categories that cannot be rank ordered. They are also known as
categorical variables.
4. Dichotomous variable
These variables contain data that have only two categories. For example, gender etc.
1) Univariate analysis
This refers to the analysis of one variable at a time. This can be done with the following
methods
i) Frequency tables
Frequency table provides the number of people and the percentage belonging to each
of the categories for the variable in the question. It can be used in relation to all of the
different types of variables. If an interval/ratio variable is to be presented in a
frequency table format, the categories will need to be grouped, and the groupings
should not overlap.
ii) Diagrams
These are the most frequently used methods of displaying quantitative data. They are
easy to interpret and understand. Some of the common diagrams are bar chart, pie
chart and histograms. Bar chart and pie charts work well with nominal and ordinal
variables. For interval/ratio variable, histograms are used.
iii) Measures of central tendency
(a) Arithmetic mean – Average of all available data or the data being
analyzed. This is employed with interval/ratio variables. The main issue
with mean is it is vulnerable to outliers.
(b) Median – This is the mid-point of the distribution of values. The median is
not affected by outliers. The data have to be sorted in ascending order
before finding the median. If there is an even number of data values, the
mean of the two middle values is taken as the median. This can be used
with both interval/ratio variables and ordinal variables.
(c) Mode – The mode is the value that occurs most frequently in a
distribution. This can be employed with all types of variables.
iv) Measures of dispersion
(a) Range – The difference between the maximum and minimum values in a
distribution of values associated with an interval/ratio variable. It is
affected by the presence of outliers.
(b) Standard deviation – The average amount of variation around the mean. It
is also affected by outliers, but their effect is offset by dividing by
the number of values in the distribution.
(c) Box-plot – Provides an indication of both central tendency and dispersion
in a single diagram. It also indicates the outliers in the distribution.
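The measures of central tendency and the range above can be sketched with Python's standard library; the ratings data are invented.

```python
# Sketch of central tendency (mean, median, mode) and the range on
# invented ratings data.
from statistics import mean, median, mode

ratings = [3, 4, 4, 5, 2, 4, 1]

print(mean(ratings))                  # arithmetic mean (interval/ratio)
print(median(ratings))                # mid-point of the sorted values
print(mode(ratings))                  # most frequently occurring value
print(max(ratings) - min(ratings))    # range, a measure of dispersion
```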
2) Bivariate analysis
It is concerned with the analysis of two variables at a time in order to uncover whether or
not the two variables are related. Exploring the relationships between variables means
searching for evidence that the variation in one variable coincides with variation in
another variable. All the methods mentioned here analyze the relationships between
variables, not causality. Some of the methods of analysis are as follows:
i) Contingency tables – It is similar to the frequency table but it allows two variables
to be simultaneously analyzed so that the relationships between the two variables
can be examined.
ii) Pearson’s r – This is used for pairs of interval/ratio variables. Some of its key
features are:
(a) The coefficient will almost certainly lie between 0 and 1 in magnitude –
this indicates the strength of a relationship
(b) The closer the coefficient is to 1, the stronger the relationship, and
vice versa
(c) The coefficient will be either positive or negative – this indicates the
direction of a relationship
iii) Spearman’s rho – This is similar to the Pearson’s r method but this is used for
pairs of ordinal variables.
iv) Phi and Cramer’s V – Phi is used for the analysis of the relationship between two
dichotomous variables. Similarly, Cramer’s V is used with nominal variables.
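The phi coefficient for two dichotomous variables can be sketched from a 2x2 contingency table; the counts below are invented.

```python
# Sketch: phi coefficient for two dichotomous variables from a 2x2
# contingency table [[a, b], [c, d]]. Counts are invented.
from math import sqrt

def phi(table):
    (a, b), (c, d) = table
    return ((a * d - b * c) /
            sqrt((a + b) * (c + d) * (a + c) * (b + d)))

# e.g. rows = gender, columns = yes/no answer (hypothetical counts)
strength = phi([[30, 10], [10, 30]])
print(round(strength, 2))
```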
3) Multivariate analysis
Multivariate analysis entails the simultaneous analysis of three or more variables. This
basically answers three questions: could the relationship be spurious? Could there be an
intervening variable? Could a third variable moderate the relationship? It uses a mix of
the methods mentioned previously.
One main criterion to take care of when analyzing data using the various methods is
statistical significance. Statistical significance allows the analyst to estimate how
confident he or she can be that the results derived from a study based on a randomly
selected sample are generalizable to the population from which the sample was taken. Hence
proper sampling is essential; a lack of proper sampling undermines the whole purpose of the
analysis. The concept of a confidence level is used to express how sure we can be, and the
chi-square test is a commonly used test of statistical significance for cross-tabulated data.
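The chi-square statistic underlying the test can be sketched from observed and expected counts; the table below is invented, and comparing the statistic with the chi-square distribution to get a p-value is left to software.

```python
# Sketch of the chi-square statistic for a contingency table of
# invented counts: sum of (observed - expected)^2 / expected.
def chi_square(observed):
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (o - expected) ** 2 / expected
    return stat

stat = chi_square([[20, 30], [30, 20]])
print(round(stat, 2))
```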
CHAPTER 14: Using IBM SPSS statistics
Key Points
o SPSS can be used to implement the techniques learned in Quantitative Data Analysis,
but learning new software requires perseverance, and at times the results obtained may
not seem to be worth the learning process.
o But it is worth it – it would take you far longer to perform the calculations by hand
on a sample of around 100 than to learn the software.
o If you find yourself moving into much more advanced techniques, the time saved is
even more substantial, particularly with large samples.
o It is better to become familiar with SPSS before you begin designing your research
instruments, so you are aware of difficulties you might have in presenting your data in
SPSS at an early stage.
PRE-REQUISITE
➢ Variables
➢ Data
➢ Measurement scales
➢ Code book
➢ Steps involved in hypothesis testing
I. VARIABLES
• A concept which can take on different quantitative values is called a variable.
• Ex. What are the variables you would consider in buying a second hand bike?
Brand
Type
Age
Condition (Excellent, good, poor)
Price
• Dichotomous Variables are variables having two values only.
Yes or No
Male or Female
• Income, age and test scores are examples of Continuous Variables.
• These variables may take on any value within a given range, or in some cases
an infinite set.
• TYPES OF VARIABLES
Independent Variables
Dependent Variables
Moderating Variables
Extraneous Variables
• Before you can enter the information from your questionnaire, interviews or
experiments into SPSS it is necessary to prepare a “code book”.
• This is a summary of the instructions you will use to convert the information
obtained from each subject or case into a format that SPSS can understand.
• Two Windows
Data window and variable window
Output window
• DATA EDITOR
Spreadsheet-like system for defining, entering, editing and displaying data.
The extension of the saved file will be ‘.sav’.
• FREQUENCIES
This analysis produces frequency tables showing frequency counts and
percentages of the values of individual variables.
• DESCRIPTIVES
This analysis shows the maximum, minimum, mean and standard deviation of
the variables.
• CORRELATION ANALYSIS
Correlation analysis is used to describe the strength and direction of the linear
relationship between two variables.
• RELIABILITY.
The reliability of a scale indicates how free it is from random error.
Two frequently used indicators of a scale’s reliability are test-retest
reliability (also referred to as temporal stability) and internal
consistency.
The test-retest reliability of a scale is assessed by administering it to
the same people on two different occasions and calculating the
correlation between the two scores obtained.
The second aspect of reliability that can be assessed is internal
consistency.
This is the degree to which the items that make up the scale are all
measuring the same underlying attribute (i.e. the extent to which the
items ‘hang together’).
Internal consistency can be measured in a number of ways.
The most commonly used statistic is Cronbach’s co-efficient alpha
(available using SPSS)
This statistic provides an indication of the average correlation among
all of the items that make up the scale.
Values range from 0 to 1, with higher values indicating greater
reliability.
While different levels of reliability are required, depending on the
nature and purpose of the scale, Nunnally (1978) recommends a minimum
level of .7.
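Cronbach's coefficient alpha can be sketched directly from item scores using the standard formula based on item variances and the variance of the total score. The data (four respondents answering a three-item scale) are invented; SPSS would report the same quantity via its Reliability Analysis procedure.

```python
# Sketch: Cronbach's coefficient alpha from raw item scores.
# items holds one list of scores per scale item; data are invented.
from statistics import variance

def cronbach_alpha(items):
    k = len(items)
    item_var = sum(variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]   # per-respondent totals
    return k / (k - 1) * (1 - item_var / variance(totals))

items = [[3, 4, 4, 5],
         [2, 4, 5, 5],
         [3, 3, 4, 5]]
alpha = cronbach_alpha(items)
print(round(alpha, 2))       # values nearer 1 indicate greater reliability
```

With these invented scores alpha comes out above Nunnally's suggested minimum of .7, so the hypothetical scale would be judged internally consistent.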