SPSSnotes
PART I
INTRODUCTION
Background
Starting SPSS
Inputting Data
Defining Variables
Variable and Value Labels
Reviewing Variables
Entering Data
FILE MANAGEMENT
Saving an SPSS for Windows XP File
Backing Up Your Data
Retrieving Data Files
Reading An Excel File Into SPSS
INITIAL DATA CHECKING
Case Summaries
DESCRIPTIVE STATISTICS
Frequency Tables
Descriptives
Cross-tabulation
Three-way tables
EDITING AND MODIFYING THE DATASET
Inserting Data
Deleting A Case
Inserting A Variable
Deleting A Variable
Deleting An Entry In An Individual Cell
Moving A Variable
Manoeuvring Between Windows
Moving around Data Editor
PART II
CONSTRUCTING NEW VARIABLES
Computing a New Variable
Computing a New Variable by using built-in Functions
Computing Duration of Time Difference by built-in Functions
Recoding a value
Selecting a Subset of the Data
GRAPHS
Bar Charts
Histograms
Scatter Plots
Plotting a Regression Line on a Scatter Plot
STATISTICAL INFERENCE IN SPSS
Introduction
Categorical Variable
The Chi-squared test and Fisher's Exact test
CONTINUOUS OUTCOME MEASURES
Comparison of Means Using a t-test
LINEAR REGRESSIONS
Model Checking
NON-PARAMETRIC METHODS
COMPARISONS OF RELATED OR PAIRED VARIABLES
Continuous Outcome Measures
Analysis of Binary Outcomes that are Related
Related Ordinal Data
LOGISTIC REGRESSIONS
Model Checking
CREATING A SPSS SYNTAX
PART I
INTRODUCTION
Background
This handbook is designed to introduce SPSS for Windows XP. It assumes familiarity with Microsoft Windows and standard Windows-based office productivity software such as word processors
and spreadsheets.
SPSS for Windows XP is a popular and comprehensive data analysis package containing a
multitude of features designed to facilitate a wide range of statistical analyses. It
was developed for the analysis of data in the social sciences: SPSS stands for Statistical Package for
the Social Sciences. It is well suited to analysing data from surveys and databases.
This practical uses a set of data from a cross-sectional survey of respiratory function and dust levels
amongst foundry workers. The objective of the survey is to determine whether the dust levels
found in the foundries have any effect on respiratory function.
Starting SPSS
After logging on to Windows XP, the user will be presented with a screen containing a number of
different icons. First click on the icon ACCESS to Datafiles. This enables access to the files on
drive L: that are used in this session. Once this has completed, start SPSS by clicking the Start
button then selecting
All Programs
Programs Core
Statistics
SPSS 14.0
SPSS14.0
The SPSS 14.0 for Windows XP screen will then appear, titled Untitled SPSS Data
Editor (shown below). In the middle of the Data Editor screen you can see another window
asking the question What would you like to do?
If you choose the first option, Run the tutorial, the following screen is presented.
SPSS Version 14.0 05/07/2007
If you don't want to run the tutorial, just click the Cancel button and you will get the Data Editor screen.
If you choose the second option Type in data, you will get the data editor screen at which you can
input data into a new SPSS data file.
The third option Run an existing query and the fourth option Create new query using Database
Capture Wizard are not relevant to this course.
The final option Open an existing file
Each row of the table describes the attributes of one variable. Begin by entering a variable name in
the Name column. Variable names can be up to 64 characters long and must contain no spaces; it is
important to use something meaningful. It is best to stick to alphanumeric characters and start with
a letter. Once you have entered a name, SPSS defines the variable type as Numeric. You may need
to change the variable type, for example to String if you want to enter character data such as names, or to
Date if you want to enter dates. To do this, click on the cell corresponding to the Type column. A
little combo button will appear there. If you click that button you will get the following screen.
You will usually be working with Numeric, Date or String type of data. For Numeric variables
you may want to change the decimal places. If the data are integers (whole numbers) such as age in
complete years you could alter the decimal places to zero. If the numbers you are planning to enter
are very small (0.00072) or you require a high level of precision (21.7865) you may want to
increase the number of decimal places. Usually there is no need to change the width from 8. Note
that the width must be larger than the number of decimal places.
For date variables it is best to use a 4-digit year (dd.mm.yyyy).
With text strings you are given the option to change the number of characters.
Where possible you are strongly advised to use numerical coding rather than strings. This makes
statistical analysis easier. If you are entering string data that is longer than 8 characters, you will need to
increase the Width from the default of eight. To display the string fully in the Data View
window you may need to increase the number of columns in the Variable View window.
The Missing column in the Variable View window allows you to define which codes correspond to
missing values. You can have several codes, allowing you to distinguish between data that are missing
because the respondent forgot to answer and data that are missing because the question was not
applicable or the respondent refused to answer. For example, a code of 8 could indicate not applicable,
and a code of 9 could indicate that the respondent had
missed a question out. If a value is defined as a missing value code for a particular variable, subjects
with that code will be dropped from the analysis of that variable.
To set up missing value codes for a variable, click on the cell in the Missing column. Click on
Discrete missing values and enter the missing values for this variable in the boxes below (up to 3
can be entered). To complete the entry press OK.
The second is value label. This enables you to describe each of the values a variable may have.
These labels will be displayed on tables improving readability. For example Exposure group in the
following practical has 2 values Unexposed and Exposure to dust which have been coded as
0 and 1. The label option in the variable view window also allows you to define labels for
missing values.
To define a variable label, click on the cell in the Label column of the Data Editor screen and enter your
description of the variable.
To define Value Labels, click on the cell in the Values column and then click on the combo button.
Enter the code in the Value window and its associated label in the Value Label window, then press
Add. The added label will then appear in the window below.
Once you have entered all the value labels for a variable press OK.
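The idea behind value labels can be sketched outside SPSS. The short Python example below is an illustration only, not SPSS syntax: the dictionary GROUP_LABELS and the helper label_value are hypothetical names, and the codes and labels are those given above for Exposure group.

```python
# Hypothetical stand-in for the value labels of the variable "group":
# the data hold the codes 0 and 1, the labels are used only for display.
GROUP_LABELS = {0: "Unexposed", 1: "Exposed to dust"}


def label_value(code, labels):
    """Return the label for a coded value, or the code itself as text if unlabelled."""
    return labels.get(code, str(code))


codes = [1, 1, 0, 0]  # coded values as they would be typed into the data
labelled = [label_value(code, GROUP_LABELS) for code in codes]
print(labelled)
```

Storing the codes and attaching the labels separately keeps the data numeric while still producing readable tables.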
Exercise The table below lists the variables from the foundry study. Set up the following variables.

Variable  Description               Type     Width                    Column  Extras
Name      (Variable Label)
idno      Identification No         Numeric  4                        5
group     Exposure Group            Numeric  1                        6       Labels: 1 = Exposed to dust, 0 = Unexposed
age       Age at assessment         Numeric  2                        8
sex                                 Numeric  1                        4       Labels: 0 = female, 1 = male
ht        Height in cms             Numeric
fevmeas   Measured FEV              Numeric  3 with 2 decimal places
fevpred   Predicted FEV             Numeric  3 with 2 decimal places
fvcmeas   Measured FVC              Numeric  3 with 2 decimal places
fvcpred   Predicted FVC             Numeric  3 with 2 decimal places
asthma                              Numeric  1                                Labels: 0 = No, 1 = Yes, 2 = Don't Know
bron      Ever had Bronchitis       Numeric                                   Labels: 0 = No, 1 = Yes, 2 = Don't Know
smknow    Do you smoke now          Numeric                                   Labels: 1 = Yes, 0 = No
smkever   Have you ever smoked      Numeric                           8       Labels: 0 = No, 1 = Ex smoker, 2 = Current smoker
cigno     No of cigarettes per day  Numeric                                   Missing Value: -88
cigyrs    No of years smoked        Numeric                                   Missing Value: -88
empyrs    No of Years with company  Numeric                           8
respdust  Current exposure          Numeric  3 with 2 decimal places  8
Reviewing Variables
Once you have created all these variables, you can check they have been set up correctly. To do this,
click Utilities on the menu bar, then Variables, and choose the variable you require. The
following screen should appear.
Entering Data
When you finish creating all the variables you click Data View and you get the following screen
with all the variable names at the top of the spreadsheet.
You can now enter the data as you would in a spreadsheet. To make an entry in a particular cell on
the spreadsheet use the mouse to move the cursor to select that cell and type in the value. The value
will appear in the cell. Click on the mouse, press enter or use the cursor keys to enter that value.
If you attempt to enter data of the wrong type into a variable (for example text into a numeric
variable) the data will not be accepted. If incorrect data is entered, it can be overtyped or deleted.
Exercise The data below are from the foundry study for which you have just entered the variable codes. If you leave a gap in any cell in the worksheet,
SPSS will put a dot (.) and treat it as missing data. Once you have entered all the cases, it is useful to display the Value Labels of the coded values.
These are displayed by choosing the Value Labels button from the second row of options at the top of either the Data View or Variable View window.
Idno Ht
1001 Exp. 49 3.59 4.49 4.45 No No Yes Curr 20 31 23 1.71
1002 Exp. 46 3.39 3.91 4.12 Yes No Yes Curr 20 11 16 0.69
1003 Non 34 4.26 4.80 5.14 No No No Never 12 0.00
1004 Non 34 Male 4.25 4.57 5.12 No No Yes Curr 12 0.00
180 4.01 12 25 16
FILE MANAGEMENT
Saving an SPSS for Windows XP File
Once you have entered some data you will want to save it to disk. It is good practice to save data at
regular intervals during data entry just in case!
To save the data you have just entered, click on the File option at the top left corner of the screen
and then on the Save As... sub-option.
To save a copy of the current SPSS for Windows XP file on your floppy disk, click
on the arrow in the Save in window to generate a list of the drives.
Click on the up-arrow to move to the 3½ Floppy (A): drive, then move the cursor to the File name
window and enter a suitable name. By default SPSS will add the file extension .sav, which helps
to identify the file as an SPSS data file. Finally, click
on the Save button.
We can also open a data file when we start an SPSS session (see above).
Reading An Excel File Into SPSS
Often data may already be stored in another data format. SPSS can read many of these. For example
you can retrieve an Excel file into SPSS. If you put the variable names in the first row of your
spreadsheet, they can be copied as variable names in the SPSS file. Unlike StatsDirect, SPSS is only
able to read a single worksheet; it cannot read a complete workbook with several sheets. In order
that SPSS can read it, the Excel file needs to be saved in the version 4 format.
The data from the foundry study is saved in a spreadsheet L:\spssco~1\foundry. The names of the
variables have been entered in the first row. You may wish to check this by going to EXCEL. The
procedure for retrieving the data from EXCEL is similar to retrieving an SPSS data file. Click on
the File option at the top of the screen, then on the Open sub-option followed by the Data option so
that the screen above appears. At this point change the file type to Excel and Open the spreadsheet
named foundry. The following screen should appear.
Unless there is other data on the spreadsheet that we do not want to read, we need not specify a
range. As we want to read the variable names, click the Read variable names box and then
press OK. You will get an output window listing the variable names, types and their formats.
If you switch over to Data editor screen by clicking Window option on the menu bar or by using
the button on the status bar at the bottom of the screen, you will be able to see the variable names
and values in their proper columns. Now all the Foundry data has been read from the spreadsheet. If
we want to add variable labels and value labels we would need to go to variable view.
If you don't have variable names in the Excel file then, when retrieving it into an SPSS file, you should
not click the Read variable names box. Just press the OK button and you get the following screen.
You then have to define the variable names by clicking the Variable View as described above.
Having read data from an Excel spreadsheet it is important to check what has been read in.
For example if a column on the spreadsheet contained a mix of numeric and string data
(besides the variable name at the top) either one or the other may be set to missing.
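One way to anticipate this problem is to look for non-numeric entries in a column before reading it in. The following Python sketch is an illustration under assumptions: the column contents are invented, and split_numeric is a hypothetical helper, not an SPSS or Excel feature.

```python
def split_numeric(column):
    """Separate entries that parse as numbers from those that do not.

    Returns two lists of (row number, entry) pairs, counting rows from 1.
    """
    numeric, non_numeric = [], []
    for row, entry in enumerate(column, start=1):
        try:
            numeric.append((row, float(entry)))
        except (TypeError, ValueError):
            non_numeric.append((row, entry))
    return numeric, non_numeric


# Invented example: a cigarettes-per-day column in which one cell was typed as text.
cigno_column = ["20", "20", "none", "12"]
numeric, non_numeric = split_numeric(cigno_column)
print("Entries that would not be read as numbers:", non_numeric)
```

Any rows reported here are the ones to inspect in the Data Editor after the import.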
The facility allows you to look at a column or columns separately from the rest of the data.
Highlight the variables you want to display and use the arrow button to move them across.
Click on the Limit cases to first 100 box to clear it, so that all cases will
be displayed. The following output appears.
It is then easy to see any potential errors, e.g. if there was
"never" in ever smoked and "yes" in do you smoke now,
an error has been made. The left-hand column is
the case number.
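The same consistency check can be expressed as a simple rule. The Python sketch below is illustrative only: cases 1001 and 1003 use the smoking answers listed in the data entry exercise, while case 9999 is a hypothetical record invented to show an inconsistency.

```python
cases = [
    {"idno": 1001, "smkever": "Curr", "smknow": "Yes"},
    {"idno": 1003, "smkever": "Never", "smknow": "No"},
    {"idno": 9999, "smkever": "Never", "smknow": "Yes"},  # hypothetical error
]


def inconsistent(case):
    """A subject who has never smoked cannot be smoking now."""
    return case["smkever"] == "Never" and case["smknow"] == "Yes"


flagged = [case["idno"] for case in cases if inconsistent(case)]
print("Cases to check:", flagged)
```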
DESCRIPTIVE STATISTICS
The first step in data analysis is to generate descriptive statistics. This will give us a feel for the
data. It will also help us identify any inconsistencies that there may be in the data. This is
sometimes called data cleaning. Techniques that are commonly used to do this include:
Frequency Analyses
Descriptive Statistics
Cross-tabulations
Plots
Frequency Tables
A basic way to check for data errors is to carry out a frequencies analysis on variables. To do this,
click on Analyze on the menu bar, choose the Descriptive Statistics option and then choose Frequencies.
Move the variables of interest into the Variables box on the right-hand side, then click on Statistics
to select some summary statistics such as range, maximum, minimum, mean and median, which
will help you look for errors.
To select the variable to perform a frequency table, click on its name in the left hand list and then
press
                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Unexposed              63        46.3        46.3               46.3
       Exposure to Dust       73        53.7        53.7              100.0
       Total                 136       100.0       100.0
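The percentages in a frequency table follow directly from the counts. As a minimal sketch (Python, not SPSS output), the example below recomputes the Percent and Cumulative Percent columns from the two frequencies shown for Exposure Group; the variable names are illustrative.

```python
counts = {"Unexposed": 63, "Exposure to Dust": 73}

total = sum(counts.values())
rows = []
cumulative = 0
for label, n in counts.items():
    cumulative += n
    percent = round(100 * n / total, 1)
    cumulative_percent = round(100 * cumulative / total, 1)
    rows.append((label, n, percent, cumulative_percent))

for label, n, percent, cumulative_percent in rows:
    print(f"{label:<18}{n:>5}{percent:>8}{cumulative_percent:>8}")
print(f"{'Total':<18}{total:>5}")
```

With no missing values, Valid Percent is the same as Percent, which is why the two columns agree in the table.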
To return to the data editor click on Window and take the data editor option from the list.
With the frequency table you can have a list of summary statistics as well. To do that you click on
Analyze, Descriptive Statistics, Frequencies. Bring the variable (say, ht) to the Variable(s)
window then click on Statistics option to select some summary statistics. Click on Continue and
OK button. Frequency tables can be copied into word processing documents by clicking on the
table and selecting Edit then Copy. To place in the word processing document, use Edit and Paste.
Output from Frequencies with some summary statistics
Height in cms

        Value   Frequency   Percent   Valid Percent   Cumulative Percent
Valid   158          1         .7          .7                .7
        160          3        2.2         2.2               2.9
        162          1         .7          .7               3.7
        163          6        4.4         4.4               8.1
        165          7        5.1         5.1              13.2
        166          1         .7          .7              14.0
        167          5        3.7         3.7              17.6
        168         14       10.3        10.3              27.9
        170         19       14.0        14.0              41.9
        171          1         .7          .7              42.6
        172          8        5.9         5.9              48.5
        173          7        5.1         5.1              53.7
        174          1         .7          .7              54.4
        175         26       19.1        19.1              73.5
        177          7        5.1         5.1              78.7
        178          5        3.7         3.7              82.4
        180         12        8.8         8.8              91.2
        182          2        1.5         1.5              92.6
        183          2        1.5         1.5              94.1
        185          3        2.2         2.2              96.3
        190          4        2.9         2.9              99.3
        192          1         .7          .7             100.0
        Total      136      100.0       100.0

Statistics
Height in cms
N Valid                   136
N Missing                 0
Mean                      172.97
Std. Error of Mean        .567
Median                    173.00
Mode                      175
Std. Deviation            6.613
Variance                  43.732
Skewness                  .429
Std. Error of Skewness    .208
Kurtosis                  .393
Std. Error of Kurtosis    .413
Range                     34
Minimum                   158
Maximum                   192
Sum                       23524
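The summary statistics can be checked against the frequency table by hand or with a few lines of code. The Python sketch below is offered as a cross-check rather than as part of the SPSS procedure; it assumes the frequency of each height is the count implied by its percentage of the 136 cases.

```python
import math

# Height in cm -> number of subjects, as implied by the percentages of 136 cases.
height_counts = {
    158: 1, 160: 3, 162: 1, 163: 6, 165: 7, 166: 1, 167: 5, 168: 14,
    170: 19, 171: 1, 172: 8, 173: 7, 174: 1, 175: 26, 177: 7, 178: 5,
    180: 12, 182: 2, 183: 2, 185: 3, 190: 4, 192: 1,
}

n = sum(height_counts.values())
total = sum(height * count for height, count in height_counts.items())
mean = total / n

# Sample variance uses n - 1 in the denominator.
sum_sq = sum(count * (height - mean) ** 2 for height, count in height_counts.items())
variance = sum_sq / (n - 1)
std_dev = math.sqrt(variance)
se_mean = std_dev / math.sqrt(n)

mode = max(height_counts, key=height_counts.get)
value_range = max(height_counts) - min(height_counts)

print(n, total, round(mean, 2), round(variance, 3), round(std_dev, 3), round(se_mean, 3))
```

Agreement between these figures and the Statistics panel is a quick way to confirm that a table has been read correctly.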
Descriptives
The Descriptives command in SPSS is useful for summarizing quantitative data. To use this, click on
Analyze on the menu bar, choose the Descriptive Statistics option and then choose Descriptives. Move the
variables of interest into the Variables box on the right-hand side. As with the Frequencies
command we can obtain descriptive statistics for several variables at once. In the panel below we
have chosen some of the quantitative variables in the foundry data set.
For the mean number of cigarettes per day you may get a negative answer. If so, check that the missing value
code (-88) has been defined for the variable and rerun the analysis.
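The effect of an undeclared missing value code is easy to demonstrate. In the Python sketch below the five values are invented for illustration; only the code -88 comes from the variable definitions earlier in these notes.

```python
MISSING_CODE = -88

# Invented cigarettes-per-day values; one subject has the missing value code.
cigno = [20, 20, -88, 12, 30]

# Treating -88 as a real value drags the mean below zero.
naive_mean = sum(cigno) / len(cigno)

# Dropping the missing code first gives the mean of the valid values only.
valid = [value for value in cigno if value != MISSING_CODE]
valid_mean = sum(valid) / len(valid)

print("Mean including -88:", naive_mean)
print("Mean excluding -88:", valid_mean)
```

Declaring -88 in the Missing column makes SPSS perform the second calculation automatically.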
Cross-tabulation
To examine the relationship between two categorical variables, a two way Frequency Table can be
used. This is called a cross-tabulation. Click on Analyze then Descriptive Statistics and then
Crosstabs. The screen below appears. Suppose we wished to examine how smoking status related
to exposure. We could examine this by a cross-tabulation of the variables group and smkever.
Select the smoking status variable smkever labelled Have you ever smoked in the source list then
click
Select group labelled Exposure Group in the source list and click
The following result appears when the two-way frequency table has been completed.
Have you ever smoked * Exposure Group Crosstabulation
Count
                                      Exposure Group
                                 Unexposed   Exposure to Dust   Total
Have you       Never                 24             20            44
ever smoked    Ex Smoker             19             19            38
               Curr. Smoker          20             34            54
Total                                63             73           136
Two way frequency tables are more informative if they include percentages. To add percentages to
the table select Cells from the Crosstabs screen. On pressing Cells, the following screen appears.
Column, row, or total percentages can be selected by clicking the appropriate box. Whilst it is
tempting to click all three this can make the output confusing. For the table above column
percentages are the most useful as they will allow us to compare the smoking status of non-exposed
and exposed subjects. By clicking column we get the resulting table.
Have you ever smoked * Exposure Group Crosstabulation
                                                            Exposure Group
                                                       Unexposed   Exposure to Dust   Total
Have you       Never          Count                        24             20            44
ever smoked                   % within Exposure Group     38.1%          27.4%         32.4%
               Ex Smoker      Count                        19             19            38
                              % within Exposure Group     30.2%          26.0%         27.9%
               Curr. Smoker   Count                        20             34            54
                              % within Exposure Group     31.7%          46.6%         39.7%
Total                         Count                        63             73           136
                              % within Exposure Group    100.0%         100.0%        100.0%
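Column percentages are each cell count divided by its column total. As a minimal sketch in Python (not SPSS output), the example below reproduces the "% within Exposure Group" figures from the counts in the cross-tabulation; the dictionary layout is simply a convenient assumption.

```python
counts = {
    "Never": {"Unexposed": 24, "Exposure to Dust": 20},
    "Ex Smoker": {"Unexposed": 19, "Exposure to Dust": 19},
    "Curr. Smoker": {"Unexposed": 20, "Exposure to Dust": 34},
}

# Add up each column (exposure group) across the smoking categories.
column_totals = {}
for row in counts.values():
    for group, n in row.items():
        column_totals[group] = column_totals.get(group, 0) + n

# Express every cell as a percentage of its column total.
column_pct = {
    status: {group: round(100 * n / column_totals[group], 1) for group, n in row.items()}
    for status, row in counts.items()
}

for status, row in column_pct.items():
    print(status, row)
```

Because each column sums to 100%, the two exposure groups can be compared directly even though they differ in size.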
Three-way tables
You may need to do comparisons on three variables. To do this, choose Analyze then Descriptive
Statistics and then Crosstabs. The following screen then appears. To create a three-dimensional
table instead of a two-dimensional table, click on a variable and use the arrow button to move it
to the Layer 1 of 1 box.
If we add the variable sex we will now get separate tables for men and women, giving the following
output.
                                                               Exposure Group
SEX                                                       Unexposed   Exposure to Dust   Total
male     Do you smoke   No      Count                         21             13            34
         now                    % within Exposure Group      63.6%          43.3%         54.0%
                        Yes     Count                         12             17            29
                                % within Exposure Group      36.4%          56.7%         46.0%
         Total                  Count                         33             30            63
                                % within Exposure Group     100.0%         100.0%        100.0%
female   Do you smoke   No      Count                         22             26            48
         now                    % within Exposure Group      73.3%          60.5%         65.8%
                        Yes     Count                          8             17            25
                                % within Exposure Group      26.7%          39.5%         34.2%
         Total                  Count                         30             43            73
                                % within Exposure Group     100.0%         100.0%        100.0%
Inserting Data
You may have noticed that idno 1008 was missing.
To insert it, click on the row for idno 1009, then click on Data then Insert Case. A new blank row is
added immediately before idno 1009, as shown below.
You can insert the following case (idno 1008) in the blank line
Variable   Value     Variable   Value
Idno       1008      Asthma     0
Group      1         Bron       0
Sex        1         Smknow     1
Ht         180       Smkever    2
Fevmeas    4.01      Cigno      30
Fevpred    4.45      Cigsyrs    20
Fvcmeas    4.90      Empyrs     10
Fvcpred    5.30      Respdust   2.04
Deleting A Case
To delete a case, click on its number on the left of the Data Editor to highlight the row containing
the case. Press the Delete button (alternatively, click on the Edit option on the menu bar then click
on the Cut option) and the case is deleted and the cases below move up to fill the gap.
Exercise Delete case no 1008
Inserting A Variable
To insert a variable into the middle of the data, click on the variable after the position at which you
wish the variable to appear and then click on Data then Insert Variable. A blank column is
inserted before the selected variable shown here.
Deleting A Variable
To delete a variable, click on its name at the top of the Data Editor to highlight the column
containing the variable. Then press the Delete button. The variable is deleted and the variables to
the right move to the left to fill the gap. Now delete the variable you just created.
Moving A Variable
Insert a blank variable as mentioned above in the required position. Click on the name of the
variable to be moved (This highlights the column), Edit and Cut. Click on the name of the blank
variable and Edit then Paste.
Manoeuvring Between Windows
To manoeuvre between data editor and output screen, click on the Window option at menu bar and
from the drop down menu click on the required option (the active screen is ticked on).
Alternatively choose the window from the status bar at the bottom of the screen.
Description
First Variable
Last Variable
First Case
Last Case
First Value
Last Value
Allows you to select the variable to go to
Allows you to go to a specified case
Allows you to look for a specified value in a variable
PART II
CONSTRUCTING NEW VARIABLES
Sometimes we need to compute new variables from the data entered. For example in the foundry
data set we might want to compute the ratio of the measured to predicted fev. Alternatively we
might want to group ages into bands. SPSS has procedures to construct a new variable from
existing variables.
Computing a New Variable
For the foundry worker data we shall compute the variable fevratio defined as fevmeas/fevpred.
Click Transform then Compute and the following screen appears.
Enter the name fevratio in the Target Variable window. If the variable is new, click on Type & Label
to define the type and variable label. To build up the mathematical expression that will create the
new variable, choose variables from the left-hand box and click the arrow button to move them into the
Numeric Expression window. You can also use any of the keys on the calculator pad in the centre or
any of the functions from the built-in functions box. Select a function using the up and down arrows,
then click on the arrow button.
The functions on the calculator pad are defined as follows.

Operator   Mnemonic form   Description
+                          Addition
-                          Subtraction
*                          Multiplication
/                          Division
**                         Power Of
<          LT              Less Than
>          GT              Greater Than
<=         LE              Less Than Or Equal To
>=         GE              Greater Than Or Equal To
=          EQ              Equals
~=         NE              Not Equals
&          AND             Logical And
|          OR              Logical Or
~          NOT             Logical Not
()                         Parentheses
To compute fevratio we move fevmeas and fevpred into the Numeric Expression window. You
can also type a formula into the Numeric Expression window. This is illustrated below.
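The calculation itself is a simple division, with the result left missing whenever either input is missing. The Python sketch below illustrates that logic rather than reproducing SPSS; the function name ratio is hypothetical, and the values 4.01 and 4.45 are the measured and predicted FEV listed for idno 1008.

```python
def ratio(measured, predicted):
    """Return measured / predicted, or None (missing) if it cannot be computed."""
    if measured is None or predicted is None or predicted == 0:
        return None
    return measured / predicted


fevratio = ratio(4.01, 4.45)  # values listed for idno 1008
print(round(fevratio, 3))

# A missing input gives a missing result rather than a misleading number.
print(ratio(None, 4.45))
```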
Type a name, say lht, in the Target Variable window. Click on the arrow on the right of the
Functions box to scroll up and down through the functions. Select Arithmetic followed by the Ln
function in the Functions and Special Variables box for the natural log and click on the arrow button;
this will put the function with a ? in parentheses in the window named Numeric Expression. Then
select the variable to replace the ?, i.e. ht, click the arrow button and then press the OK button. A new
variable lht will be created (located at the end of the variable list). Having carried out a
transformation it is important to check the result. For example, taking the log of a negative value
creates a missing value. Other commonly used transformation functions are LG10, SQRT, ABS,
TRUNC etc.
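The behaviour described above, where a log of an invalid value becomes missing, can be sketched as follows. This is an illustrative Python example, not SPSS; ln_or_missing is a hypothetical helper and the list of heights includes an invented -88 to show the effect.

```python
import math


def ln_or_missing(value):
    """Natural log of value, or None (missing) when the log is undefined."""
    if value is None or value <= 0:
        return None
    return math.log(value)


heights = [180, 175, -88, None]  # the last two entries are invented problem values
lht = [ln_or_missing(h) for h in heights]
print(lht)
```

Counting the missing values before and after a transformation is a quick way to spot entries that could not be transformed.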
whole thing by 365 (number of days in quarterly leap year) to get howold in years. Below is the
example.
Whenever you compute a new variable from existing data it is important to check that what you
have created is sensible. You also need to check that missing values have not been converted into
non-missing values. Using the Data View tab, check the value of howold.
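As a rough illustration of a duration calculation of this kind, the Python sketch below converts the number of days between two dates into years. It is a sketch under assumptions: the two dates are invented, years_between is a hypothetical helper, and the divisor 365.25 (averaging one leap year in four) is an assumption, whereas the text above quotes 365.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: averages one leap year in every four


def years_between(start, end):
    """Elapsed time from start to end in years, or None if either date is missing."""
    if start is None or end is None:
        return None
    return (end - start).days / DAYS_PER_YEAR


# Invented dates: date of birth and date of assessment.
howold = years_between(date(1960, 1, 1), date(2000, 1, 1))
print(howold)

# A missing date stays missing instead of turning into a number.
print(years_between(None, date(2000, 1, 1)))
```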
Exercise Calculate each subject's duration of employment and compare it with the values of
employment years provided in the data set.
Recoding a value
To assist in data analyses you often need to group a continuous variable (e.g. age) into categories.
To do this select Transform then Recode. Two options are now given: Into Same Variables and
Into Different Variables.
The first option leads to potentially valuable information being overwritten. It is usually best
to use the second option as it is then possible to check whether the recode has worked
correctly by comparing the new and old versions.
Having chosen the second option the following screen will appear. First choose an input variable
from the list on the left-hand side then press the arrow button.
Then enter the name of the variable for the recoded data under Output Variable Name and press
Change.
Now press Old and New Values and the following screen appears.
Suppose we wish to recode age into bands <30, 30-39, 40-49, 50+
Click on Range Lowest Through and enter 29 into the box then click on value under New value
and enter 1 and finally press Add.
Click on Range then enter 30 and 39. Then click on New Value and enter 2 and finally press Add.
Click on Range then enter 40 and 49. Then click on New Value and enter 3 and finally press Add.
Finally click on Range Through highest enter 50 then click on New Value and enter 4 and finally
press Add.
Once you have specified all the OLD -> New recodes, click on Continue then OK on the Recode
into Different Variables screen. The following shows an example of setting up a recoded value.
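The recoding rules above can be written out as a small function, which makes it easy to compare the old and new values side by side. This is only a sketch in Python, not SPSS syntax: the function name recode_age and the example ages are hypothetical, and age is assumed to be recorded in whole years.

```python
def recode_age(age):
    """Map age in whole years to the band codes 1 to 4 described in the text."""
    if age is None:
        return None      # a missing value stays missing
    if age <= 29:
        return 1         # Lowest through 29
    if age <= 39:
        return 2         # 30 through 39
    if age <= 49:
        return 3         # 40 through 49
    return 4             # 50 through highest

# Hypothetical ages, including the boundary values and a missing value.
ages = [23, 29, 30, 45, 50, 67, None]
ageband = [recode_age(a) for a in ages]

# List old and new values together, as a case summary would.
for old, new in zip(ages, ageband):
    print(old, "->", new)
```

Including the boundary values (29, 30, 50) in the check is a quick way to confirm that no age falls between two bands.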
After recoding a variable it is usually advisable to run case summaries to compare the old and new
values.
To make the selection, click in the circle beside the If condition is satisfied option, then click
the If... button. The following panel will then appear (group = 1 has been entered in the box
provided to select the exposed cases).
Click on the Continue button at the bottom of the screen. Once you have returned to the main Select
Cases screen, click on the OK button. The effect of the above filter on the data is shown below.
Please note the / on the left hand side marking the records which have been excluded. To remove
the filter click on Data then Select Cases and select All cases.
Note In order to return to the complete data set for further analyses you need to return to the
Select Cases option and click the All cases button.
GRAPHS
SPSS will produce good quality high-resolution statistical graphics. We will look at Bar Charts,
Histograms, and Scatter Plots with regression lines.
Bar Charts
Bar Charts can only be produced for categorical variables e.g. Ever smoked, Asthma etc.
To produce a Bar Chart click Graphs, then Bar and the following screen appears.
Click on Simple and then Define and the next screen will appear. You then move your chosen
variable from the left hand list to the Categorical Axis and press OK.
[Figure: bar chart of Count (axis 0 to 60) by smoking status: Never, Ex Smoker, Curr. Smoker]
Histograms
Histograms are produced for interval variables e.g. age. To produce a histogram click on Graphs
then Histogram and the following screen appears.
Click on the required variable, in this case FEV, in the left hand side list and press the arrow
button, then press OK. If you require a normal curve to be drawn on the graph click on Display
normal curve.
[Figure: histogram of Measured FEV (axis 1.00 to 6.00) against Frequency. Mean = 3.7938,
Std. Dev. = 0.73936, N = 136]
Scatter Plots
Scatter plots show the joint behaviour of two interval variables. If you want to decide whether two
interval variables are related in any way you should first draw a scatter plot.
[Figure: scatter plot of fevratio (axis 0.40 to 1.40) against No of years smoked (axis 10 to 50),
with separate markers for Unexposed and Exposure to Dust]
[Figure: scatter plot of fevratio (axis 0.40 to 1.40) against No of years smoked (axis 10 to 50) by
Exposure Group, with a fit line for Unexposed and a fit line for Exposure to Dust. The two fit lines
are annotated R Sq Linear = 0.111 and R Sq Linear = 0.032]
The methods will be illustrated by the Foundry data set that was considered in Part I. The purpose
of this study was to examine whether dust increased respiratory morbidity. In this study the
measures of respiratory morbidity are Ever had asthma, Ever had bronchitis, Measured FEV and
Measured FVC. The variables Predicted FEV and Predicted FVC are the values that are
expected for a person's demographic characteristics including Age, Height and Sex. Exposure to
dust is measured by two variables: Exposed/Un-exposed, and dust levels recorded only for exposed
workers. Because smoking is a confounding factor in this study, smoking behaviour has been
recorded in terms of current smoking status (smknow), smoking history (smkever),
consumption (cigno) and duration of smoking (cigyrs).
During this part of the practical you may need to refer to the notes from Part I. If you are starting
the tutorial at this point rather than continuing from Part I, you will need to open the dataset
L:\SPSS\Course\foundry.sav.
Categorical Variable
In the first part of the study we examined whether there was any relationship between exposure to
dust and smoking. Using the cross-tabs procedure we can generate the following table.
Do you smoke now * Exposure Group Crosstabulation

                                                      Exposure Group
                                                 Unexposed   Exposure to Dust    Total
Do you smoke now   No    Count                       43             39             82
                         % within Exposure Group    68.3%          53.4%          60.3%
                   Yes   Count                       20             34             54
                         % within Exposure Group    31.7%          46.6%          39.7%
Total                    Count                       63             73            136
                         % within Exposure Group   100.0%         100.0%         100.0%
From the table above it can be seen that the percentage of workers who currently smoke is higher
for those exposed to dust than those who are not, 47% as compared to 32%.
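The percentages quoted above are column percentages, that is each count divided by the total for its exposure group. As a minimal sketch (in Python rather than SPSS, with variable names chosen only for this example) they can be reproduced from the four counts in the table:

```python
# Counts from the crosstabulation: rows are "Do you smoke now" (No, Yes),
# columns are the exposure groups.
counts = {
    "No":  {"Unexposed": 43, "Exposure to Dust": 39},
    "Yes": {"Unexposed": 20, "Exposure to Dust": 34},
}
groups = ["Unexposed", "Exposure to Dust"]

# Column totals are the denominators for "% within Exposure Group".
col_totals = {g: sum(counts[row][g] for row in counts) for g in groups}

col_percent = {
    row: {g: 100.0 * counts[row][g] / col_totals[g] for g in groups}
    for row in counts
}

for row in counts:
    for g in groups:
        print(f"smoke now = {row:3s} | {g:16s} | {col_percent[row][g]:.1f}%")
```

Choosing the column (exposure group) as the denominator is what allows the smoking rates of the two groups to be compared directly.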
We will now examine whether respiratory symptoms as measured by the variable asthma relate to
smoking. Using cross-tabs procedure again we obtain the following table.
Ever had Asthma * Do you smoke now Crosstabulation

                                                      Do you smoke now
                                                        No        Yes       Total
Ever had Asthma   No    Count                           77         48        125
                        % within Do you smoke now      93.9%      88.9%      91.9%
                  Yes   Count                            5          6         11
                        % within Do you smoke now       6.1%      11.1%       8.1%
Total                   Count                           82         54        136
                        % within Do you smoke now     100.0%     100.0%     100.0%
Chi-Square Tests

                                Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              1.101(b)   1    .294
Continuity Correction(a)         .530      1    .467
Likelihood Ratio                1.075      1    .300
Fisher's Exact Test                                           .344         .231
Linear-by-Linear Association    1.093      1    .296
N of Valid Cases                 136
The panel above gives the results of a chi-squared test of no association between asthma and
smoking. In interpreting this table one is concerned with the three columns headed Asymp. Sig. and
Exact Sig. These columns give the p-values for the significance test. Firstly it is usually
recommended that you consider a 2-sided rather than 1-sided test. As one of the cells has an
expected count less than or equal to 5, it is recommended that we take the Fisher's Exact Test
value as our result, that is 0.344. Assuming the conventional 0.05 significance level, this result is
considered non-significant. In reporting results of statistical tests you are strongly recommended to
give the p-value rather than just write "significant" or "non-significant". In reporting this we might
write "there was no evidence of an association between smoking and asthma (Fisher's Exact
p=0.344)". Had all the expected counts been greater than 5 and the table larger than 2 by 2, it is
suggested that you report the straightforward Chi-squared test p-value. If all the expected counts
are greater than 5 but the table is 2 by 2 then report the continuity correction p-value.
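To see where the figures in the chi-squared panel come from, the sketch below recomputes them for a 2 by 2 table in Python using only the standard library. It is an illustration, not the SPSS implementation: the function names are invented for this example, and the four cell counts (77, 48, 5, 6) are those implied by the totals and percentages of the asthma by smoking table.

```python
import math

def chi2_2x2(a, b, c, d, continuity=False):
    """Chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if continuity:
        diff = max(diff - n / 2, 0)          # Yates' continuity correction
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_df1(stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def min_expected_count(a, b, c, d):
    """Smallest expected cell count (row total * column total / n)."""
    n = a + b + c + d
    return min(r * k / n for r in (a + b, c + d) for k in (a + c, b + d))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value: the sum of the probabilities of all
    tables with the same margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# Asthma (rows: No, Yes) by current smoking (columns: No, Yes).
a, b, c, d = 77, 48, 5, 6

pearson = chi2_2x2(a, b, c, d)
corrected = chi2_2x2(a, b, c, d, continuity=True)
p_pearson = p_value_df1(pearson)
p_corrected = p_value_df1(corrected)
p_fisher = fisher_exact_two_sided(a, b, c, d)
min_expected = min_expected_count(a, b, c, d)

print(f"Pearson chi-squared    = {pearson:.3f}, p = {p_pearson:.3f}")
print(f"Continuity corrected   = {corrected:.3f}, p = {p_corrected:.3f}")
print(f"Fisher exact (2-sided) p = {p_fisher:.3f}")
print(f"Smallest expected count  = {min_expected:.2f}")
```

The smallest expected count is what decides which p-value to report under the rule given above; here it is below 5, so the Fisher value is the one to quote.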
Exercise Using the cross-tabs procedure examine whether there is a relationship between current
smoking status and bronchitis symptoms.
Are the expected numbers greater than 5 for all cells?
Fill in the spaces and delete as appropriate in the following statement:
Amongst those that currently smoked ___% had experienced symptoms of bronchitis whereas
___% of non-smokers experience such symptoms. This was statistically significant/non significant
at a 5% level using a two-tailed continuity corrected chi-squared test with p=______
Exercise Now use the cross-tabs procedure to examine the relationship between Exposure to dust
and symptoms of bronchitis and asthma. Record your conclusions below using either the continuity
corrected chi-squared or Fisher's exact test as appropriate.
We have found no statistically significant relationship between exposure to dust and either asthma
or bronchitis symptoms. For bronchitis symptoms you should have obtained the following tables.
Ever had Bronchitis * Exposure Group Crosstabulation

                                                         Exposure Group
                                                    Unexposed   Exposure to Dust    Total
Ever had Bronchitis   No    Count                       59             62            121
                            % within Exposure Group    93.7%          84.9%          89.0%
                      Yes   Count                        4             11             15
                            % within Exposure Group     6.3%          15.1%          11.0%
Total                       Count                       63             73            136
                            % within Exposure Group   100.0%         100.0%         100.0%
Chi-Square Tests

                                Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              2.620(b)   1    .106
Continuity Correction(a)        1.807      1    .179
Likelihood Ratio                2.735      1    .098
Fisher's Exact Test                                           .169         .088
Linear-by-Linear Association    2.601      1    .107
N of Valid Cases                 136
Whilst 15% (11/73) of the exposed workers had symptoms of bronchitis and only 6% (4/63) of the
non-exposed, this difference was not statistically significant at the 5% level (p=0.179). There are
several explanations for this. There may be no relationship between the exposure to dust and
respiratory disease. Alternatively, the study may have lacked statistical power to detect small
differences. It should be noted also that only 11% (15/136) of the sample reported such symptoms.
Select simple to get and transfer variable names in the usual way (see below).
[Figure: box plots of fevratio (axis 0.40 to 1.20) by Exposure Group (Unexposed, Exposure to Dust);
outlying cases are labelled 80, 65 and 54]
The box represents the inter-quartile range; the whiskers represent the range. The solid line in the
middle represents the median. This suggests that there is little difference between the dust exposed
and non-exposed workers. Other analysis options we might use to compare the lung function of
exposed and non-exposed workers are Explore under Descriptive Statistics and Means under
Compare Means.
Exercise Use the Explore and Means options to compare lung function of exposed with non-exposed
workers using fvcratio and fevratio. Record the results below.

               Mean    Standard Deviation    Median    Max    Min
Exposed
Non Exposed
The following panel (below left) then appears, into which we have selected fevrat as the test
variable and group as the variable defining the exposure.
Note the (? ?) marks beside the variable name group. Click on Define Groups to add the codes 0 and 1
for the two groups as shown (in the panel on the right).
The ability to select groups by choice of codes simplifies things when there are more than two
groups in the data set.
Clicking Continue then OK gives the results below. The first table summarises the data of the two
groups. The second presents two analyses. The first two columns of data give Levene's F-test of
equality of variance; an assumption of the t-test is that the values in each group have the same
variance. The remainder summarise a t-test for equal and unequal variances. For these data there is
no evidence that the variances differ, as p=0.734 for Levene's test. Therefore we take the first row
as the t-test result, although in this case it makes little difference. The result can be summarised
as "there was no evidence of a difference in FEV ratio between workers unexposed and exposed to dust
(mean diff=0.0155, 95% c.i. -0.032 to 0.063, p=0.519)".
Group Statistics

          Exposure Group      N     Mean     Std. Deviation   Std. Error Mean
FEVRAT    Unexposed           63    1.0158   .12785           .01611
          Exposure to Dust    73    1.0003   .14789           .01731

                                       Levene's Test                                                    95% Confidence Interval
                                                                  Sig.        Mean        Std. Error    of the Difference
                                       F       Sig.    t     df        (2-tailed)  Difference  Difference    Lower      Upper
FEVRAT   Equal variances assumed       .116    .734    .647  134       .519        .0155       .02390        -.03181    .06272
         Equal variances not assumed                   .654  133.999   .514        .0155       .02364        -.03131    .06222
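Both rows of the t-test output can be rebuilt from the Group Statistics alone, which is a useful check on how the two versions of the test differ. The Python sketch below is an illustration under stated assumptions: it works from the rounded summary values printed above (so the last digit can differ slightly from SPSS), and the critical value 1.978 for 134 degrees of freedom is taken from statistical tables because the Python standard library has no t distribution.

```python
import math

# Summary statistics copied from the Group Statistics table.
n1, mean1, sd1 = 63, 1.0158, 0.12785   # Unexposed
n2, mean2, sd2 = 73, 1.0003, 0.14789   # Exposure to Dust

diff = mean1 - mean2

# Equal variances assumed: pooled variance and its standard error.
df_pooled = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df_pooled
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = diff / se_pooled

# Equal variances not assumed (Welch): separate variances and
# Satterthwaite's approximate degrees of freedom.
v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
se_welch = math.sqrt(v1 + v2)
t_welch = diff / se_welch
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# ASSUMPTION: two-sided 5% critical value of t on 134 df, from tables.
t_crit = 1.978
ci = (diff - t_crit * se_pooled, diff + t_crit * se_pooled)

print(f"pooled: se = {se_pooled:.5f}, t = {t_pooled:.3f}, df = {df_pooled}")
print(f"welch : se = {se_welch:.5f}, t = {t_welch:.3f}, df = {df_welch:.3f}")
print(f"95% c.i. (pooled) = {ci[0]:.3f} to {ci[1]:.3f}")
```

Because the two standard deviations are so similar, the pooled and Welch standard errors almost coincide, which is why the choice of row makes little difference here.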
Exercise Compare mean FVC ratio for the exposed and non-exposed subjects using a t-test.
From the analyses there appears to be no evidence that exposure to dust affects respiratory function.
It may be argued nevertheless that being categorised as "exposed" or "not exposed" is a crude
assessment of exposure. Dust exposure has been recorded for subjects in the exposed group. We
will now carry out some analysis on just the exposed subjects. First we select these from the data.
This was shown in Part I of the tutorial. Under Data we choose Select Cases then If condition is
satisfied as shown below. We add the condition group=1 so that subsequent analysis will only be on
the dust exposed group.
The scatter plot below displays FEV ratio against dust exposure for subjects in the exposed group.
[Figure: scatter plot of fevratio (axis 0.40 to 1.40) against Current Exposure (axis 0.00 to 5.00)
for the Exposure to Dust group]
There is some suggestion from this that respiratory function may be reduced for those with higher
exposure.
LINEAR REGRESSIONS
To test this we will use linear regression to fit a straight line of the form Y=A + BX,
where Y is the dependent variable fevratio and X is the independent variable respdust. If the
gradient B is negative, this would indicate reduced respiratory function with increased dust. To do
this in SPSS, whilst keeping the selection of cases restricted to exposure group 1, go to Regression
then Linear as shown.
There are several tables of results generated by the linear regression option. The most useful of
these is the table of coefficients shown below.
The coefficients are the values of A and B in the equation of the line fevratio=A+B.respdust
Coefficients(a)

                                     Unstandardized           Standardized
                                     Coefficients             Coefficients
Model                                B        Std. Error      Beta             t         Sig.
1     (Constant)                     1.069    .041                             26.019    .000
      Current exposure to dust       -.057    .031            -.212            -1.830    .071
The coefficient for respiratory dust (respdust) is -0.057. The column labelled Sig. gives the p-value
for the statistical test that each regression coefficient differs from zero. This tells us that the
constant is significantly different from zero, which is not particularly interesting as we do not
expect the intercept of the line with the y-axis to be zero. It gives a p-value of 0.071 for the test
that the gradient differs from zero. There is some suggestion of a negative gradient, but this is not
significant at the conventional 5% significance level.
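The fitted line can be used directly to obtain predicted values, and each t statistic in the table is simply the coefficient divided by its standard error. The following Python sketch illustrates both points using the rounded values from the Coefficients table; the function name and the exposure levels are chosen only for illustration, and because the inputs are rounded the t values agree with SPSS only approximately.

```python
# Values copied from the Coefficients table.
A, se_A = 1.069, 0.041      # constant (intercept)
B, se_B = -0.057, 0.031     # gradient for respdust

def predicted_fevratio(respdust):
    """Predicted FEV ratio from the fitted line fevratio = A + B*respdust."""
    return A + B * respdust

# Each t statistic is the coefficient divided by its standard error.
t_A = A / se_A
t_B = B / se_B

for dust in (0.0, 1.0, 2.0, 4.0):     # illustrative exposure levels
    print(f"respdust = {dust:.1f} -> predicted fevratio = {predicted_fevratio(dust):.3f}")
print(f"t for constant = {t_A:.2f}, t for gradient = {t_B:.2f}")
```

Each extra unit of dust lowers the predicted FEV ratio by 0.057, which is the practical meaning of the gradient B.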
The Model Summary table reproduced below tells one how well the line fits the data. The result
for R² (written R Square) is 0.045. This is an estimate of the proportion of the variance explained
by the model. A line that fits the data perfectly will have an R² equal to 1, whereas a line that
does not explain anything in the data will have an R² of zero. A value of R² equal to 0.045 is
therefore not at all good: only 4.5% of the variation in the data is being explained.
Model Summary

Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .212(a)    .045        .032                 .14553
The conclusion that can be drawn from this is that whilst there is a slight suggestion of reduced
respiratory function with increased dust exposure, the evidence is weak.
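With a single predictor the quantities in the Model Summary and Coefficients tables are tied together: R Square is the square of the standardized coefficient Beta, and the t statistic for the gradient can be recovered from R and the sample size. The Python sketch below illustrates this. It assumes n = 73, the size of the exposed group, on the assumption that dust levels were recorded for every exposed worker, and it starts from the rounded Beta, so the results match the tables only to rounding.

```python
import math

n = 73            # ASSUMPTION: all 73 exposed workers are in the regression
k = 1             # number of predictors (respdust)
beta = -0.212     # standardized coefficient from the Coefficients table

# With one predictor, R equals |Beta| and R Square is its square.
r_square = beta ** 2

# Adjusted R Square penalises for the number of predictors.
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

# The t statistic for the gradient can be recovered from R and n.
t_from_r = beta * math.sqrt(n - 2) / math.sqrt(1 - r_square)

print(f"R Square          = {r_square:.3f}")
print(f"Adjusted R Square = {adj_r_square:.3f}")
print(f"t for gradient    = {t_from_r:.3f}")
```

That the recovered t statistic is close to the -1.830 in the Coefficients table supports the assumption about the sample size.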
Model Checking
The linear regression model described by the coefficients allows one to estimate a predicted value.
The difference between the observed value and the predicted value is called a residual. Where a
model fits badly the regression line will have large residuals. If we consider the scatter plot above
for FEV ratio compared to respiratory dust the residuals will be large. One of the assumptions of a
regression model is that the residuals will have a normal distribution. One way to check this
graphically is to use a normal probability plot. This compares the residuals against a normal
distribution. Such a plot can be obtained from linear regression in SPSS as shown.
Just select the normal probability plot option. The plot will then be added to the output when the
regression is re-run. If the residuals are normally distributed the plotted points lie on the diagonal
line. The plot below suggests that the residuals are approximately normally distributed. If the data
were skewed the points would bulge away from the line.
[Figure: Normal P-P Plot of Regression Standardized Residual; both axes run from 0.0 to 1.0]
Exercise Examine the relationship between FVC ratio and dust levels using the methods above.
NON-PARAMETRIC METHODS
Where data are not normally distributed, statistical analyses that assume a normal distribution may
be inappropriate. This is especially a concern where the sample size is small (<50 in total).
Variables that are discrete (take only integer values) or have an upper or lower limit are by
definition non-normal. Sometimes the distribution of the data is approximately normal so this is not
a problem, particularly where the sample size is large, but for some variables it may be
unreasonable to treat the data as normally distributed. To illustrate this we will compare the number
of cigarettes smoked by "exposed" and "non-exposed" workers who currently smoke.
Before you start this you will need to select the current smokers. To do this go to Data then Select
Cases and change the If condition to smknow=1 as shown.
The frequency table for cigs per day for current smokers is given below.
No of cigarettes per day

                  Frequency    Percent    Valid Percent    Cumulative Percent
Valid    3             2          3.7          3.7                 3.7
         5             1          1.9          1.9                 5.6
         6             1          1.9          1.9                 7.4
         10            3          5.6          5.6                13.0
         12            2          3.7          3.7                16.7
         15            6         11.1         11.1                27.8
         18            1          1.9          1.9                29.6
         20           23         42.6         42.6                72.2
         25            6         11.1         11.1                83.3
         30            7         13.0         13.0                96.3
         40            2          3.7          3.7               100.0
         Total        54        100.0        100.0
More than half the sample (30/54) give values of 20 or 30 cigs. per day. The variable is not even
approximately normally distributed.
Exercise Use the Explore option under Descriptive statistics to determine the median and interquartile range for No Cigs consumed for Exposed and Non-dust exposed workers.
Suppose we wanted to compare the median number of cigarettes smoked per day by smokers
according to dust exposure group. The method one uses is the Mann-Whitney U-test, which is
called a rank based non-parametric method. The analysis is based not on the raw data values but
on the ranks of the data. The procedure ranks the values of numbers of cigarettes smoked from
smallest to largest.
The Mann-Whitney U-test is carried out as follows. Under Analysis select Non-parametric to
give a choice of non-parametric procedures. As we are going to compare two groups the choice in
this case is then 2-Independent Groups. In this panel select Mann-Whitney U-test, with No cigs as
the test variable and Group as the grouping variable as shown.
                            Exposure Group      N     Mean Rank    Sum of Ranks
No of cigarettes per day    Unexposed           20    25.45        509.00
                            Exposure to Dust    34    28.71        976.00
                            Total               54

Test Statistics(a)

                            No of cigarettes per day
Mann-Whitney U              299.000
Wilcoxon W                  509.000
Z                           -.767
Asymp. Sig. (2-tailed)      .443
In the tables above note the mean rank for each group and the significance level. The mean rank is
slightly lower for the unexposed group but this is not statistically significant at a 5% significance
level. Hence, we conclude that there is no evidence of a difference between the median number of
cigarettes smoked by "exposed" and "non-exposed" workers. Before moving on to the next analysis we
need to select all subjects from the Data menu.
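The Mann-Whitney results can be traced back to the rank sums and to the frequency table for cigarettes per day, since the frequencies give the sizes of the groups of tied values. The Python sketch below is an illustration of the usual normal approximation with a correction for ties; it is not taken from the SPSS documentation, and the variable names are chosen for this example.

```python
import math

# Group sizes and rank sums from the ranks table.
n1, rank_sum1 = 20, 509.0    # Unexposed
n2, rank_sum2 = 34, 976.0    # Exposure to Dust
n = n1 + n2

# U statistic for each group; the smaller of the two is reported.
u1 = rank_sum1 - n1 * (n1 + 1) / 2
u2 = rank_sum2 - n2 * (n2 + 1) / 2
u = min(u1, u2)

# Sizes of the tied groups: the frequency of each distinct value
# (3, 5, 6, 10, 12, 15, 18, 20, 25, 30, 40) in the frequency table.
tie_sizes = [2, 1, 1, 3, 2, 6, 1, 23, 6, 7, 2]
tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))

# Normal approximation with the variance reduced to allow for ties.
mean_u = n1 * n2 / 2
var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
z = (u - mean_u) / math.sqrt(var_u)
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"mean ranks: {rank_sum1 / n1:.2f} and {rank_sum2 / n2:.2f}")
print(f"U = {u:.0f}, Z = {z:.3f}, two-sided p = {p_two_sided:.3f}")
```

With 23 of the 54 smokers reporting exactly 20 cigarettes a day the tie correction matters: without it the Z value would be noticeably closer to zero.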
To compare the mean measured FEV with mean predicted FEV we select a Paired samples T-test
in the Compare means submenu. This gives the panel below. Pairs of variables are selected by
highlighting the pair of variables in the window to the left then clicking on the select button to
transfer to the Paired Variable window as shown.
                           Mean      N      Std. Deviation    Std. Error Mean
Pair 1    Measured FEV     3.7938    136    .73936            .06340
          Predicted FEV    3.7552    136    .45619            .03912

                                            N      Correlation    Sig.
Pair 1    Measured FEV & Predicted FEV      136    .739           .000

                                                                            95% Confidence Interval
                                                     Std.         Std. Error    of the Difference                     Sig.
                                            Mean     Deviation    Mean          Lower       Upper      t       df     (2-tailed)
Pair 1    Measured FEV - Predicted FEV      .03860   .50632       .04342        -.04726     .12447     .889    135    .376
It is readily apparent that mean measured FEV is slightly greater than mean predicted FEV.
However, we report this as "Measured FEV was not significantly higher than predicted FEV
(mean diff=0.039, 95% c.i. -0.0473 to 0.1245, p=0.376)".
Exercise Compare the mean measured FVC with the mean predicted FVC.
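A paired t-test is a one-sample t-test on the within-subject differences, so its output can be rebuilt from the mean and standard deviation of the differences. The Python sketch below illustrates this for measured and predicted FEV using the values in the paired samples output; the critical value 1.978 for 135 degrees of freedom is an assumption taken from statistical tables, since the Python standard library has no t distribution.

```python
import math

n = 136
mean_diff = 0.03860     # mean of (measured FEV - predicted FEV)
sd_diff = 0.50632       # standard deviation of the differences

# Standard error of the mean difference and the t statistic.
se_diff = sd_diff / math.sqrt(n)
t_stat = mean_diff / se_diff
df = n - 1

# ASSUMPTION: two-sided 5% critical value of t on 135 df, from tables.
t_crit = 1.978
ci = (mean_diff - t_crit * se_diff, mean_diff + t_crit * se_diff)

print(f"se = {se_diff:.5f}, t = {t_stat:.3f}, df = {df}")
print(f"95% c.i. = {ci[0]:.4f} to {ci[1]:.4f}")
```

The confidence interval includes zero, which is another way of seeing that the difference is not significant at the 5% level.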
The above method of analysis compares the mean value for the two variables. It does not tell one
how close individual values are for the same subject. A visual way in which one can do this is with
a scatter plot of the two variables as shown below. We get a visual impression that FEV and FVC
are quite strongly correlated. By choosing the same numerical range for both axes we can see also
that the values for FVC are systematically larger than for FEV.
[Figure: scatter plot of Measured FVC against Measured FEV; both axes run from 1.00 to 8.00]
Ever had Bronchitis * Ever had Asthma Crosstabulation

                                                             Ever had Asthma
                                                             No        Yes       Total
Ever had Bronchitis   No    Count                            113         8        121
                            % within Ever had Bronchitis    93.4%      6.6%     100.0%
                            % within Ever had Asthma        90.4%     72.7%      89.0%
                      Yes   Count                             12         3         15
                            % within Ever had Bronchitis    80.0%     20.0%     100.0%
                            % within Ever had Asthma         9.6%     27.3%      11.0%
Total                       Count                            125        11        136
                            % within Ever had Bronchitis    91.9%      8.1%     100.0%
                            % within Ever had Asthma       100.0%    100.0%     100.0%
Careful examination of this table reveals that 11% (15/136) of workers reported bronchitis whilst
only 8% (11/136) had asthma. These two proportions can be compared using McNemar's test. This
is available under 2 Related Samples in the Non-parametric submenu. Select the pair of variables
in the same way as for a paired t-test and select the McNemar option.
This gives the following results.
Test Statistics(b)

                          Ever had Asthma & Ever had Bronchitis
N                         136
Exact Sig. (2-tailed)     .503(a)

a Binomial distribution used.
b McNemar Test
The p-value for the McNemar test is not significant (p=0.503) so we conclude that there is no
evidence that symptoms of bronchitis are more common in this population than symptoms of asthma.
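McNemar's test uses only the workers who reported one condition but not the other. Of the 11 workers with asthma, 3 also reported bronchitis, leaving 8 with asthma only; 12 reported bronchitis only. The exact p-value then comes from a binomial distribution, as the footnote to the table indicates. The Python sketch below is a minimal illustration of that calculation, with variable names chosen for this example.

```python
import math

# Discordant pairs: workers reporting exactly one of the two conditions.
asthma_only = 11 - 3       # asthma Yes, bronchitis No
bronchitis_only = 12       # bronchitis Yes, asthma No

n_discordant = asthma_only + bronchitis_only
k = min(asthma_only, bronchitis_only)

# Under the null hypothesis a discordant worker is equally likely to fall in
# either cell, so the smaller count follows Binomial(n_discordant, 0.5).
lower_tail = sum(math.comb(n_discordant, i) for i in range(k + 1)) / 2 ** n_discordant
p_exact = min(1.0, 2 * lower_tail)

print(f"discordant pairs: {asthma_only} and {bronchitis_only}")
print(f"exact two-sided p = {p_exact:.3f}")
```

The workers who reported both conditions or neither do not enter the calculation, because they carry no information about which condition is the more common.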
LOGISTIC REGRESSIONS
It is possible to apply regression techniques to a binary outcome, e.g. Ever had Asthma (Yes or No),
and test the effect of predictors on this outcome. We use logistic regression to fit a straight line
of the form Y=A + BX.
Here Y is the logit (the log of the odds) of the probability of success, i.e. the probability of a
Yes answer for the binary (0=No, 1=Yes) dependent variable Ever had Asthma, and X is the
independent variable respdust. Unlike before, the gradient B cannot be interpreted as in linear
regression; instead the coefficient can be transformed using the exponential function so that it can
be read as an odds ratio (we will explain how to interpret this later).
To do this in SPSS, go to Regression then Binary Logistic as shown.
If the variable you wish to assess is a categorical variable then click Categorical and transfer the
variable into the appropriate box; also click the radio button indicating the Reference Category as
the first. Then finally click OK.
There are several tables of results generated by the logistic regression option. The most useful of
these is the table of coefficients shown below.
The coefficients are the values of A and B in the equation of the line logit(asthma)=A+B.respdust
Variables in the Equation

                          B         S.E.     Wald      df    Sig.    Exp(B)
Step 1(a)    respdust     .892      .414     4.649     1     .031    2.439
             Constant     -3.190    .543     34.468    1     .000    .041
The coefficient for respiratory dust (respdust) is 0.892 and the odds ratio, given in the column
Exp(B), is 2.439. An odds ratio falls into one of three distinct groups: equal to 1, less than 1 and
greater than 1. An odds ratio of 1 indicates no change in the likelihood of having the event (asthma)
as the predictor (respdust) changes. Odds ratios greater than 1 indicate an increased likelihood of
asthma and an odds ratio less than 1 indicates a decreased likelihood of asthma. So in this case we
say that as respdust increases by one unit the odds of having asthma are multiplied by 2.439. If the
variable being tested was a categorical variable, say gender, then one of the categories would be
classed as the reference category and the odds ratio would refer to the difference between the two
groups. Say the odds ratio was 3.2 and males are the reference category; we would say that the odds
of developing asthma are approximately 3 times as high for females as for males.
The column labelled Sig. gives the p-value for the statistical test that the odds ratio differs from
one. This gives a p-value of 0.031 for the test that the odds ratio for respdust differs from one.
There is a suggestion of an increased likelihood of having asthma with increased dust exposure, and
this is significant at the conventional 5% significance level.
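The link between the coefficient, the odds ratio and the predicted probability can be made concrete with a few lines of Python. This is an illustrative sketch using the rounded values from the Variables in the Equation table; the function names and the dust levels are chosen for the example and are not part of the SPSS output.

```python
import math

# Values copied from the Variables in the Equation table.
A = -3.190      # constant
B = 0.892       # coefficient for respdust
se_B = 0.414    # standard error of B

# Exp(B) is the odds ratio; the Wald statistic is (B / S.E.) squared.
odds_ratio = math.exp(B)
wald = (B / se_B) ** 2

def prob_asthma(respdust):
    """Predicted probability of asthma from logit(p) = A + B*respdust."""
    log_odds = A + B * respdust
    return 1 / (1 + math.exp(-log_odds))

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

print(f"odds ratio = {odds_ratio:.3f}, Wald = {wald:.3f}")
for dust in (0.0, 1.0, 2.0):      # illustrative exposure levels
    p = prob_asthma(dust)
    print(f"respdust = {dust:.1f}: probability = {p:.3f}, odds = {odds(p):.3f}")
```

Moving from one dust level to the next multiplies the odds by Exp(B) every time, whereas the change in the probability itself depends on the starting level.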
Model Checking
In order to assess the logistic regression model's ability to represent the data we use a statistical
test called the Hosmer & Lemeshow test. It is based on grouping cases into 10 equally spaced groups
of risk and comparing the observed probability with the expected probability within each group. To
perform this in SPSS repeat the process as if you were performing the logistic regression, so
Analyse, Regression, Binary Logistic, and click the Options button to get the following panel.
Tick the box corresponding to the Hosmer-Lemeshow goodness of fit, then click Continue
followed by OK. In the same output that you received before the following table should appear.
Hosmer and Lemeshow Test

Step    Chi-square    df    Sig.
1       5.034         4     .284
In this case we have only included one variable (respdust); you may require several variables in
your model. The above table may therefore have several steps, in which case the last step will
always be the result for the final model. The model is deemed unsuitable if the p-value (in the Sig.
column) is less than 0.05; therefore in this case, as the p-value is 0.284, the model is deemed
to be adequate.
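The Sig. value in the Hosmer and Lemeshow table is the upper-tail probability of the chi-square statistic on the stated degrees of freedom. When the degrees of freedom are even this probability has a simple closed form, which the Python sketch below uses to reproduce the p-value. The function name is chosen for this illustration, and the function only handles even degrees of freedom.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-squared statistic for even df.

    For df = 2m the probability is exp(-x/2) * sum((x/2)**j / j!) for j < m.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

# Chi-square and df from the Hosmer and Lemeshow Test table.
p_value = chi2_upper_tail_even_df(5.034, 4)

print(f"p = {p_value:.3f}")
print("model deemed adequate" if p_value >= 0.05 else "model deemed unsuitable")
```

Note that the decision rule runs the opposite way to most tests in these notes: here a large p-value is the desired result.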
Exercise Examine the relationship between Bronchitis and dust levels using the methods above.
It makes it easier and quicker to rerun an analysis if we make changes to the raw data.
The screen shot below illustrates part of the syntax file for the analysis that we have done.
This looks complicated but we do not need to learn this because SPSS can do this for us using the
menus. You may have noticed a button labelled Paste on the interactive commands. We will illustrate
this using the t-test command. If we click Paste instead of running the command then the syntax is
pasted into a new file.
The first time in a session that you click Paste a new file is created. Using the same method as for
the t-test above you can add further commands to the syntax. It is possible to run the entire syntax
all at once or alternatively only specific commands.
To run the entire syntax click Run on the options bar followed by All. To run a specific command,
highlight the command in the main window and click Run followed by Selection. A syntax file can be
edited through copy and paste commands or alternatively through more detailed written commands,
described in the help file. Because the syntax file is a separate file from the SPSS data file, it
needs to be saved separately at the end of the session, using File and Save. At the start of a new
session, you can reopen an existing syntax file.
A syntax file that will give the appropriate output for the SPSS exercises within the notes is
located along with the data set on the L: drive and/or at the website address.
By clicking File, Open and Syntax as shown above the following screen will appear.
The syntax contains a set of commands and descriptions to produce the appropriate output
for each exercise (page 18 onwards) throughout the notes. The descriptions of the tests performed
can be found in the syntax script after COMMENT.
Please note the first comment regarding the retrieval of the dataset.
                           ... and Normal             Ordinal or Ordered          Binary and Unordered
                                                      Categorical                 Categories

Comparison of              Box-plot                   Box-plot or Cross-          Cross-tabulation
Independent Two            Independent groups         tabulation of ordered       Chi-squared test
Groups                     t-test                     categories
                                                      Mann-Whitney U-test

Comparison of more         Analysis of variance       Kruskal Wallis              Cross-tabulation
than two groups            (ANOVA)                    analysis of Variance*       Chi-squared test

Comparison of two                                     Wilcoxon Matched            McNemar's Test
related outcomes                                      Pairs

Relationship between       Scatter plot               Spearman correlation        Phi coefficient
a dependent variable       Regression                 or Kendall's                Logistic Regression
and independent            Pearson's correlation      correlation coefficient
variables                  coefficient

* Not illustrated