SPSS Notes
1. Basic definitions
Descriptive statistics are statistics that describe a variable's central tendency and dispersion.
Central tendency represents the values contained in the center of the data, such as the Mean, the
Median, the Mode, etc.
Mean = average value.
Median = middle value when data set is ordered from the smallest to the greatest value.
Mode = most occurring value.
Dispersion represents the distribution of the variable's responses.
Minimum = smallest value.
Maximum = greatest value.
Range = Maximum – Minimum.
Standard deviation (écart-type) = spread of data around the Mean, i.e. how far values tend to
deviate from the Mean. For example, if the Mean = 18 and the Std. Dev. = 3.8, then values
typically fall between 14.2 and 21.8 (Mean ± 1 Std. Dev.).
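The descriptive statistics defined above can be reproduced with Python's standard `statistics` module; a minimal sketch with made-up scores:

```python
# Descriptive statistics: central tendency and dispersion (data invented).
import statistics

scores = [12, 15, 18, 18, 20, 22, 25]

mean = statistics.mean(scores)          # average value
median = statistics.median(scores)      # middle value of the sorted data
mode = statistics.mode(scores)          # most frequently occurring value
data_range = max(scores) - min(scores)  # Maximum - Minimum
std_dev = statistics.stdev(scores)      # spread of data around the mean

print(mean, median, mode, data_range, round(std_dev, 2))
```

Note that `statistics.stdev` is the sample standard deviation (divides by n − 1), which matches what SPSS reports by default.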
Outliers = observations that are distant from the rest of the observations in a dataset. SPSS's
Extreme Values table lists the 5 lowest and the 5 highest cases as candidate outliers.
A Percentile is the value below which a given percentage of observations in a group of
observations falls. For example, the 25th percentile is the value below which 25% of the
observations may be found. If the Weighted Average of the 25th Percentile is 164cm, then 25% of
the population has a height below 164cm.
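The percentile idea can be checked with `statistics.quantiles`; in this sketch the heights are invented and chosen so the quartile cut points are exact:

```python
# Quartiles = 25th, 50th and 75th percentiles (data invented).
import statistics

heights = [160, 161, 162, 163, 164, 165, 166, 167, 168]

# method='inclusive' treats the data as the whole population of interest
q1, q2, q3 = statistics.quantiles(heights, n=4, method='inclusive')
print(q1, q2, q3)  # 25% of heights fall below q1, 50% below q2, 75% below q3
```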
A Confidence Interval has a Lower Limit (or Bound) and an Upper Limit. When a 95%
Confidence Interval runs from 164 (Lower) to 167 (Upper), this means the procedure that
produced it captures the true population Mean in 95% of repeated samples; loosely, I can be
95% confident that the population Mean lies between 164 and 167.
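A 95% confidence interval for a mean is built as mean ± critical value × standard error. A minimal sketch, using the large-sample z value 1.96 for simplicity (SPSS uses the exact t value for the sample's degrees of freedom; the heights are invented):

```python
# 95% CI for a mean: mean +/- 1.96 * standard error (approximation).
import math
import statistics

heights = [164, 166, 163, 165, 167, 166, 164, 168, 165, 166]
n = len(heights)

mean = statistics.mean(heights)
se = statistics.stdev(heights) / math.sqrt(n)  # standard error of the mean

lower = mean - 1.96 * se
upper = mean + 1.96 * se
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")
```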
2. Variable measures
Scale: values represent ordered categories with a meaningful metric, so that distance
comparisons between values are appropriate. Examples of scale variables include age
in years, income or test score. For example, in a classroom of 60 students, each one
would be given a test, and therefore Scale is used to determine the average score for
the class, or the highest and lowest score in the class, etc.
Nominal: values represent categories with no intrinsic ranking. Where do you live? 1-
Suburbs, 2- City, 3- Town. What is your gender? M- Male, F-Female.
Ordinal: values represent categories with some intrinsic ranking. How satisfied are
you with our services? Very Unsatisfied – 1, Unsatisfied – 2, Neutral – 3, Satisfied –
4, Very Satisfied – 5.
3. Transform tab
This tab helps me transform data into meaningful values.
Compute Variable acts like a calculator. In the Compute Variable window can be
found various functions within each function group. In the Statistical group are the
Mean, the Median, the Mode, etc. The resulting value will be a part of a new variable
that will appear in the SPSS table.
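The Compute Variable step can be mimicked outside SPSS. A rough Python analogue (the variable names and data here are invented): derive a new variable from existing ones, like MEAN(test1, test2, test3) per case.

```python
# Compute Variable analogue: a new column derived from existing columns.
import statistics

cases = [
    {"test1": 12, "test2": 15, "test3": 18},
    {"test1": 9,  "test2": 14, "test3": 16},
]

for case in cases:
    # the new variable appears alongside the originals, as in SPSS
    case["mean_score"] = statistics.mean(
        [case["test1"], case["test2"], case["test3"]]
    )

print(cases)
```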
Recode into Different Variables: it transforms an original variable into a new
variable, but the changes do not overwrite the original variable; they are instead
applied to a copy of the original variable under a new variable name. I can also go to
Old and New Variables to specify how I wish to recode the values for the selected
variable as a way to classify data.
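Recode into Different Variables can be sketched the same way: the original values stay untouched and the recoded values form a new variable. The cut-offs below are illustrative, not SPSS defaults.

```python
# Recode into Different Variables: old values -> new categories,
# stored in a NEW variable (the original list is not modified).
satisfaction = [1, 3, 5, 4, 2, 5]

def recode(value):
    if value <= 2:
        return "low"
    elif value == 3:
        return "neutral"
    return "high"

satisfaction_group = [recode(v) for v in satisfaction]  # new variable
print(satisfaction_group)
```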
4. Edit tab
This tab helps me easily navigate my data and make changes.
Go to Variable.
Go to Case.
Find and Replace: it’s used, for example, when I want to replace all the 5s in my
dataset with 6s instead.
5. Data tab
Sort Cases.
Sort Variables: I can put variables in an Ascending or a Descending order. The Data tab also
offers Transpose, which turns rows into columns and vice versa.
Select Cases: a window will open, from which I can do many things. For example, if I
want to focus only on the data regarding females, I can insert an “if condition” that
says “if value = 2” (2 = Female), and SPSS will select only the cases regarding females.
I can also select Random Sample of Cases and SPSS will choose cases randomly for
me (for purposes of sampling).
Split File: for example, I can separate the data of males and females and display the
resulting data as either separate groups or comparative groups.
7. Output: definitions
In a frequency table, the Frequency column reports the number of cases that fall into each
category of the variable being analyzed (the number of times each value occurs in the data).
The Percent column provides a percentage of the total cases that fall into each category.
The Valid Percent column is a percentage that does not include missing cases.
The Cumulative Percent column adds the percentages of each category from the top of the table
to the bottom, culminating in 100%. This is more useful when the variable of analysis is ranked
or ordinal, as it makes it easy to get a sense of what percentage of cases fall below each rank, aka
percentiles.
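The four columns can be computed by hand; a sketch with invented answers, where None stands in for a missing case:

```python
# Frequency-table columns: Frequency, Percent, Valid Percent,
# Cumulative Percent (data invented; None = missing case).
from collections import Counter

answers = ["Suburbs", "City", "City", "Town", None, "City"]
valid = [a for a in answers if a is not None]

counts = Counter(valid)
total, n_valid = len(answers), len(valid)

cumulative = 0.0
for category, freq in counts.most_common():
    percent = 100 * freq / total          # base includes missing cases
    valid_percent = 100 * freq / n_valid  # base excludes missing cases
    cumulative += valid_percent           # Cumulative Percent column
    print(category, freq, round(percent, 1), round(valid_percent, 1),
          round(cumulative, 1))
```

As in SPSS, the Cumulative Percent of the last row reaches 100%.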
8. Missing values
There are 2 types of missing values: system and user.
System missing values are values that are completely absent from the data; they are shown as
periods (.) in Data View and occur only in numeric variables. They arise when, for example, a
respondent skipped a question or a value wasn't recorded.
User missing values are values that the SPSS user specifically excludes. For categorical
variables, answers such as “don't know” or “no answer” are typically excluded from analysis. For
metric variables, unlikely values (a reaction time of 50ms or a monthly salary of € 9,999,999) are
usually set as user missing. These values remain visible in the data but are excluded from analyses.
9. Analyzing the normal distribution curve to write a report
11. Boxplot
A Boxplot shows the Median (the line inside the box), the Minimum and Maximum (the ends
of the whiskers), and the Quartiles. Each quarter of the data contains 25% of the cases:
the 1st quarter runs from the Minimum to the bottom edge of the box (25th Percentile, Q1),
the 2nd from that edge to the Median (50th Percentile, Q2),
the 3rd from the Median to the top edge of the box (75th Percentile, Q3),
and the 4th from that edge to the Maximum.
Interquartile Range (IQR) = the value of the 75th Percentile – the value of the 25th Percentile.
Because outliers are flagged relative to the IQR (typically points more than 1.5 × IQR beyond
the box edges), a compact Boxplot (small IQR) flags distant points as Outliers more readily
than a spread-out one.
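The IQR and the usual 1.5 × IQR outlier rule can be sketched directly (heights invented):

```python
# IQR = Q3 - Q1; points beyond the 1.5 * IQR "fences" count as outliers.
import statistics

data = [150, 160, 161, 162, 163, 164, 165, 166, 168, 195]

q1, _, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```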
16. T-test
A t-test is used to compare two means. To perform the operation: Analyze tab → Compare
Means. There are 3 types: one-sample, paired-samples, and independent-samples.
The one-sample t-test determines whether the sample mean is statistically different from a known
or hypothesized population mean (the “test value”). Example: I’d like to find the difference
between the average IQ score of my sample (105) and the average IQ score of the general
population (100).
In the output table, Mean Difference = variable mean – test value = 105 – 100 = 5; the 95%
Confidence Interval (Lower and Upper bounds) gives the range in which that difference would
fall for 95% of repeated samples.
df = degrees of freedom = N – 1 (as in 1 sample). The result represents the number of values
that are free to vary while keeping the same mean.
To interpret the Output table, we need to keep the t distribution table handy. We commonly use
alpha = 0.05 in the two-tailed test column.
First part of the interpretation: I check the t value and the df value in the output table, then go to
the corresponding df row in the t distribution table, and compare the value found there (the
critical value) to the t value.
Second: I compare the output Sig value to the chosen alpha level (0.01 or 0.05).
Third: I check whether my confidence interval crosses 0. If it doesn’t, the true difference is
unlikely to be 0 (a significant result); if it does, a difference of 0 remains plausible.
The means are significantly different when 1) the t value is bigger than the critical value, 2) the
Sig is smaller than the chosen alpha, and 3) the 95% CI doesn’t cross 0 (the three checks agree
with one another).
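The one-sample t statistic itself is simple to compute by hand: t = (sample mean − test value) / (s / √n), with df = n − 1. A sketch with invented IQ scores:

```python
# One-sample t-test statistic (data invented; test value = 100).
import math
import statistics

iq_scores = [98, 102, 110, 105, 99, 111, 104, 107, 103, 106]
test_value = 100                      # hypothesized population mean

n = len(iq_scores)
mean = statistics.mean(iq_scores)
se = statistics.stdev(iq_scores) / math.sqrt(n)  # standard error

t = (mean - test_value) / se
df = n - 1
print(f"mean difference = {mean - test_value:.1f}, t = {t:.2f}, df = {df}")
```

This t value is what gets compared against the critical value from the t distribution table at the chosen alpha.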
In the paired-samples t-test, I have 2 measurements (before and after) taken on the same sample.
Example: the IQ score of the same people before and after college; the reported difference is
After Mean – Before Mean.
A positive result means that After Mean is greater than Before Mean. A negative result means
that After Mean is smaller than Before Mean.
Paired measures: two related measurements taken on the same subjects, such as a score before
and after a treatment. (The two measures must be on the same scale to be meaningfully paired;
height and weight, for example, cannot be.)
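The paired-samples t-test boils down to a one-sample t-test on the per-person differences (after − before), with df = n − 1. A sketch with invented scores:

```python
# Paired-samples t-test = one-sample t-test on the differences.
import math
import statistics

before = [100, 98, 105, 103, 110, 97]
after  = [104, 101, 106, 107, 112, 99]

diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)

mean_diff = statistics.mean(diffs)   # After Mean - Before Mean
se = statistics.stdev(diffs) / math.sqrt(n)
t = mean_diff / se
print(f"mean difference = {mean_diff:.2f}, t = {t:.2f}, df = {n - 1}")
```

A positive mean difference means the After scores are higher on average, matching the sign rule described above.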
In the independent-samples t-test (which compares two separate groups), I first check Levene’s
test in the output table: if its Sig is greater than the alpha level (0.05), I read the
equal-variances-assumed row; if not, I read the equal-variances-not-assumed row. Here,
df = N – 2 (the total number of cases minus the 2 groups). Comparing more than two groups
calls for ANOVA rather than a t-test.
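The equal-variances-assumed ("pooled") version of the statistic can be sketched as follows, with df = n1 + n2 − 2 (the group data are invented; SPSS additionally runs Levene's test to choose between the two rows):

```python
# Independent-samples t-test, equal variances assumed (pooled variance).
import math
import statistics

group_a = [12, 15, 14, 10, 13, 16]
group_b = [18, 17, 20, 19, 16, 21]

n1, n2 = len(group_a), len(group_b)
var1, var2 = statistics.variance(group_a), statistics.variance(group_b)

# pooled variance: the two sample variances weighted by their df
pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se = math.sqrt(pooled * (1 / n1 + 1 / n2))

t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
df = n1 + n2 - 2
print(f"t = {t:.2f}, df = {df}")
```

A negative t simply means the first group's mean is smaller than the second's; the sign follows the order of subtraction.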