Problem Sheet Week 6
This week we are looking at associations between variables – does the value of x depend on the
value of y? We will look at this question for both categorical variables (like sex or ethnicity) and
interval variables (like height or exam scores).
Note – these exercises require you to use a computer spreadsheet program such as Excel or a free
online spreadsheet program like Google Docs. They are not optional. Write up your answers and
hand them in. Where you are asked to calculate quantities such as the standard deviation, in your
writeup you should give the equation you used and the calculated value. Where you are asked to
make graphs, please copy and paste them into your writeup, or if writing up by hand, print them out
and hand in with your writeup.
Note – in many cases you will need to repeat a calculation on lots of data, for example getting
(x − x̄) for every prisoner in Exercise 2, or every country in Exercise 3. You can do this using
formulae and functions in Excel.
If you did not learn to use formulae and functions in a spreadsheet at school, you need to learn
now, as this skill is needed for your practical classes and will also be needed in many jobs in the Real
World.
a) A researcher measures weight (in stone) and height (in inches) for men. She calculates the
correlation and covariance. She then decides to convert her data to metric units, kilograms
and centimetres. One kilogram is 0.157 stone and one centimetre is 0.39 inches. What will
happen to the correlation and covariance?
b) Spearman’s rank correlation coefficient can be used when the assumptions for Pearson’s
correlation are not met. For each of the two datasets shown, state why Pearson’s correlation
is unsuitable and explain briefly why correlating the ranks rather than the data themselves
solves the problem.
[3]
[Figure: scatter plots of the two datasets, panels i) and ii)]
In a classic study by Clark and Clark (1939), African-American children were shown one black doll and
one white doll and asked which one they wanted to play with. Out of 252 children, 169 chose the
white doll and 83 chose the black doll.
a) Use a z-test for proportions to determine whether more children chose the white doll than would be
expected due to chance.
In 1970, Hraba and Grant carried out a similar study. 89 African-American children were offered a
choice of 4 dolls (2 white, 2 black). 28 children chose a white doll and 61 chose a black doll.
b) Use a z-test for proportions to determine whether children were more likely to choose the black
doll in Hraba and Grant’s study than in Clark and Clark’s.
c) Construct a 2x2 contingency table for choice (black doll or white doll) by year (1939, 1970).
d) Carry out a chi-square test to see if there is a dependency between year and choice of doll. Use
the alpha level of 0.05. Set out your work carefully including hypotheses, working for the calculation
of the test statistic and how you determined the degrees of freedom.
e) What are the differences between the z-test approach and the chi-square approach?
f) Say I introduced a Chinese doll to the experiment, could I still use both tests?
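The two kinds of test named in this exercise can be sketched in Python as a check on spreadsheet working. This is a minimal sketch under stated assumptions, not the formula book's definitive method: the function names are hypothetical, the two-sample z-test assumes a pooled standard error, the chi-square statistic is computed without a continuity correction, and the counts are made-up toy values rather than the doll-study data.

```python
import math

def one_sample_z(successes, n, p0=0.5):
    """z statistic for an observed proportion against a null value p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    return (p_hat - p0) / se

def two_sample_z(x1, n1, x2, n2):
    """z statistic for the difference between two independent proportions.
    ASSUMPTION: the standard error uses the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of
    observed counts given as a list of rows.
    ASSUMPTION: no continuity correction is applied."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def upper_tail_p(z):
    """One-tailed p-value P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Made-up toy counts (NOT the doll-study data):
# study 1: 30 of 50 chose option A; study 2: 20 of 50 chose option A.
z1 = one_sample_z(30, 50)
z2 = two_sample_z(30, 50, 20, 50)
chi2, dof = chi_square([[30, 20], [20, 30]])
print("one-sample z:", z1, "one-tailed p:", upper_tail_p(z1))
print("two-sample z:", z2, "one-tailed p:", upper_tail_p(z2))
print("chi-square:", chi2, "degrees of freedom:", dof)
```

The chi-square function accepts tables with more than two rows or columns, whereas the z-test functions compare at most two proportions.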
Testing for association – a straight line relationship
Exercise 2: Correlation and covariance
In the lecture we heard about the data on heights of 3000 prisoners that Student used to verify the t
distribution. In fact, the researcher who collected the data (MacDonell 1902) also measured the
middle finger of each prisoner. Student derived the sampling distribution for Pearson’s correlation r
(which gives you the formula for testing the significance of a correlation in your formula book), and
tested it on these data.
https://docs.google.com/spreadsheets/d/1PplCGDXAvS1gh3BXvtr4Mb63fjdGXLVJVlfohB5Y79M/edit?usp=sharing
a) Plot a scatter plot of height against middle finger length. Notice anything odd about this?
Comment.
b) Calculate the covariance between the two measures. Include in your written report the formula
you used (from the formula book section 11).
To do this you will need to first work out the means x̄ and ȳ, by summing all the values in the
column of data and dividing by the number of values.
Then you can add a column containing (x − x̄) for each prisoner by entering a formula, and another
column containing (y − ȳ) for each prisoner in the same way. Add another new column in which
you multiply these together.
Finally you can use the SUM function to add up all the entries in the column as necessary according
to the formula in the formula book.
c) Work out the standard deviation for height and finger length, and use these to calculate the
correlation coefficient r (see formula book section 11).
d) Take a look at the axes of your scatter plot. What units do you think were used for height? What
about finger length?
e) Convert height and finger length to cm and recalculate the covariance and correlation. Only one of
these should change. Which one? Explain why with reference to the formula for covariance and
correlation.
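The column-by-column recipe in part b) and the calculation of r in part c) can be mirrored in a few lines of Python, which may help when checking spreadsheet results. This is only a sketch: the function names are hypothetical, the values are made-up toy numbers rather than the prisoner data, and the n − 1 denominator is an assumption, so check it against section 11 of the formula book.

```python
import math

def mean(values):
    """Sum of the column divided by the number of values."""
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance, built up column by column as in a spreadsheet."""
    x_bar, y_bar = mean(xs), mean(ys)
    dx = [x - x_bar for x in xs]                 # column of (x - x̄)
    dy = [y - y_bar for y in ys]                 # column of (y - ȳ)
    products = [a * b for a, b in zip(dx, dy)]   # column of products
    # ASSUMPTION: n - 1 denominator; the formula book may use n instead.
    return sum(products) / (len(xs) - 1)

def std_dev(values):
    """Sample standard deviation with the same n - 1 assumption."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(xs, ys):
    """Correlation as covariance divided by the product of the two SDs."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

# Made-up toy values, NOT the prisoner data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

print("covariance:", covariance(xs, ys))
print("standard deviations:", std_dev(xs), std_dev(ys))
print("Pearson's r:", pearson_r(xs, ys))
```

Each list comprehension corresponds to one spreadsheet column, so intermediate values can be compared against the sheet cell by cell.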
Exercise 3: More Correlation
The United Nations publishes an annual Human Development Report, gathering data on indices of
development such as health (life expectancy, infant mortality), education, poverty and inequality.
They aggregate data on life expectancy, years of education and income per capita to calculate a
Human Development Index and categorize countries as having very high, high, medium or low human development.
https://docs.google.com/spreadsheets/d/1bopmmS-7lqcsdNK0DwSBEg5Y2NOcK_2OJqkoQiyYHYc/edit?usp=sharing
a) Make a scatter plot comparing the proportion of males and females with secondary education.
Let x be the proportion of females with secondary education and y be the proportion of males with
secondary education.
b) Work out the covariance s_xy between the proportions of females and males with secondary education across all countries.
c) Work out the standard deviation of the proportion with secondary education for females, s_x, and for males, s_y.
d) What is the correlation coefficient, Pearson’s r, between the proportions of females and males with
secondary education over all countries?
f) Calculate the correlation for each HDI group separately and comment
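Computing a correlation separately for each HDI group amounts to splitting the rows by category and applying the same formula within each subset. The sketch below illustrates that idea only: the function names, group labels, and numbers are hypothetical toy values, not the UN data. The denominator (n or n − 1) cancels in r, so it does not appear here.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Pearson's r from sums of squared deviations and cross-products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_by_group(rows):
    """rows is a list of (group label, x, y); returns {label: r}."""
    groups = defaultdict(lambda: ([], []))
    for label, x, y in rows:
        groups[label][0].append(x)
        groups[label][1].append(y)
    return {label: pearson_r(xs, ys) for label, (xs, ys) in groups.items()}

# Made-up toy rows (group, x, y), NOT the Human Development Report data.
rows = [
    ("low", 10.0, 20.0), ("low", 20.0, 25.0), ("low", 30.0, 45.0),
    ("high", 80.0, 85.0), ("high", 90.0, 88.0), ("high", 95.0, 97.0),
]

for label, r in r_by_group(rows).items():
    print(label, r)
```

In a spreadsheet the equivalent step is to sort or filter the rows by HDI category and repeat the calculation on each block of rows.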
Testing for association – non-parametric correlation
Exercise 4 – Rank Correlation
https://docs.google.com/spreadsheets/d/1f4_2lPcoy_GTp1wyff5EchSI-MfDtBetqEqh6SXv_RE/edit?usp=sharing
This file contains data on the proportion of females having some secondary education, and the
fertility rate (number of children per woman) for many countries. It also contains rankings for each
variable.
It is proposed that when women are educated, the number of children per family is lower (and in
general, poverty is lowered as a consequence).
a) Make a scatter plot of fertility vs. percentage of women with secondary education. Think about
which variable should go on the x axis and which on the y axis, given the hypothesis above.
d) Make a scatter plot of the ranks and comment, in relation to your answer to c).
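Rank correlation can be sketched as Pearson's r applied to the ranks instead of the raw values. The code below is a minimal illustration under assumptions: the function names are hypothetical, tied values are given the average of the ranks they span (the spreadsheet's ranking columns may treat ties differently), and the numbers are made-up toy values, not the fertility data.

```python
import math

def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson's r from sums of squared deviations and cross-products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Made-up toy values: y falls steadily as x rises, but not in a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [100.0, 50.0, 20.0, 10.0, 9.0]

print("Pearson's r on raw values:", pearson_r(x, y))
print("Spearman's rho on ranks:", spearman_rho(x, y))
```

Comparing the two printed values on the toy data shows how a curved but consistently decreasing relationship looks different once the values are replaced by ranks.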
Bystander apathy is a phenomenon in which observers are less likely to help the victim of a crime or
accident when other observers are also present.
Consider an experiment in which participants think they are supposed to be memorizing pairs of
words. In one condition [NO BYSTANDERS] the participant is alone with the experimenter. In another
condition [4 BYSTANDERS], there are four other ‘participants’ present (in fact these are actors
pretending to be participants), who we shall call the bystanders. Part way through the memory task,
the experimenter pretends to be having a seizure. The real experimental question is whether the
participant leaves the room to seek assistance for the experimenter or not.
These are the number of participants seeking assistance, or not, in each condition:
Let A be the event that the participant seeks assistance for the experimenter. Let B be the event that
there are bystanders present.
a) Name two statistical tests that could be used to determine whether the proportion of
participants seeking assistance differed when bystanders were present vs. absent.
b) For each of the above tests, the null hypothesis is that the proportion of participants seeking
assistance is equal, whether or not bystanders are present, i.e. p(A|B) = p(A|B^c) = p_A. For one
test, there is a single possible alternative hypothesis. For the other test there are three
possible alternative hypotheses. State the possible alternative hypotheses for each test.
c) Carry out each of the tests named in part a. In each case clearly state your hypotheses, test
statistic, the critical value and your conclusion.
d) Say the number of subjects in the experiment was doubled and the results were as follows.
State, without calculation but giving a reason, whether you would expect the p-value for each
test to go up or down.
e) Covariance and correlation are measures of association for continuous variables. What is the
difference between covariance and correlation and when should you use each one?