STATS 731
Tutorial 9, 2019
Due by 23:59, Sunday May 26th
Question Stolen from the STATS 20X midterm!
Thanks to Andrew Balemi for this.
Comment from Brendon: Do not take this exercise as an endorsement of the study itself or the benefits or lack thereof of chess relative to alternative uses of one's time. Things are always more complicated than stats questions let on.
“The USA Junior Chess Olympics Research: Developing Memory and Verbal Reasoning” (New Horizons for Learning, April 2001; available at www.newhorizons.org) describes a study in which sixth-grade students who had not previously played chess participated in a program in which they took chess lessons and played chess daily for 9 months. Each student took a memory test (the Test of Cognitive Skills) before starting the chess program and again at the end of the 9-month period. The data below give the differences in scores after the chess program and before, for 12 students.
data = list(x=c(340, 180, 210, 100, 100, 225, 90,
225, 240, 55, -35, 5), N=12)
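As a quick sanity check on the data before any modelling (my own addition, not part of the handout), the sample mean and standard deviation can be computed directly:

```r
# Score differences for the 12 students
x = c(340, 180, 210, 100, 100, 225, 90,
      225, 240, 55, -35, 5)
mean(x)   # 144.5833
sd(x)     # about 109.74 (the N-1 version)
```

These match the t.test output that appears later in the solutions.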
Assumptions
Throughout this tutorial, use the following sampling distribution:
x_i | µ, σ ~ Normal(µ, σ²)    (1)

and let H_0 and H_1 be the propositions that µ = 0 and µ ≠ 0 respectively. When a prior is needed for σ, use

log σ ~ Normal(0, 5²),    (2)
which implies σ is likely to be within a few orders of magnitude of unity.
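As an illustration of that claim (not part of the original handout), simulating from prior (2) shows its spread on the log10 scale:

```r
# Draw sigma from the prior log(sigma) ~ Normal(0, 5^2) and look at its spread
set.seed(1)   # arbitrary seed, for reproducibility only
log_sigma = rnorm(100000, mean=0, sd=5)
sigma = exp(log_sigma)
quantile(log10(sigma), c(0.025, 0.5, 0.975))
# the central 95% of sigma spans roughly 10^(-4.3) to 10^(4.3)
```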
(i) Do a classical t-test using R (this should be very easy). Write down the p-value, and write down the notation for what this probability is (i.e., it's the probability of what conditional on what)¹.
(ii) Do Bayesian model comparison of H_0 vs. H_1. You can compute marginal likelihoods using either Nested Sampling or another method if you prefer. For H_1, use a Uniform(-1000000, 1000000) prior for µ. Calculate the posterior odds ratio assuming a prior odds ratio of 1, and write down the probability notation for what this quantity is.
(iii) The apparent contradiction between the results of (i) and (ii) is sometimes called the Jeffreys-Lindley paradox. Explain, in words and/or diagrams, why the conclusion of part (ii) is actually valid for a reasoner whose prior beliefs are really described by the Uniform(-1000000, 1000000) prior.
(iv) Harold Jeffreys had a suggestion for problems like this. The parameter σ sets the scale for the problem (“we're dealing with stuff around 100–300”), so conditional on σ, we can use σ to give us a suitable non-extreme prior for µ. He suggested using a Cauchy distribution because of the heavy tails. Re-do part (ii) using µ ~ Cauchy(0, σ) as the prior for µ given σ. Hint: For R coding purposes, a Cauchy distribution is a t distribution with df=1. Hint 2: Use κ = µ/σ as a parameter instead of µ, and then let µ = κσ.
¹ While a Bayesian can write down what calculation a frequentist has done, they still wouldn't agree on what it represents. For a frequentist, the p-value is an experimentally verifiable relative frequency. For a Bayesian, it's the degree of belief in a proposition in this particular case, imagining that you knew H_0 to be true. If the experimentally verifiable relative frequency exists, the Bayesian could talk about it too, but should give it another symbol, maybe f for frequency instead of p for probability. Even though the maths relating different fs to each other and the maths relating different ps to each other would be the same [i.e., the laws of probability theory], they would remain conceptually distinct things.
Solutions
(i) Here's my t-test. The p-value is quite low, indicating evidence against H_0. Looking at the data, which has only a single negative value, also suggests evidence against H_0.
> t.test(data$x)

        One Sample t-test

data:  data$x
t = 4.564, df = 11, p-value = 0.0008114
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  74.8575 214.3092
sample estimates:
mean of x
 144.5833
The p-value is the probability of a more extreme value of the test statistic if H_0 is true. The test statistic here is

t(x_1, ..., x_N) = x̄ / (s/√N)    (1)

where s is the sample standard deviation (the version with N − 1 in the denominator). The observed value of t was 4.564. The p-value is the probability of a more extreme value than this, under H_0:

P(t ≥ 4.564 or t ≤ −4.564 | H_0, σ)    (2)
I've conditioned on σ too because a frequentist is unable to marginalise it out. However, the value ultimately does not depend on σ (i.e., you'd get the same probability for any given value of σ).
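As a check on the t.test call, the t statistic and p-value can be reproduced by hand from the formula above:

```r
# Reproduce t.test's numbers by hand: t = xbar / (s / sqrt(N)),
# with a two-sided p-value from the t distribution with N-1 df
x = c(340, 180, 210, 100, 100, 225, 90, 225, 240, 55, -35, 5)
N = length(x)
t_obs = mean(x) / (sd(x) / sqrt(N))        # 4.564
p_value = 2 * pt(-abs(t_obs), df = N - 1)  # 0.0008114
```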
(ii) I'm going to use Nested Sampling here. For H_0, the required functions and variables for model.R are given below.
num_params = 1
widths = c(5)

log_prior = function(params)
{
    logp = 0
    logp = logp + dnorm(params["log_sigma"], 0, 5, log=TRUE)
    return(logp)
}

from_prior = function()
{
    params = rep(NA, 1)
    params[1] = 5*rnorm(1)
    names(params) = c("log_sigma")
    return(params)
}

log_likelihood = function(params)
{
    sigma = exp(params["log_sigma"])
    logl = sum(dnorm(data$x, mean=0, sd=sigma, log=TRUE))
    return(logl)
}
For H_1, the functions and variables are

num_params = 2
widths = c(5, 2000000)

log_prior = function(params)
{
    logp = 0
    logp = logp + dnorm(params["log_sigma"], 0, 5, log=TRUE)
    logp = logp + dunif(params["mu"], -1000000, 1000000, log=TRUE)
    return(logp)
}

from_prior = function()
{
    params = rep(NA, 2)
    params[1] = 5*rnorm(1)
    params[2] = runif(1, -1000000, 1000000)
    names(params) = c("log_sigma", "mu")
    return(params)
}

log_likelihood = function(params)
{
    sigma = exp(params["log_sigma"])
    logl = sum(dnorm(data$x, mean=params["mu"], sd=sigma, log=TRUE))
    return(logl)
}
For the runs I used num_particles=1000 because I wanted accuracy for these solutions, and I used depth=10.0 for H_0 and depth=20.0 for H_1. The results were:

H0 log marginal likelihood: ln(Z) = -83.14519 +- 0.05796811.
H1 log marginal likelihood: ln(Z) = -86.63857 +- 0.112621.
The posterior odds ratio in favour of H_0 is

P(H_0 | x_1, ..., x_N) / P(H_1 | x_1, ..., x_N)
    = [P(H_0) / P(H_1)] × [p(x_1, ..., x_N | H_0) / p(x_1, ..., x_N | H_1)]    (3)
    ≈ 1 × exp(−83.145) / exp(−86.639)    (4)
    ≈ 32.90.    (5)
The null hypothesis is now about 30 times more likely than the alternative.
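The arithmetic in (4)–(5) is just a difference of log marginal likelihoods; in R:

```r
# Posterior odds = prior odds (here 1) times the ratio of marginal likelihoods
lnZ0 = -83.14519
lnZ1 = -86.63857
odds = exp(lnZ0 - lnZ1)   # about 32.9 in favour of H0
```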
(iii) The uniform prior gives very low prior probability to the proposition that H_1 is true but µ is close to zero. For example, the prior probability that µ ∈ [−10, 10] is 10^-5. The reasoner with this prior can explain the data in two ways: either H_0 is true and the data happened to be a bit away from zero (quite unlikely), or H_1 is true but µ happens to be quite close to zero (even more unlikely). The reasoner has two bad explanations available and ends up favouring the less bad one.
Another way of understanding this is through a schematic diagram:

[Figure: two rectangles representing the H_0 and H_1 parameter spaces. Before conditioning on the data, each rectangle has half the probability. The data rules out red parts and rules in green parts (roughly). After conditioning on the data, H_0 is supported because less of its original 1/2 of the probability was ruled out by the data.]
Here's an analogous discrete problem. Suppose a hotel has 1000 rooms numbered in order, and one of them contains precious diamonds. You then discover (somehow) that if the room with the diamonds is in a room above 500, it must be in a room from 990–1000, whereas if it's in a room less than or equal to 500, it must be in room 1–200. This would lead you to favour the hypothesis that the room number with the diamonds is ≤ 500 as opposed to > 500, because that covers more possibilities. Note: If you think the 990–1000 information is 'suspiciously precise' and start to suspect the diamonds are there, your intuition is putting more information into the problem than I have, which might make you disagree with my conclusion. If that happened, forget this example (or look up the Sleeping Beauty problem and enjoy).
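The hotel analogy can be turned into a short calculation (my own illustration of the setup described above):

```r
# Uniform prior over 1000 rooms; the discovered information rules out all rooms
# except 1-200 (consistent with room <= 500) and 990-1000 (consistent with > 500)
rooms = 1:1000
consistent = (rooms <= 200) | (rooms >= 990)
posterior = consistent / sum(consistent)
sum(posterior[rooms <= 500])   # 200/211, about 0.948
```

The "room ≤ 500" hypothesis wins simply because it retains more of the surviving possibilities.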
(iv) Here is my code for the new model with the more informative prior, which implies that knowing σ would give us at least some rough information about µ.
num_params = 2
widths = c(5, 1)

log_prior = function(params)
{
    logp = 0
    logp = logp + dnorm(params["log_sigma"], 0, 5, log=TRUE)
    logp = logp + dt(params["kappa"], df=1, log=TRUE)
    return(logp)
}

from_prior = function()
{
    params = rep(NA, 2)
    params[1] = 5*rnorm(1)
    params[2] = rt(1, df=1)
    names(params) = c("log_sigma", "kappa")
    return(params)
}

log_likelihood = function(params)
{
    sigma = exp(params["log_sigma"])
    mu = params["kappa"]*sigma
    logl = sum(dnorm(data$x, mean=mu, sd=sigma, log=TRUE))
    return(logl)
}
The result was

Marginal likelihood: ln(Z) = -78.92839 +- 0.07054974.

so the Bayes Factor, and hence the posterior odds ratio, is now about 67.8 in favour of H_1, agreeing more with intuition.
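Using the quoted log marginal likelihoods, the odds come out as follows (any small difference from the figure quoted in the text is within the reported Monte Carlo error):

```r
lnZ1 = -78.92839   # H1 with the Cauchy(0, sigma) prior for mu
lnZ0 = -83.14519   # H0 result from part (ii)
exp(lnZ1 - lnZ0)   # about 68, now in favour of H1
```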