STATS 731

Tutorial 9, 2019

Due by 23:59, Sunday May 26th

Question — Stolen from the STATS 20X midterm!

Thanks to Andrew Balemi for this.

Comment from Brendon: Do not take this exercise as an endorsement of the study itself or the benefits or lack thereof of chess relative to alternative uses of one's time. Things are always more complicated than stats questions let on.

"The USA Junior Chess Olympics Research: Developing Memory and Verbal Reasoning" (New Horizons for Learning, April 2001; available at www.newhorizons.org) describes a study in which sixth-grade students who had not previously played chess participated in a program in which they took chess lessons and played chess daily for 9 months. Each student took a memory test (the Test of Cognitive Skills) before starting the chess program and again at the end of the 9-month period. The data below gives the differences in scores (after the chess program minus before) for 12 students.

data = list(x=c(340, 180, 210, 100, 100, 225, 90,
                225, 240, 55, -35, 5), N=12)

Assumptions

Throughout this tutorial, use the following sampling distribution:

    x_i | µ, σ ∼ Normal(µ, σ²)    (1)

and let H0 and H1 be the propositions that µ = 0 and µ ≠ 0 respectively. When a prior is needed for σ, use

    log σ ∼ Normal(0, 5²),    (2)

which implies σ is likely to be within a few orders of magnitude of unity.
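To see concretely what the log-normal prior (2) implies about the scale of σ, the central 95% prior interval can be computed by transforming the normal quantiles of log σ back through exp():

```r
# Central 95% prior interval for sigma under log(sigma) ~ Normal(0, 5^2):
# take the 2.5% and 97.5% quantiles of log(sigma), then exponentiate.
interval = exp(qnorm(c(0.025, 0.975), mean=0, sd=5))
print(interval)  # roughly 5.5e-05 to 1.8e+04
```

So σ is indeed likely to lie within a few orders of magnitude of 1, while still allowing scales like the hundreds seen in the data.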

Tasks

(i) Do a classical t-test using R (this should be very easy). Write down the p-value, and write down the notation for what this probability is (i.e., it's the probability of what conditional on what)[1].

(ii) Do Bayesian model comparison of H0 vs. H1. You can compute marginal likelihoods using either Nested Sampling or another method if you prefer. For H1, use a Uniform(-1000000, 1000000) prior for µ. Calculate the posterior odds ratio assuming a prior odds ratio of 1, and write down the probability notation for what this quantity is.

(iii) The apparent contradiction between the results of (i) and (ii) is sometimes called the Jeffreys-Lindley paradox. Explain, in words and/or diagrams, why the conclusion of part (ii) is actually valid for a reasoner whose prior beliefs are really described by the Uniform(-1000000, 1000000) prior.

(iv) Harold Jeffreys had a suggestion for problems like this. The parameter σ already sets a scale for the problem ("we're dealing with stuff around 100–300"), so conditional on σ, we can use σ to give us a suitable non-extreme prior for µ. He suggested using a Cauchy distribution because of the heavy tails. Re-do part (ii) using µ ∼ Cauchy(0, σ) as the prior for µ given σ.

Hint: For R coding purposes, a Cauchy distribution is a t distribution with df=1. Hint 2: Use κ = µ/σ as a parameter instead of µ, and then let µ ← κσ.

[1] While a Bayesian can write down what calculation a frequentist has done, they still wouldn't agree on what it represents. For a frequentist, the p-value is an experimentally verifiable relative frequency. For a Bayesian, it's the degree of belief in a proposition in this particular case, imagining that you knew H0 to be true. If the experimentally verifiable relative frequency exists, the Bayesian could talk about it too, but should give it another symbol, maybe f for frequency instead of p for probability. Even though the maths relating different f's to each other and the maths relating different p's to each other would be the same [i.e., the laws of probability theory], they would remain conceptually distinct things.


Solutions

(i) Here's my t-test. The p-value is quite low, indicating evidence against H0. Looking at the data, which has only a single negative value, also suggests evidence against H0.

> t.test(data$x)

One Sample t-test

data: data$x

t = 4.564, df = 11, p-value = 0.0008114

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

74.8575 214.3092

sample estimates:

mean of x

144.5833

The p-value is the probability of a more extreme value of the test statistic if H0 is true. The test statistic here is

    t(x_1, ..., x_N) = x̄ / (s/√N)    (1)

where s is the sample standard deviation (the version with N − 1 in the denominator). The observed value of t was 4.564. The p-value is the probability of a more extreme value than this, under H0:

    P(t ≥ 4.564 or t ≤ −4.564 | H0, σ)    (2)

I've conditioned on σ too because a frequentist is unable to marginalise it out. However, the value ultimately does not depend on σ (i.e., you'd get the same probability for any given value of σ).
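As a sanity check, the t statistic and p-value reported by t.test can be reproduced directly from formulas (1) and (2):

```r
# Difference scores from the question
x = c(340, 180, 210, 100, 100, 225, 90, 225, 240, 55, -35, 5)
N = length(x)

# Test statistic t = xbar/(s/sqrt(N)); R's sd() uses the N-1 denominator
t_obs = mean(x) / (sd(x) / sqrt(N))

# Two-sided p-value: probability of a value at least as extreme under H0,
# using the t distribution with N-1 degrees of freedom
p_value = 2 * pt(abs(t_obs), df = N - 1, lower.tail = FALSE)

print(t_obs)    # about 4.564
print(p_value)  # about 0.0008114
```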

(ii) I'm going to use Nested Sampling here. For H0, the required functions and variables for model.R are given below.

num_params = 1
widths = c(5)

log_prior = function(params)
{
    logp = 0
    logp = logp + dnorm(params["log_sigma"], 0, 5, log=TRUE)
    return(logp)
}

from_prior = function()
{
    params = rep(NA, 1)
    params[1] = 5*rnorm(1)
    names(params) = c("log_sigma")
    return(params)
}

log_likelihood = function(params)
{
    sigma = exp(params["log_sigma"])
    logl = sum(dnorm(data$x, mean=0, sd=sigma, log=TRUE))
    return(logl)
}

For H1, the functions and variables are

num_params = 2
widths = c(5, 2000000)

log_prior = function(params)
{
    logp = 0
    logp = logp + dnorm(params["log_sigma"], 0, 5, log=TRUE)
    logp = logp + dunif(params["mu"], -1000000, 1000000, log=TRUE)
    return(logp)
}

from_prior = function()
{
    params = rep(NA, 2)   # two parameters now
    params[1] = 5*rnorm(1)
    params[2] = runif(1, -1000000, 1000000)
    names(params) = c("log_sigma", "mu")
    return(params)
}

log_likelihood = function(params)
{
    sigma = exp(params["log_sigma"])
    logl = sum(dnorm(data$x, mean=params["mu"], sd=sigma, log=TRUE))
    return(logl)
}

For the runs I used num_particles=1000 because I wanted accuracy for these solutions, and I used depth=10.0 for H0 and depth=20.0 for H1. The results were:

H0 log marginal likelihood: ln(Z) = -83.14519 +- 0.05796811.
H1 log marginal likelihood: ln(Z) = -86.63857 +- 0.112621.

The posterior odds ratio in favour of H0 is

    P(H0 | x_1, ..., x_N) / P(H1 | x_1, ..., x_N)
        = [P(H0)/P(H1)] × [p(x_1, ..., x_N | H0) / p(x_1, ..., x_N | H1)]    (3)
        ≈ 1 × exp(−83.145)/exp(−86.639)    (4)
        ≈ 32.90.    (5)

The null hypothesis is now about 30 times more likely than the alternative.
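The arithmetic in (4)–(5) is just a ratio of exponentiated log marginal likelihoods; subtracting on the log scale before exponentiating avoids underflow:

```r
# Log marginal likelihoods from the Nested Sampling runs above
logZ0 = -83.14519
logZ1 = -86.63857

# Posterior odds = prior odds * Bayes factor, with prior odds of 1.
# Subtract the log evidences first, then exponentiate.
posterior_odds = 1 * exp(logZ0 - logZ1)
print(posterior_odds)  # about 32.9
```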

(iii)

The uniform prior gives very low prior probability to the proposition that

H

1

is true but

µ

is close to zero. For example, the prior probability that

µ ∈

[

−

10

,

10] is 10

−5

. The

reasoner with this prior can explain the data in two ways: either

H

0

is true and the data

happened to be a bit away from zero (quite unlikely), or

H

1

is true but

µ

happens to

be quite close to zero (

even more unlikely

). The reasoner has two bad explanations

available and ends up favouring the less bad one.
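The 10^−5 figure comes straight from the width of the interval relative to the width of the uniform prior:

```r
# Prior probability that mu lies in [-10, 10] under Uniform(-1000000, 1000000):
# the interval's width divided by the prior's total width.
prior_prob = (10 - (-10)) / (1000000 - (-1000000))
print(prior_prob)  # 1e-05
```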

Another way of understanding this is through a schematic diagram:

[Schematic diagram: the H0 and H1 parameter spaces are drawn as two rectangles. Before conditioning on the data, each rectangle has half the probability. The data rules out red parts and rules in green parts (roughly). After conditioning on the data, H0 is supported because less of its original 1/2 of the probability was ruled out by the data.]

Here's an analogous discrete problem. Suppose a hotel has 1000 rooms numbered in order, and one of them contains precious diamonds. You then discover (somehow) that if the room with the diamonds is above 500, it must be a room from 990–1000, whereas if it's less than or equal to 500, it must be a room from 1–200. This would lead you to favour the hypothesis that the room number with the diamonds is ≤ 500 as opposed to > 500, because that covers more possibilities. Note: If you think the 990–1000 information is 'suspiciously precise' and start to suspect the diamonds are there, your intuition is putting more information into the problem than I have, which might make you disagree with my conclusion. If that happened, forget this example (or look up the Sleeping Beauty problem and enjoy).
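Assuming a uniform prior over the 1000 rooms, the analogy can be made quantitative by counting the rooms that survive the information:

```r
# Uniform prior over rooms 1..1000. The information rules out every room
# except 1-200 (consistent with "room <= 500") and 990-1000 (consistent
# with "room > 500"; that range contains 11 rooms inclusive).
rooms_low  = 200
rooms_high = 11

# Posterior probability that the diamonds are in a room <= 500
posterior_low = rooms_low / (rooms_low + rooms_high)
print(posterior_low)  # about 0.948
```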


(iv) Here is my code for the new model with the more informative prior, which implies that knowing σ would give us at least some rough information about µ.

num_params = 2
widths = c(5, 1)

log_prior = function(params)
{
    logp = 0
    logp = logp + dnorm(params["log_sigma"], 0, 5, log=TRUE)
    logp = logp + dt(params["kappa"], df=1, log=TRUE)
    return(logp)
}

from_prior = function()
{
    params = rep(NA, 2)   # two parameters now
    params[1] = 5*rnorm(1)
    params[2] = rt(1, df=1)
    names(params) = c("log_sigma", "kappa")
    return(params)
}

log_likelihood = function(params)
{
    sigma = exp(params["log_sigma"])
    mu = params["kappa"]*sigma
    logl = sum(dnorm(data$x, mean=mu, sd=sigma, log=TRUE))
    return(logl)
}
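A quick sanity check on the two hints: a t distribution with df=1 scaled by σ has exactly the Cauchy(0, σ) density, because the change of variables µ = κσ contributes a 1/σ Jacobian factor. Here σ = 100 and the evaluation point are arbitrary choices for the check:

```r
sigma = 100   # an arbitrary scale for the check
mu = 50       # an arbitrary point at which to compare the densities

# Cauchy(0, sigma) density evaluated directly
direct = dcauchy(mu, location=0, scale=sigma)

# Same density via kappa = mu/sigma: standard t density with df=1,
# divided by the Jacobian factor sigma from mu = kappa*sigma
via_t = dt(mu/sigma, df=1) / sigma

print(direct)
print(via_t)   # matches `direct`
```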

The result was

Marginal likelihood: ln(Z) = -78.92839 +- 0.07054974.

so the Bayes factor, and hence the posterior odds ratio, is now exp(−78.928 − (−83.145)) ≈ 67.8 in favour of H1, agreeing more with intuition.
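As in part (ii), this number follows from the two quoted log marginal likelihoods:

```r
# Log marginal likelihoods: Cauchy-prior H1 (part iv) vs H0 (part ii)
logZ_cauchy = -78.92839
logZ0 = -83.14519

# Bayes factor in favour of the Cauchy-prior H1, computed on the log scale
bayes_factor = exp(logZ_cauchy - logZ0)
print(bayes_factor)  # about 67.8
```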
