given N = 5, and then update again to take into account x as well as N. However, the first update would be a bit weird. Why would knowing the number of trials tell you anything about the success probability? Effectively, what we have done in our analysis is assume that N = 5 is prior information that lurks in the background the whole time. Therefore our uniform prior for θ already “knows” that N = 5, so we didn’t have to consider P(N = 5|θ) in the likelihood. This subtlety usually doesn’t matter much.
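One way to see why it usually doesn’t matter: if N tells us nothing about θ, then P(N = 5|θ) is a constant in θ, and any constant factor in the likelihood cancels when the posterior is normalised. A quick Python sketch (the data values x = 2, N = 5 and the constant 0.37 are illustrative placeholders, not taken from the text):

```python
import numpy as np

theta = np.arange(0, 1.1, 0.1)      # the 11 possible theta values
prior = np.ones(11) / 11            # uniform prior

# Illustrative data: x successes out of N = 5 trials.
# (x = 2 is a placeholder choice, not fixed by this excerpt.)
x, N = 2, 5
lik = theta**x * (1 - theta)**(N - x)

# Posterior leaving out the constant factor P(N = 5 | theta):
post1 = prior * lik
post1 /= post1.sum()

# Posterior with an arbitrary constant factor included
# (0.37 stands in for P(N = 5), which doesn't depend on theta):
post2 = prior * 0.37 * lik
post2 /= post2.sum()

print(np.allclose(post1, post2))    # True: the constant cancels
```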
4.2 Prediction in the Bus Problem
We have now seen how to use information (data) to update from a prior distribution to a
posterior distribution when the set of possible parameter values is discrete. The posterior
distribution is the complete answer to the problem. It tells us exactly how strongly we
should believe in the various possible solutions (possible values for the unknown parameter).
However, there are other things we might want to do with this information. Predicting
the future is one! It’s fun, but risky. Here we will look at how prediction is done using the
Bayesian framework, continuing with the bus example. To be concrete, we are interested
in the following question: what is the probability that I will catch the right bus tomorrow?
This is like trying to predict the result of a future experiment.
In the Bayesian framework, our predictions are always in the form of probabilities or
(later) probability distributions. They are usually calculated in three stages. First, you
pretend you actually know the true value of the parameters, and calculate the probability
based on that assumption. Then, you do this for all possible values of the parameter
θ
(alternatively, you can calculate the probability as a function of
θ
). Finally, you combine
all of these probabilities in a particular way to get one final probability which tells you
how confident you are of your prediction.
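As a minimal sketch of this three-stage recipe in Python (the helper name posterior_predict and the uniform placeholder weights are illustrative assumptions, not from the text):

```python
import numpy as np

def posterior_predict(theta_values, posterior, prob_event_given_theta):
    # Stages 1 and 2: the probability of the future event,
    # computed as if each candidate theta were the true value.
    per_theta = prob_event_given_theta(theta_values)
    # Stage 3: weight each by its posterior probability and sum.
    return np.sum(posterior * per_theta)

# For the bus problem, P(good bus tomorrow | theta) = theta (see below),
# so the per-theta probability function is the identity.
theta = np.arange(0, 1.1, 0.1)
posterior = np.full(11, 1 / 11)     # placeholder posterior weights
print(posterior_predict(theta, posterior, lambda t: t))
```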
Suppose we knew the true value of θ was 0.3. Then, we would know the probability of catching the right bus tomorrow is 0.3. If we knew the true value of θ was 0.4, we would say the probability of catching the right bus tomorrow is 0.4. The problem is, we don’t
know what the true value is. We only have the posterior distribution. Luckily, the sum
rule of probability (combined with the product rule) can help us out. We are interested in
whether I will get the good bus tomorrow. There are 11 different ways that can happen.
Either θ = 0 and I get the good bus, or θ = 0.1 and I get the good bus, or θ = 0.2 and I get the good bus, and so on. These 11 ways are all mutually exclusive. That is, only one of them can be true (since θ is actually just a single number). Mathematically, we can
obtain the posterior probability of catching the good bus tomorrow using the sum rule:
\begin{align}
P(\text{good bus tomorrow}\,|\,x) &= \sum_{\theta} p(\theta|x)\, P(\text{good bus tomorrow}\,|\,\theta, x) \tag{4.4} \\
&= \sum_{\theta} p(\theta|x)\, \theta \tag{4.5}
\end{align}
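In code, equation (4.5) is a single weighted sum over the Bayes’ Box. A sketch using the same illustrative x = 2, N = 5 data as above (placeholders, since the actual posterior values are not shown in this excerpt):

```python
import numpy as np

theta = np.arange(0, 1.1, 0.1)            # the 11 candidate values
prior = np.ones(11) / 11                  # uniform prior
x, N = 2, 5                               # illustrative data, as above
posterior = prior * theta**x * (1 - theta)**(N - x)
posterior /= posterior.sum()              # normalise to get p(theta|x)

# Equation (4.5): sum over theta of p(theta|x) * theta.
# This is exactly the posterior expectation of theta.
p_good_tomorrow = np.sum(posterior * theta)
print(p_good_tomorrow)
```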
Equation (4.4) says that the total probability for a good bus tomorrow (given the data, i.e. using the posterior distribution and not the prior distribution) is given by going through each possible θ value, working out the probability assuming the θ value you are considering is true, multiplying by the probability (given the data) that this θ value is actually true, and summing. In this particular problem, because P(good bus tomorrow|θ, x) = θ, it just so happens that the probability for tomorrow is the expectation value of θ using the posterior