Let $X$ be a random variable with a beta density. Given $X = p$, toss a $p$-coin $n$ times and observe the number of heads. Based on the number of heads, we are going to:

  • Identify the posterior distribution of $X$

  • Predict the chance of heads on the $(n+1)$st toss

21.1.1 Beta Prior

For positive integers $r$ and $s$, we derived the beta $(r, s)$ density

$$
f(x) = \frac{(r+s-1)!}{(r-1)!(s-1)!} x^{r-1}(1-x)^{s-1}, ~~~ 0 < x < 1
$$

by studying order statistics of i.i.d. uniform $(0, 1)$ random variables. The beta family can be extended to include parameters $r$ and $s$ that are positive but not integers. This is possible because of two facts that you have observed in exercises:

  • The Gamma function is a continuous extension of the factorial function.

  • If $r$ is a positive integer then $\Gamma(r) = (r-1)!$.

For fixed positive numbers $r$ and $s$, not necessarily integers, the beta $(r, s)$ density is defined by

$$
f(x) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)} x^{r-1}(1-x)^{s-1}, ~~~ 0 < x < 1
$$

We will not prove that this function integrates to 1, but it is true and should be believable because we have seen it to be true for integer values of the parameters.

To simplify notation, we will denote the constant in the beta $(r, s)$ density by $C(r, s)$:

$$
C(r, s) ~ = ~ \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}
$$

so that the beta $(r, s)$ density is given by $C(r, s)x^{r-1}(1-x)^{s-1}$ for $x \in (0, 1)$.
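
As a quick aside, here is a minimal numerical sketch (not part of the text; the parameter values are chosen arbitrarily) that computes $C(r, s)$ from the Gamma function and checks that the resulting density integrates to 1 even for non-integer $r$ and $s$.

from scipy.special import gamma
from scipy.integrate import quad

def beta_constant(r, s):
    # C(r, s) = Gamma(r+s) / (Gamma(r) Gamma(s))
    return gamma(r + s) / (gamma(r) * gamma(s))

def beta_density(x, r, s):
    # The beta (r, s) density on (0, 1)
    return beta_constant(r, s) * x**(r - 1) * (1 - x)**(s - 1)

# Non-integer parameters, chosen here only for illustration
r, s = 2.5, 3.7
total, _ = quad(beta_density, 0, 1, args=(r, s))
print(total)    # approximately 1.0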

Beta distributions are often used to model random proportions. In the previous chapter you saw the beta $(1, 1)$ distribution, better known as the uniform, used in this way to model a randomly picked coin.

You also saw that given that we know the value of $p$ for the coin we are tossing, the tosses are independent, but when we don’t know $p$ then the tosses are no longer independent. For example, knowledge of how the first toss came out tells us something about $p$, which in turn affects the probabilities of how the second toss might come out.

We will now extend these results by starting with a general beta $(r, s)$ prior for the chance that the coin lands heads.

21.1.2 The Experiment

Let $X$ have the beta $(r, s)$ distribution. This is the prior distribution of $X$. Denote the prior density by $f_X$. Then

$$
f_X(p) ~ = ~ C(r, s)p^{r-1}(1-p)^{s-1}, ~~~~ 0 < p < 1
$$

Given $X = p$, let $I_1, I_2, \ldots$ be i.i.d. Bernoulli $(p)$. That is, given $X = p$, toss a $p$-coin repeatedly and record the results as $I_1, I_2, \ldots$.

Let $S_n = I_1 + I_2 + \cdots + I_n$ be the number of heads in the first $n$ tosses. Then the conditional distribution of $S_n$ given $X = p$ is binomial $(n, p)$. It gives you the likelihood of the observed number of heads given a value of $p$.
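
Here is a small simulation sketch of the two-stage experiment (the helper name and the parameter values are mine, chosen only for illustration): draw $X$ from the beta prior, then toss a $p$-coin $n$ times.

import numpy as np

rng = np.random.default_rng()

def simulate_experiment(r, s, n):
    # Draw X = p from the beta (r, s) prior, then record n i.i.d. Bernoulli (p) tosses
    p = rng.beta(r, s)
    tosses = rng.binomial(1, p, size=n)    # I_1, ..., I_n given X = p
    return p, tosses.sum()                 # (p, S_n); S_n given X = p is binomial (n, p)

p, k = simulate_experiment(5, 3, 100)
print(p, k)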

21.1.3 Updating: The Posterior Distribution of $X$ Given $S_n$

Before running the experiment, our prior opinion is that $X$ has the beta $(r, s)$ distribution. To update that opinion after we have tossed $n$ times and seen the number of heads, we have to find the posterior distribution of $X$ given $S_n = k$.

As we have seen, the posterior density is proportional to the prior times the likelihood. For $0 < p < 1$,

$$
\begin{align*}
f_{X \vert S_n=k} (p) ~ &\propto ~ C(r, s) p^{r-1}(1-p)^{s-1} \cdot \binom{n}{k} p^k (1-p)^{n-k} \\
&\propto ~ p^{r+k-1}(1-p)^{s + (n-k) - 1}
\end{align*}
$$

because $C(r, s)$ and $\binom{n}{k}$ do not involve $p$.

You can see at once that this is the beta $(r+k, s+n-k)$ density:

$$
f_{X \mid S_n = k} (p) ~ = ~ C(r+k, s+n-k) p^{r+k-1}(1-p)^{s + n - k - 1}, ~~~ 0 < p < 1
$$

This beta posterior density is easy to remember. Start with the prior; update the first parameter by adding the observed number of heads; update the second parameter by adding the observed number of tails.
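
As a sanity check, an illustrative sketch (with $r = 5$, $s = 3$, $n = 100$, $k = 70$ chosen here, not specified at this point in the text) multiplies the prior by the binomial likelihood on a grid, normalizes, and compares the result with the beta $(r+k, s+n-k)$ density.

import numpy as np
from scipy.stats import beta, binom

r, s, n, k = 5, 3, 100, 70
p = np.linspace(0.001, 0.999, 999)
dp = p[1] - p[0]

unnormalized = beta.pdf(p, r, s) * binom.pmf(k, n, p)       # prior times likelihood
numerical_posterior = unnormalized / (unnormalized.sum() * dp)

exact_posterior = beta.pdf(p, r + k, s + n - k)             # beta (r+k, s+n-k) density

print(np.abs(numerical_posterior - exact_posterior).max())  # very small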

21.1.4 Conjugate Prior

The prior distribution of the probability of heads is from the beta family. The posterior distribution of the probability of heads, given the number of heads, is another beta density. The beta prior and binomial likelihood combine to result in a beta posterior. The beta family is therefore called a family of conjugate priors for the binomial distribution: the posterior is another member of the same family as the prior.

21.1.5 MAP Estimate: Posterior Mode

The MAP estimate of the chance of heads is the mode of the posterior distribution. If $r+k$ and $s+n-k$ are both greater than 1 then the mode of the posterior distribution of $X$ is

$$
\frac{r+k-1}{r+s+n-2}
$$
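
A quick numerical check (again with illustrative values only, not from the text): the formula agrees with the grid point at which the posterior density is largest.

import numpy as np
from scipy.stats import beta

r, s, n, k = 5, 3, 100, 70
p = np.linspace(0.001, 0.999, 9999)

posterior_pdf = beta.pdf(p, r + k, s + n - k)
print(p[np.argmax(posterior_pdf)])         # numerical mode, about 0.698
print((r + k - 1) / (r + s + n - 2))       # (r+k-1)/(r+s+n-2) = 74/106, about 0.698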

21.1.6 Posterior Mean

The posterior mean of $X$ given $S_n = k$ is the expectation of the beta posterior distribution, which for large $n$ is not far from the mode:

$$
E(X \mid S_n = k) ~ = ~ \frac{r+k}{r+s+n}
$$

Let’s examine this result in an example. Suppose the prior distribution of $X$ is beta $(5, 3)$, and thus the prior mean is $E(X) = 5/8 = 0.625$. Now suppose we are given that $S_{100} = 70$. Then the posterior distribution of $X$ given $S_{100} = 70$ is beta $(75, 33)$ with mean $75/108 = 0.694$.

The graph below shows the two densities along with the corresponding means. The red dot is at the observed proportion of heads.

Run the cell again, keeping $r = 5$ and $s = 3$ but changing $n$ to 10 and $k$ to 7, then again changing $n$ to 1000 and $k$ to 700. The observed proportion is 0.7 in all cases. Notice how increasing the sample size concentrates the posterior around 0.7. We will soon see the reason for this.

Also try other values of the parameters as well as $n$ and $k$, including values where the observed proportion is quite different from the mean of the prior.

# Prior: beta (r, s)
# Given: S_n = k

# Change the values
r = 5
s = 3
n = 100
k = 70

# Leave this line alone
plot_prior_and_posterior(r, s, n, k)
[Figure: prior beta (5, 3) density and posterior beta (75, 33) density, with their means marked and a red dot at the observed proportion 0.7]

You can see how the data dominate the prior. The posterior distribution is concentrated around the posterior mean. The prior mean was 0.625, but given that we got 70 heads in 100 tosses, the posterior mean is 0.694, which is very close to the observed proportion 0.7.

The formula for the posterior mean shows that for large $n$ it is likely to be close to the observed proportion of heads. Given $S_n = k$, the posterior mean is

$$
E(X \mid S_n = k) ~ = ~ \frac{r + k}{r + s + n}
$$

Therefore as a random variable, the posterior mean is

$$
E(X \mid S_n) ~ = ~ \frac{r + S_n}{r + s + n}
$$

As the number of tosses $n$ gets large, the number of heads $S_n$ is likely to get large too. So the value of $S_n$ is likely to dominate the numerator, and $n$ will dominate the denominator, because $r$ and $s$ are constants. Thus for large $n$, the posterior mean is likely to be close to $S_n/n$.
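
For instance, with the prior fixed at beta $(5, 3)$ and the observed proportion held at 0.7 (the same values used in the cell above), a short calculation shows the posterior mean moving toward 0.7 as $n$ grows:

r, s = 5, 3
for n, k in [(10, 7), (100, 70), (1000, 700)]:
    posterior_mean = (r + k) / (r + s + n)    # (r + k) / (r + s + n)
    print(n, round(posterior_mean, 4))        # 0.6667, 0.6944, 0.6994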

21.1.7 Prediction: The Distribution of $S_{n+1}$ Given $S_n$

As you saw in the previous chapter, the chance that a random coin lands heads is the expected value of its random probability of heads. Apply this to our current setting to see that

$$
P(S_1 = 1) ~ = ~ P(\text{first toss is a head}) ~ = ~ E(X) ~ = ~ \frac{r}{r+s}
$$
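
A one-line numerical confirmation (with $r = 5$ and $s = 3$ chosen only for illustration): averaging $p$ over the prior density gives $r/(r+s)$.

from scipy.stats import beta
from scipy.integrate import quad

r, s = 5, 3
# P(first toss is a head) = integral of p * f_X(p) dp = E(X)
prob_head, _ = quad(lambda p: p * beta.pdf(p, r, s), 0, 1)
print(prob_head, r / (r + s))    # both approximately 0.625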

Now suppose that we have the results of the first $n$ tosses, and that $k$ of those tosses were heads. Given that $S_n = k$, the possible values of $S_{n+1}$ are $k$ and $k+1$. We can now use our updated distribution of $X$ and the same reasoning as above to see that

$$
P(S_{n+1} = k+1 \mid S_n = k) ~ = ~ P(\text{toss } n+1 \text{ is a head} \mid S_n = k) ~ = ~ E(X \mid S_n = k) ~ = ~ \frac{r+k}{r + s + n}
$$

We can work out $P(S_{n+1} = k \mid S_n = k)$ by the complement rule. We now have a transition function. Given that $S_n = k$, the conditional distribution of $S_{n+1}$ is given by

$$
S_{n+1} =
\begin{cases}
k & \text{with probability } (s + n - k)/(r + s + n) \\
k+1 & \text{with probability } (r+k)/(r + s + n)
\end{cases}
$$

In other words, given the results of the first $n$ tosses, the chance that Toss $n+1$ is a tail is proportional to $s$ plus the number of tails, and the chance that Toss $n+1$ is a head is proportional to $r$ plus the number of heads.

You can think of the sequence $\{ S_n: n \ge 1 \}$ as a Markov chain, but keep in mind that the transition probabilities are not time-homogeneous: the formulas involve $n$.
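
To see the transition rule in action, here is a simulation sketch (parameter values and variable names are mine, not from the text). It evolves $S_n$ step by step using the probabilities above and compares the result with drawing $p$ from the beta prior and tossing directly; the two sample means should both be close to $nE(X) = nr/(r+s)$.

import numpy as np

rng = np.random.default_rng()
r, s, n, trials = 5, 3, 20, 10000

# Method 1: evolve the chain with the transition probabilities derived above
chain_counts = np.zeros(trials, dtype=int)
for t in range(trials):
    k = 0
    for i in range(n):
        # P(toss i+1 is a head | k heads in the first i tosses) = (r + k) / (r + s + i)
        if rng.random() < (r + k) / (r + s + i):
            k += 1
    chain_counts[t] = k

# Method 2: draw p from the beta (r, s) prior, then S_n given X = p is binomial (n, p)
p = rng.beta(r, s, size=trials)
direct_counts = rng.binomial(n, p)

print(chain_counts.mean(), direct_counts.mean())   # both close to n*r/(r+s) = 12.5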