The goal of Markov Chain Monte Carlo (MCMC) is to generate random samples from complicated high dimensional distributions about which we have incomplete information. For example, it might be that we don’t know the normalizing constant of the distribution, as we saw in the code breaking example of the previous section.
🎥 See More
Imagine a large state space . In our code-breaking example it is the space of all decoders, that is, all permutations of the alphabet.
Now suppose we are interested in a particular probability distribution on this space. We are going to call this probability distribution . In the code-breaking example, is the distribution of scores of the decoders. Remember that while we can find the score of any particular decoder, we can’t list them all, so we don’t have a computational formula for .
In the code-breaking setting, we are interested in the mode of . That is, we are looking for the decoder that has the highest score.
Since can’t compute , the idea is to see if we can instead simulate a random variable that has distribution .
That is what MCMC does. The procedure relies on a few observations.
Suppose we can create a Markov Chain that has our specified distribution as its stationary distribution. Then for large the distribution of will be approximately .
Suppose we can construct such a chain. By the definition of a mode, is most likely to be at or near the mode of , when is large. So by running the chain for a long time we will be able to identify the mode of .
Creating a chain involves creating a transition matrix. To create a transition matrix that results in as the stationary distribution, the easiest way is to try to ensure that the detailed balance equations are solved.
The detailed balance equations are equivalent to
The right hand side only involves the transition probabilities of the chain that we want to create. The left hand side only involves ratios of the terms in , and therefore can be checked even if we don’t know the constant that normalizes .
11.3.1Metropolis Algorithm¶
Exactly who proposed the first algorithm to create such a Markov Chain is the subject of some debate. A general version was proposed by Hastings. Here we will describe an earlier version attributed to Metropolis and co-authors in 1953.
The goal is to create a transition matrix so that and together solve the detailed balance equations.
The algorithm starts with any symmetric, irreducible transition matrix on the state space. For example, if the state space is numerical you could start with, “Wherever the chain is, it picks one of the three closest values (including itself) with probability each.” For a pair of states and , the transition probability is called the proposal probability.
The algorithm then introduces additional randomization to create a new chain that is irreducible and aperiodic and has as its stationary distribution.
Here are the rules that determine the transitions of the new chain.
Suppose the chain is at at time , that is, suppose . Pick a state according to the proposal probability . This is the candidate state to which your chain might move.
Define the acceptance ratio
If , set .
If , toss a coin that lands heads with chance .
If the coin lands heads, set .
If the coin lands tails, set .
Repeat all the steps, with as the starting value.
Thus the new chain either moves to the state picked according to , or it stays where it is. We say that it accepts a move to a new state based on and , and otherwise it doesn’t move.
The new chain is irreducible because the proposal chain is irreducible. It is aperiodic because it can stay in place. So it has a steady state distribution.
The algorithm says that this steady state distribution is the same as the distribution that was used to define the ratios .
🎥 See More
11.3.2How to Think About the Algorithm¶
Before we prove that the algorithm works, let’s examine what it is doing in the context of decoders.
First notice that we are requiring to be symmetric as well as irreducible. The symmetry requirement makes sense as each detailed balance equation involves transitions as well as .
Fix any starting decoder and call it . Now you have to decide where the chain is going to move next, that is, what the next decoder is going to be. The algorithm starts this process off by picking a decoder according to . We say that proposes a move to .
To decide whether or not the chain should move to , remember that the distribution contains the likelihoods of all the decoders. You want to end up with decoders that have high likelihood, so it is natural to compare and .
The algorithm does this by comparing the acceptance ratio to 1.
If , the likelihood of is at least as large that of , so you accept the proposal and move to .
If , the proposed decoder has less likelihood than the current , so it is tempting to stay at . But this risks the chain getting stuck at a local maximum. The algorithm provides a chance to avoid this, by tossing a biased coin. If the coin lands heads, the chain moves to even though has a lower likelihood than the current decoder . The idea is that from this new position there might be paths to decoders that have the highest likelihoods of all.
11.3.3The Algorithm Works¶
We will now show that the detailed balance equations are solved by the desired limit distribution and the transition matrix of the chain created by the Metropolis algorithm.
Take any two distinct states and .
11.3.3.1Case 1: ¶
Then . By the algorithm, and also by the symmetry of .
Therefore and the detailed balance equation is satisfied.
11.3.3.2Case 2: ¶
Then , so
Now , so the algorithm says .
Therefore
which is the same as
11.3.3.3Case 2: ¶
Reverse the roles of and in Case 2.
That’s it! A simple and brilliant idea that provides a solution to a difficult problem. In lab, you will see it in action when you implement the algorithm to decode text.