Discussion: May 4th
In this assignment, we will investigate Monte Carlo methods with a few simple examples. There is a lot of explanatory text, but the actual tasks are rather compact.
To understand the basics of MCMC methods, we consider the simple example in section 17.3 of the Deep Learning book, where we are interested in sampling single integers x from {0, …, n} according to some distribution. Try the following:
Now, we run a Markov chain. You can use tf.random.categorical for sampling; however, this only takes the logits of a distribution. In this case, you might want to define the transition matrix A in terms of logits in the first place and apply softmax (per column!) to get a regular probability matrix.
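A minimal sketch of such a chain, with an arbitrary state-space size and random logits (all names and sizes here are illustrative, not prescribed by the task):

```python
import tensorflow as tf

n = 10          # number of states: 0, ..., n-1 (illustrative)
steps = 10_000  # length of the chain

# Transition matrix defined via logits; softmax over axis 0 turns each
# column into a probability distribution, i.e. A[i, j] = p(next=i | current=j).
A_logits = tf.random.normal([n, n])
A = tf.nn.softmax(A_logits, axis=0)

x = 0  # arbitrary initial state
chain = []
for _ in range(steps):
    # tf.random.categorical expects logits of shape [batch, num_classes],
    # so we feed it the logits column of the current state. Passing raw
    # logits is fine: categorical normalizes them internally.
    x = int(tf.random.categorical(A_logits[:, x][tf.newaxis, :], num_samples=1)[0, 0])
    chain.append(x)
```

To sanity-check the chain, you can compare the empirical state frequencies against the stationary distribution of A, e.g. by iterating v ← Av until convergence as in section 17.3.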
In most of our use cases, the situation is not as in the example above: we are not given a transition distribution that we can simply run a Markov chain on. Instead, we have a desired target distribution and need to figure out how to get there.
Let’s try to sample from a mixture of Gaussians via Gibbs sampling.
You can use tensorflow-probability (in particular, the distributions module) to build the distributions. You should collect a reasonable number of samples (1000 or more) and plot both the target distribution (the mixture of Gaussians) and your samples. Do the samples reflect the distribution well? In particular, are both modes of the Gaussian mixture covered equally? You can check this visually and/or using statistics. Also, experiment with different locations/scales for the Gaussians; that is, move the components further apart or closer together and repeat the sampling process each time. The quality of the samples should vary dramatically depending on the distance between the components!
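One way to set this up (a sketch, not the required design; all parameter values below are placeholders) is to run Gibbs sampling on the joint distribution of the sample x and its mixture component z, alternating between p(z | x) and p(x | z):

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Two equally weighted 2-D components; move the means closer together or
# further apart to study how the mixing behavior changes.
locs = tf.constant([[-3.0, -3.0], [3.0, 3.0]])
scales = tf.constant([[1.0, 1.0], [1.0, 1.0]])
log_weights = tf.math.log(tf.constant([0.5, 0.5]))

components = tfd.MultivariateNormalDiag(loc=locs, scale_diag=scales)

x = tf.zeros(2)  # arbitrary starting point
samples = []
for _ in range(5_000):
    # Step 1: resample the component indicator z given x, with
    # p(z | x) proportional to weight_z * N(x; mu_z, sigma_z).
    z_logits = log_weights + components.log_prob(x)
    z = tf.random.categorical(z_logits[tf.newaxis, :], num_samples=1)[0, 0]
    # Step 2: resample x given z by drawing from component z.
    x = tfd.MultivariateNormalDiag(loc=locs[z], scale_diag=scales[z]).sample()
    samples.append(x.numpy())
```

With the means this far apart, the indicator z flips only rarely once the chain has settled into one mode; that is exactly the mixing problem the experiments above should expose.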
Finally, let’s give importance sampling a shot. Say we are once again interested in sampling from a mixture of Gaussians as above, but cannot do so directly. Instead, we can only sample from a simple Gaussian distribution with mean 0 and standard deviation 1. Let’s try to estimate the average norm of a sample via importance sampling. In the language of section 17.2: p is the mixture density, q is the standard Gaussian density, and f(x) is the norm of x.
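Concretely, with samples x^(i) drawn from q, the importance sampling estimate takes the form (this should match equation 17.10 up to notation):

$$\hat{s}_q = \frac{1}{n}\sum_{i=1}^{n} \frac{p(x^{(i)})\, f(x^{(i)})}{q(x^{(i)})}, \qquad x^{(i)} \sim q$$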
Take a bunch of samples from q and compute the Monte Carlo estimate (i.e. sample average) as in equation 17.10. You may want to plot how this evolves over time (i.e. with the number of samples). Does it converge? Next, also take samples from the mixture of Gaussians directly (the thing we pretended we couldn’t do) and compute the average norm on those samples. Compare with the importance sampling estimate. Do they converge to the same point? Does this depend on the locations/scales of the mixture components? Experiment with different values!
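A sketch of both estimates, reusing placeholder mixture parameters as above (assumed for illustration, not prescribed by the task):

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Target p: a 2-D mixture of Gaussians we pretend we cannot sample from.
p = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(probs=[0.5, 0.5]),
    components_distribution=tfd.MultivariateNormalDiag(
        loc=[[-3.0, -3.0], [3.0, 3.0]],
        scale_diag=[[1.0, 1.0], [1.0, 1.0]]))

# Proposal q: the standard Gaussian we actually can sample from.
q = tfd.MultivariateNormalDiag(loc=[0.0, 0.0], scale_diag=[1.0, 1.0])

n = 100_000
xs = q.sample(n)          # shape [n, 2]
f = tf.norm(xs, axis=-1)  # the quantity of interest: the norm of each sample
# Importance weights p(x)/q(x), computed in log space for numerical stability.
w = tf.exp(p.log_prob(xs) - q.log_prob(xs))

# Running importance sampling estimate of the average norm, per sample count.
running = tf.cumsum(w * f) / tf.range(1.0, n + 1.0)

# Direct Monte Carlo estimate (the thing we pretended we couldn't do).
direct = tf.reduce_mean(tf.norm(p.sample(n), axis=-1))
print(float(running[-1]), float(direct))
```

Plotting `running` shows how (and whether) the estimate converges; with the proposal far from the mixture components, the importance weights have very high variance and convergence can be extremely slow.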