Assignment 2: Let’s go to Monte Carlo
Discussion: April 25th
Deadline: April 24th, 20:00
NOTE THE CHANGED DEADLINE
In this assignment, we will be investigating Monte Carlo Methods with a few
simple examples. There is a lot of text for explanation, but the actual tasks
are rather compact.
Some starter code can be found on Gitlab!
Markov Chain Monte Carlo
To understand the basics of MCMC methods, we consider the simple example in
section 17.3 of the deep learning book, where we are interested in sampling
single integers x from 0, …, n according to some distribution. Try the
following:
- Define a transition matrix A as in the book. This should be a stochastic matrix,
i.e. each column must sum to 1. You can choose n as you wish; it’s okay to stick
with a small number (say, n=5). You can create the matrix from random values,
or write it down by hand. For theoretical reasons (a strictly positive stochastic
matrix is guaranteed to have a unique stationary distribution), it’s best to have no 0s in the matrix.
- Create an arbitrary initial distribution v(0), i.e. an n-element vector with elements
summing to 1.
- Repeatedly multiply the matrix A with the vector. The result should converge
very quickly (just a handful of steps) to a vector v’. This vector represents
the probability distribution that this Markov chain (represented by matrix A)
will converge to!
- Try different starting values of v(0). How does this influence v’ ?
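The steps above can be sketched in a few lines. This is a minimal NumPy sketch with an arbitrary random matrix (the concrete values of A are assumptions, not the book’s example); it also compares two different starting vectors v(0):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Arbitrary positive matrix, normalized so each COLUMN sums to 1 (stochastic).
A = rng.random((n, n)) + 0.1
A = A / A.sum(axis=0, keepdims=True)

# Two different initial distributions v(0).
v1 = np.ones(n) / n   # uniform
v2 = np.eye(n)[0]     # all mass on state 0

# Repeated multiplication converges to the stationary vector v'.
for _ in range(50):
    v1 = A @ v1
    v2 = A @ v2

print(v1)                      # the stationary distribution v'
print(np.abs(v1 - v2).max())   # ~0: the limit does not depend on v(0)
```

Since A has no zero entries, both starting points converge to the same v', which is exactly what the last bullet asks you to observe.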
Now, we run a Markov chain:
- Start off with an arbitrary state x(0).
- Repeatedly take the column from A that corresponds to the current state x(i)
(i.e. if the current state is 3, you take the 3rd column). Sample a new state
x(i+1) from this probability distribution. Do this many times (e.g. 1000s of times)
and collect all samples in a list or something similar.
- At the end, plot a histogram of your samples. This gives you an empirical
distribution over the samples. Does this distribution match the vector v’
computed above? If not, you might need to run the chain for more steps. If yes,
congratulations! By running your chain you are sampling from the distribution
represented by A.
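Running the chain itself might look like the following NumPy sketch (tf.random.categorical or tensorflow-probability would work just as well; the matrix values are again arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Column-stochastic transition matrix, as before.
A = rng.random((n, n)) + 0.1
A = A / A.sum(axis=0, keepdims=True)

# Stationary distribution v' via repeated multiplication, for comparison.
v = np.ones(n) / n
for _ in range(100):
    v = A @ v

# Run the chain: the column of the current state is the distribution
# over the next state.
x = 0
samples = []
for _ in range(20000):
    x = rng.choice(n, p=A[:, x])
    samples.append(x)

empirical = np.bincount(samples, minlength=n) / len(samples)
print(empirical)  # should be close to v'
```

With enough steps, the histogram of the samples should match v' up to sampling noise.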
- Note that you can use
tf.random.categorical for sampling; however, this only
takes logits of a distribution. In this case, you might want to define A in
terms of logits in the first place and apply softmax (per column!) to get
a regular probability matrix. Alternatively, tensorflow-probability supports
categorical distributions based on probabilities as well as logits.
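If you go the logits route, the per-column softmax can be sketched as follows (shown in NumPy for brevity; the TensorFlow version with tf.nn.softmax(logits, axis=0) is analogous):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

# Define A in terms of unconstrained logits...
logits = rng.normal(size=(n, n))

# ...and apply softmax over each COLUMN (axis=0) to get a stochastic matrix.
# Subtracting the per-column max is the usual numerical-stability trick.
exp = np.exp(logits - logits.max(axis=0, keepdims=True))
A = exp / exp.sum(axis=0, keepdims=True)

print(A.sum(axis=0))  # each column sums to 1
```

A nice side effect: softmax outputs are strictly positive, so A automatically has no 0 entries.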
Gibbs Sampling & Mixing
In most of our use cases, the situation differs from the example above: we are not
given a transition distribution that we could simply run a Markov chain on.
Instead, we have a desired target distribution (our model distribution) and need to figure out how to
sample from it.
Let’s try to sample from a mixture of Gaussians via Gibbs sampling.
- Set up a distribution. In order to be able to do any conditioning, this needs
to be multivariate. For simplicity (and easy plotting), stick to a 2D distribution.
Also, we want at least two components; these can be simple independent Gaussians.
- Start with an arbitrary initial sample (e.g. a vector of 0s). Now, repeatedly
do the following: Sample a new value for x given the value for y. Then, sample
a new value for y given the (new!) value for x. Since we now have sampled new
values for both dimensions, we have essentially taken a new sample of our 2D
distribution.
- You can use
tensorflow-probability (in particular, the distributions module)
to build the distributions. There are tutorials on
the Tensorflow website. You will
likely want to use MixtureSameFamily of Normal distributions.
- The hardest part is to figure out how to take a conditional sample. It turns
out that in this case, where p(x,y) is a mixture of Gaussians, the conditional
distribution p(x|y) (as well as the other way around) is a mixture of Gaussians
as well! You can find a derivation
here.
Being able to translate such formulae into code is important! By using independent
Gaussians as components, the only thing that actually changes in the conditional
distribution are the mixture coefficients!
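Putting this together, a Gibbs sampler for a two-component mixture of independent Gaussians could look like the sketch below. It uses plain NumPy instead of tfp, and all means, scales, and mixture weights are made-up example values; the key step is reweighting the mixture coefficients by the likelihood of the conditioned dimension:

```python
import numpy as np

rng = np.random.default_rng(3)

# Mixture of two independent 2D Gaussians (all parameters are made up).
# The components are deliberately close together so the chain mixes well.
pis = np.array([0.5, 0.5])           # mixture weights
mu = np.array([[-1.0, -1.0],         # component means, shape (K, 2)
               [ 1.0,  1.0]])
sigma = np.ones((2, 2))              # per-component, per-dimension std devs

def normal_pdf(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def sample_conditional(dim, other_value):
    """Sample dimension `dim` given the value of the other dimension.

    With independent components, the conditional is again a mixture of
    1D Gaussians with unchanged means/scales; only the mixture
    coefficients are reweighted by each component's likelihood of the
    conditioned value.
    """
    other = 1 - dim
    w = pis * normal_pdf(other_value, mu[:, other], sigma[:, other])
    w = w / w.sum()
    k = rng.choice(len(pis), p=w)                  # pick a component
    return rng.normal(mu[k, dim], sigma[k, dim])   # sample that Gaussian

# Gibbs sampling: alternate p(x|y) and p(y|x), always conditioning on
# the most recent value of the other dimension.
x, y = 0.0, 0.0
samples = []
for _ in range(10000):
    x = sample_conditional(0, y)
    y = sample_conditional(1, x)
    samples.append((x, y))
samples = np.array(samples)
```

If you move the component means further apart (say, ±3 or more), the conditional weight of the far component becomes tiny and the chain tends to get stuck in one mode, which is exactly the mixing problem the last part of the assignment asks you to observe.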
- You can use
tfp to sample from the conditional, one-dimensional, mixtures of
Gaussians as mentioned above. You might say: Why not just sample from a two-dimensional
mixture of Gaussians in the first place? The reason we don’t do this is that
this is a contrived example to show the concept of Gibbs sampling. ;)
- Please note the file
assignment02_starter.ipynb uploaded to the course Gitlab –
this can help with some of the above issues.
You should collect a reasonable number of samples (1000 or more) and plot both the
target distribution (mixture of Gaussians) and your samples. Do the samples
reflect the distribution well? In particular, are both modes of the Gaussian mixture
covered equally? You can do this visually and/or using statistics.
Also, experiment with different locations/scales for the Gaussians.
That is, move the components further apart or closer together and repeat the
sampling process each time. The quality of the samples should vary dramatically
based on the distance between components!