Discussion: May 2nd
Deadline: May 1st, 20:00
In this assignment, we will be implementing a Binary RBM. This requires some low-level programming rather than just sticking a bunch of layers together.
For additional assistance, there is a starter notebook on Gitlab, as well as a summary of the exercise board on Mattermost.
Start off by implementing the RBM model and algorithm 18.1 from the deep learning book.
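As a starting point, the model can be little more than a container for the weight matrix and the two bias vectors. The class and attribute names below are assumptions for illustration, not prescribed by the assignment or the starter notebook:

```python
import tensorflow as tf

class BinaryRBM(tf.Module):
    """Minimal binary RBM parameters: weights W, visible bias b, hidden bias c.

    Shapes follow the convention W: [n_visible, n_hidden] (an assumption).
    """
    def __init__(self, n_visible, n_hidden):
        self.W = tf.Variable(
            tf.random.normal([n_visible, n_hidden], stddev=0.01), name="W")
        self.b = tf.Variable(tf.zeros([n_visible]), name="b")  # visible bias
        self.c = tf.Variable(tf.zeros([n_hidden]), name="c")   # hidden bias

# e.g. for flattened MNIST images with 64 hidden units:
rbm = BinaryRBM(784, 64)
```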
The core of the implementation is the gibbs_update step. Recall that RBMs allow for efficient Gibbs sampling by first updating all hidden units at once, and then all visible ones. One Gibbs update means updating both hidden and visible units (i.e. sampling h given v as well as vice versa)! Use the conditional distributions from chapter 20.2; tfp.distributions.Bernoulli should be helpful. Docs can be found here. Do not use other submodules such as MCMC-related functions!
The gradient computation (e.g. for the term v^T W h) requires some thought. The positive phase requires samples of h. The most straightforward way to get these is to take a sample from p(h|v) where v is the real data. You can use GradientTape to compute the gradient update by treating
“negative phase minus positive phase” as a loss function to be minimized.
However, the “ideal” value for this loss will actually be 0 (data and model
distributions are identical) despite the function in principle being able to
take on any value. If your loss decreases to arbitrarily large negative values
there is likely something wrong with your training (although it’s normal for
negative values to appear in the beginning).
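One common way to set up such a loss is via the free energy of a binary RBM: minimizing the mean free energy of the data minus that of the model samples reproduces the desired gradient. The sketch below uses random stand-ins for both data and model samples purely for illustration; in training, v_model would come from your Gibbs chain:

```python
import tensorflow as tf

def free_energy(v, W, b, c):
    # F(v) = -v.b - sum_j softplus(c_j + (v W)_j)  for a binary RBM
    return (-tf.linalg.matvec(v, b)
            - tf.reduce_sum(tf.math.softplus(c + tf.matmul(v, W)), axis=-1))

W = tf.Variable(tf.random.normal([784, 64], stddev=0.01))
b = tf.Variable(tf.zeros([784]))
c = tf.Variable(tf.zeros([64]))
v_data = tf.cast(tf.random.uniform([8, 784]) < 0.5, tf.float32)   # real data
v_model = tf.cast(tf.random.uniform([8, 784]) < 0.5, tf.float32)  # chain samples

with tf.GradientTape() as tape:
    # positive phase from the data, negative phase from the model samples;
    # stop_gradient marks the samples as constants w.r.t. the parameters.
    loss = (tf.reduce_mean(free_energy(v_data, W, b, c))
            - tf.reduce_mean(free_energy(tf.stop_gradient(v_model), W, b, c)))
grads = tape.gradient(loss, [W, b, c])
```

Note that this loss hovering around 0 (or dipping slightly negative early on) is exactly the behavior described above.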
If you instead compute the gradients manually rather than through GradientTape, this should not be an issue.
Once you have the basic algorithm going, you might want to test it first. Since
we are working with binary RBMs, MNIST seems like the best option here. You may
“binarize” the data by rounding all values to 0 or 1, however since MNIST is
already almost binary this will likely not make a large difference
(still, it is “more correct” to do so). Experiment
with different numbers of hidden units and burn-in steps, and generate some
samples from the trained models for subjective evaluation.
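Binarization itself is a one-liner; thresholding at 0.5 rounds every pixel to exactly 0 or 1. The small stand-in array below is illustrative; in practice x would be MNIST images scaled to [0, 1] (e.g. from tf.keras.datasets.mnist):

```python
import numpy as np

def binarize(x):
    # Threshold at 0.5 so every pixel becomes exactly 0.0 or 1.0.
    return (x >= 0.5).astype(np.float32)

x = np.array([[0.1, 0.5, 0.9],
              [0.0, 0.7, 1.0]])
print(binarize(x))  # -> [[0. 1. 1.] [0. 1. 1.]]
```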
Note: To get the best results, your Markov chains likely need to run for
a few hundred steps each training step! This draws out training quite a bit.
Next, you should improve on the basic procedure.
Algorithm 18.1 is rather slow due to the long burn-in time required at each gradient step. Implement contrastive divergence and/or persistent contrastive divergence (18.2/18.3) to alleviate this. This requires relatively minor changes.
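The persistent variant (PCD, 18.3) might be sketched as follows: the negative-phase chain is kept alive across gradient steps instead of being re-burned-in each time. The Bernoulli helper below is a stand-in for tfp.distributions.Bernoulli, and all shapes, step counts, and names are illustrative assumptions:

```python
import tensorflow as tf

def _bernoulli(p):
    # Sample 0/1 with probability p (stand-in for tfp.distributions.Bernoulli).
    return tf.cast(tf.random.uniform(tf.shape(p)) < p, tf.float32)

def gibbs_update(v, W, b, c):
    # Full Gibbs step: all hidden units, then all visible units.
    h = _bernoulli(tf.sigmoid(c + tf.matmul(v, W)))
    return _bernoulli(tf.sigmoid(b + tf.matmul(h, W, transpose_b=True)))

W = tf.random.normal([784, 64], stddev=0.01)
b = tf.zeros([784]); c = tf.zeros([64])

# Persistent batch of "fantasy particles": initialized once, never reset.
v_fantasy = tf.cast(tf.random.uniform([8, 784]) < 0.5, tf.float32)

for _ in range(3):           # a few (hypothetical) training steps
    for _ in range(5):       # far fewer Gibbs steps than a full burn-in
        v_fantasy = gibbs_update(v_fantasy, W, b, c)
    # ...use v_fantasy for the negative phase, apply the gradient update,
    # and carry v_fantasy over to the next training step unchanged.
```

Plain CD differs only in re-initializing the chain from the current data batch at each step instead of carrying v_fantasy over.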
Test your algorithms once again and compare the results (as well as the speed at which you achieve them) to the basic algorithm. You can probably cut the number of burn-in steps significantly (say, 5-10x fewer).
Feel free to try other training methods such as pseudolikelihood or score matching (from optional reading).