Assignment 4: Deep Belief Networks
Discussion: May 18th
Having successfully implemented RBMs, we will continue our homage to Deep
Learning history by implementing a Deep Belief Network.
Training DBNs
This is actually not too difficult if you have a proper RBM implementation.
You will simply need to iterate the RBM training procedure over the layers
(check the Deep Learning book, chapter 20.3): Start by training an RBM as usual
(since we are still looking at binary variables, you might want to stick with MNIST).
When this is finished, add a second hidden layer and train it, treating the
first hidden layer as “visible” in this case. Once that is done, add a third
hidden layer, and so on; two or three hidden layers are plenty for this assignment.
As always, there are a few caveats (a sketch of the complete layer-wise loop follows after this list).
- You may be tempted to create three variables for each layer: connection
weights, visible biases and hidden biases. However, the biases should actually
“propagate”: the hidden biases of one iteration serve as the visible biases of
the next (when you add the next hidden layer). That is, each layer has only one
bias vector in total.
- When training “higher” layers, you will need to run the input data (e.g. MNIST
images) through the already-trained layers in order to provide “visible” data
for the training procedure (using the conditional probability definitions of the
RBM). The question is whether to sample binary activations
at each hidden layer or to simply use the probabilities as they are. You can
try both; the latter option is likely more stable, since using the probabilities
corresponds to averaging over many sampling operations.
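
To make this concrete, here is a minimal NumPy sketch of the greedy layer-wise
loop, incorporating both caveats. The helper train_rbm is a stand-in for your RBM
trainer from the previous assignment; its name, signature and return values
(weight matrix plus both bias vectors) are assumptions you will need to adapt.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_dbn(data, hidden_sizes, train_rbm):
    """Greedy layer-wise training: one RBM per entry of hidden_sizes.

    train_rbm(visible, n_hidden, visible_bias) is assumed to return a
    weight matrix of shape (n_visible, n_hidden) plus the trained
    visible and hidden bias vectors.
    """
    weights, biases = [], []
    visible = data                 # shape (n_samples, n_visible)
    prev_bias = None               # hidden bias of the previous layer
    for n_hidden in hidden_sizes:
        W, b_vis, b_hid = train_rbm(visible, n_hidden, visible_bias=prev_bias)
        if prev_bias is None:
            biases.append(b_vis)   # bias of the data layer (first RBM only)
        weights.append(W)
        biases.append(b_hid)       # one bias vector per layer from here on
        # Run the data through the new layer so it can serve as "visible"
        # data for the next RBM; using probabilities instead of binary
        # samples averages over many sampling operations and is more stable.
        visible = sigmoid(visible @ W + b_hid)
        # The hidden bias of this layer doubles as the visible bias of the
        # next RBM, so each layer ends up with a single bias vector.
        prev_bias = b_hid
    return weights, biases         # len(biases) == len(weights) + 1

Something like train_dbn(train_images, [500, 500, 2000], train_rbm) would then
train a three-layer DBN; the layer sizes here are only an example.
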
When all layers have finished training, you can try sampling from the DBN to see
the results. Again, see chapter 20.3 for the procedure. Basically, you run Gibbs
sampling on the top two layers (which together still form an RBM) and then perform
a single ancestral sampling pass down to the visible units.
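
Continuing with the conventions of the train_dbn sketch above (still just
assumptions), sampling could look roughly like this:

def sample_dbn(weights, biases, n_gibbs=1000, rng=None):
    """Draw one sample from a trained DBN (see train_dbn above)."""
    if rng is None:
        rng = np.random.default_rng()
    W_top = weights[-1]                       # the top two layers form an RBM
    b_pen, b_top = biases[-2], biases[-1]     # its visible and hidden biases
    # Gibbs chain between the top two layers, starting from random states.
    top = (rng.random(W_top.shape[1]) < 0.5).astype(float)
    for _ in range(n_gibbs):
        pen = (rng.random(W_top.shape[0]) < sigmoid(W_top @ top + b_pen)).astype(float)
        top = (rng.random(W_top.shape[1]) < sigmoid(pen @ W_top + b_top)).astype(float)
    # Single ancestral sampling pass through the layers below, down to the visible units.
    state = pen
    for W, b in zip(weights[-2::-1], biases[-3::-1]):
        probs = sigmoid(W @ state + b)        # p(lower layer | upper layer)
        state = (rng.random(probs.shape) < probs).astype(float)
    return probs                              # visible probabilities, e.g. one MNIST image

Reshaping the returned vector to 28x28 and plotting it (or the binary sample in
state) should give you something resembling a digit if training went well.
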
Unsupervised Pretraining
Using generative models for sampling is nice and all, but there are many more
possible applications. A classic use case is classification when little labeled
data is available but unlabeled data is plentiful. We can simulate such a
situation by simply taking a subset of our labeled data and pretending that the
rest isn’t there (or rather, doesn’t have labels). Try this:
- Train a DBN on the full MNIST dataset.
- Take a random sample of the training data (maybe 1000 examples or even fewer).
- Stick a softmax classification layer on top of the last hidden layer of the
DBN and train it only on the previously selected labeled subset (see the sketch
after this list). To do this, run the DBN from the visible layer up to the top
hidden layer to extract “features” of the data and then feed these features to
the softmax layer, basically treating the DBN as an MLP with sigmoid activations.
Check chapter 20.3 again for some thoughts on this. Do not train the hidden
layers anymore, only the softmax layer.
- Evaluate the model performance on the full test set.
- Train an MLP from the ground up (for fairness, use the same number of hidden
units and sigmoid activations), only on the labeled subset. Evaluate it and
compare its test set performance to that of the repurposed DBN. Which version
performs better?
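
For the pretraining experiment, here is a rough sketch of the feature extraction
and the softmax layer, again reusing the names of the train_dbn sketch above
(labels assumed one-hot, learning rate and step count arbitrary):

def dbn_features(x, weights, biases):
    """Deterministic forward pass: treat the DBN as an MLP with sigmoids."""
    for W, b in zip(weights, biases[1:]):     # biases[0] is the visible bias
        x = sigmoid(x @ W + b)
    return x

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(feats, labels_onehot, n_steps=2000, lr=0.1):
    """Batch gradient descent on the cross-entropy of a single softmax layer."""
    W = np.zeros((feats.shape[1], labels_onehot.shape[1]))
    b = np.zeros(labels_onehot.shape[1])
    for _ in range(n_steps):
        probs = softmax(feats @ W + b)
        grad = (probs - labels_onehot) / len(feats)   # d(mean cross-entropy)/d(logits)
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Possible usage: train only the softmax layer on features of the labeled
# subset, then evaluate on features of the full test set.
# W_s, b_s = train_softmax(dbn_features(x_subset, weights, biases), y_subset_onehot)
# test_probs = softmax(dbn_features(x_test, weights, biases) @ W_s + b_s)
# accuracy = (test_probs.argmax(axis=1) == y_test).mean()
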