Assignment 4: Deep Belief Networks
Discussion: May 18th
Having successfully implemented RBMs, we will continue our homage to Deep
Learning history by implementing a Deep Belief Network.
Training DBNs
This is actually not too difficult if you have a proper RBM implementation.
You will simply need to iterate the RBM training procedure over the layers
(check the Deep Learning book, chapter 20.3): Start by training an RBM as usual
(since we are still looking at binary variables, you might want to stick with MNIST).
When this is finished, add a second hidden layer and train it, treating the
first hidden layer as “visible” in this case. Once that is done, add a third
hidden layer, and so on; two or three hidden layers are plenty for this assignment.
As always, there are a few caveats (a sketch of the complete layer-wise loop follows after this list).
- You may be tempted to create three variables for each layer: connection
weights, visible biases and hidden biases. However, the biases should actually
“propagate”: the hidden biases of one iteration serve as the visible biases of
the next (when you add the next hidden layer). That is, each layer has only one
bias vector in total.
- When training “higher” layers, you will need to run the input data (e.g. MNIST
images) through the already-trained layers in order to provide “visible” data
for the training procedure (using the conditional probability definitions of the
RBM). The question is whether to sample binary activations
at each hidden layer or to simply use the probabilities as they are. You can
try both; the latter option is likely more stable, since using the probabilities
corresponds to averaging over many sampling operations.
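
To make this concrete, here is a minimal NumPy sketch of the greedy layer-wise
loop, incorporating both caveats. The helper train_rbm is a stand-in for your RBM
trainer from the previous assignment; its name, signature and return values
(weight matrix plus both bias vectors) are assumptions you will need to adapt.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_dbn(data, hidden_sizes, train_rbm):
    """Greedy layer-wise training: one RBM per entry of hidden_sizes.

    train_rbm(visible, n_hidden, visible_bias) is assumed to return a
    weight matrix of shape (n_visible, n_hidden) plus the trained
    visible and hidden bias vectors.
    """
    weights, biases = [], []
    visible = data                 # shape (n_samples, n_visible)
    prev_bias = None               # hidden bias of the previous layer
    for n_hidden in hidden_sizes:
        W, b_vis, b_hid = train_rbm(visible, n_hidden, visible_bias=prev_bias)
        if prev_bias is None:
            biases.append(b_vis)   # bias of the data layer (first RBM only)
        weights.append(W)
        biases.append(b_hid)       # one bias vector per layer from here on
        # Run the data through the new layer so it can serve as "visible"
        # data for the next RBM; using probabilities instead of binary
        # samples averages over many sampling operations and is more stable.
        visible = sigmoid(visible @ W + b_hid)
        # The hidden bias of this layer doubles as the visible bias of the
        # next RBM, so each layer ends up with a single bias vector.
        prev_bias = b_hid
    return weights, biases         # len(biases) == len(weights) + 1

Something like train_dbn(train_images, [500, 500, 2000], train_rbm) would then
train a three-layer DBN; the layer sizes here are only an example.
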
When all layers have finished training, you can try sampling from the DBN to see
the results. Again, see chapter 20.3 for the procedure. Basically, you run Gibbs
sampling on the top two layers (which together still form an RBM) and then perform
a single ancestral sampling pass down to the visible units.
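
Continuing with the conventions of the train_dbn sketch above (still just
assumptions), sampling could look roughly like this:

def sample_dbn(weights, biases, n_gibbs=1000, rng=None):
    """Draw one sample from a trained DBN (see train_dbn above)."""
    if rng is None:
        rng = np.random.default_rng()
    W_top = weights[-1]                       # the top two layers form an RBM
    b_pen, b_top = biases[-2], biases[-1]     # its visible and hidden biases
    # Gibbs chain between the top two layers, starting from random states.
    top = (rng.random(W_top.shape[1]) < 0.5).astype(float)
    for _ in range(n_gibbs):
        pen = (rng.random(W_top.shape[0]) < sigmoid(W_top @ top + b_pen)).astype(float)
        top = (rng.random(W_top.shape[1]) < sigmoid(pen @ W_top + b_top)).astype(float)
    # Single ancestral sampling pass through the layers below, down to the visible units.
    state = pen
    for W, b in zip(weights[-2::-1], biases[-3::-1]):
        probs = sigmoid(W @ state + b)        # p(lower layer | upper layer)
        state = (rng.random(probs.shape) < probs).astype(float)
    return probs                              # visible probabilities, e.g. one MNIST image

Reshaping the returned vector to 28x28 and plotting it (or the binary sample in
state) should give you something resembling a digit if training went well.
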
Unsupervised Pretraining
Using generative models for sampling is nice and all, but there are many more
possible applications. A classic use case is classification when little labeled
data is available but unlabeled data is plentiful. We can simulate such a
situation by simply taking a subset of our labeled data and pretending that the
rest isn’t there (or rather, doesn’t have labels). Try this:
- Train a DBN on the full MNIST dataset.
- Take a random sample of the training data (maybe 1000 examples or even fewer).
- Stick a softmax classification layer on top of the last hidden layer of the
DBN and train it only on the previously selected labeled subset (see the sketch
after this list). To do this, run the DBN from the visible layer up to the top
hidden layer to extract “features” of the data and then feed these features to
the softmax layer, basically treating the DBN as an MLP with sigmoid activations.
Check chapter 20.3 again for some thoughts on this. Do not train the hidden
layers anymore, only the softmax layer.
- Evaluate the model performance on the full test set.
- Train an MLP from the ground up (for fairness, use the same number of hidden
units and sigmoid activations), only on the labeled subset. Evaluate it and
compare its test set performance to that of the repurposed DBN. Which version
performs better?
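
For the pretraining experiment, here is a rough sketch of the feature extraction
and the softmax layer, again reusing the names of the train_dbn sketch above
(labels assumed one-hot, learning rate and step count arbitrary):

def dbn_features(x, weights, biases):
    """Deterministic forward pass: treat the DBN as an MLP with sigmoids."""
    for W, b in zip(weights, biases[1:]):     # biases[0] is the visible bias
        x = sigmoid(x @ W + b)
    return x

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(feats, labels_onehot, n_steps=2000, lr=0.1):
    """Batch gradient descent on the cross-entropy of a single softmax layer."""
    W = np.zeros((feats.shape[1], labels_onehot.shape[1]))
    b = np.zeros(labels_onehot.shape[1])
    for _ in range(n_steps):
        probs = softmax(feats @ W + b)
        grad = (probs - labels_onehot) / len(feats)   # d(mean cross-entropy)/d(logits)
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Possible usage: train only the softmax layer on features of the labeled
# subset, then evaluate on features of the full test set.
# W_s, b_s = train_softmax(dbn_features(x_subset, weights, biases), y_subset_onehot)
# test_probs = softmax(dbn_features(x_test, weights, biases) @ W_s + b_s)
# accuracy = (test_probs.argmax(axis=1) == y_test).mean()
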