## General

**Deadline: November 29th, 9am**

Building on the last assignment, this time we want to iron out some of the issues that were left. In particular:

- less wasteful padding,
- using embeddings to avoid one-hot vectors,
- avoiding computations on padded time steps,
- using Keras to simplify our code,
- using more advanced architectures such as LSTMs, stacked RNNs or bidirectional RNNs.

**The notebook associated with the practical exercise can be found here.**

In the last assignment, we padded all sequences to the longest one in the dataset
because we need “rectangular” input tensors.
However, at the end of the day, only each *batch* of inputs needs to be the same
length. If the longest sequence in a batch has length 150, the other sequences
need only be padded to that length, not the longest in the whole dataset!

Thus, if we can find a way to delay the padding until after we have formed batches, we can
gain some efficiency. Unfortunately, with sequences of different lengths we cannot even
create a dataset via `tf.data.Dataset.from_tensor_slices` to apply batching to!

Luckily, there are other ways to create datasets. We will be using `from_generator`,
which allows for creating datasets from elements returned by arbitrary Python
generators. Even better, there is also a `padded_batch` transformation function
which batches inputs *and* pads them to the longest length in the batch (what
would happen if we tried the regular `batch` method?). See the notebook for a
usage example!
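A minimal sketch of how this could look (the toy data, the generator and the batch size here are made up; the notebook shows the intended usage):

```python
import tensorflow as tf

# Hypothetical toy data: token-index sequences of varying length, plus labels.
sequences = [[3, 14, 15], [9, 2], [6, 5, 3, 5, 8, 9]]
labels = [0, 1, 0]

def gen():
    # Yield one (sequence, label) pair at a time; lengths may differ freely.
    for seq, label in zip(sequences, labels):
        yield seq, label

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=[None], dtype=tf.int32),  # variable-length sequence
        tf.TensorSpec(shape=[], dtype=tf.int32),      # scalar label
    ),
)

# padded_batch pads each sequence only to the longest one *in its batch*.
for seqs, labs in dataset.padded_batch(batch_size=2):
    print(seqs.shape)  # (2, 3) for the first batch, not (2, 6)
```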

**Note:** TensorFlow also has `RaggedTensor`s. These are special tensors allowing
different shapes per element. You can find a guide here. You could directly
create a dataset `from_tensor_slices` by supplying a ragged tensor, which is
arguably easier than using a generator. Unfortunately, ragged tensors are not
supported by `padded_batch`. Sad!

**However**, many TensorFlow operations support ragged tensors, so padding can
become unnecessary in many places! You can check the guide for an example with a
Keras model. You can try this approach if you want, but for the rest of the
assignment we will continue with the padded batches (the ragged version will likely
be very slow).

There is still a problem with the above approach. In our dataset, there are many short sequences and few very long ones. Unfortunately, it is very likely that all (or most) batches contain at least one rather long sequence. That means that all the other (short) sequences have to be padded to the long one! Thus, in the worst case, our per-batch padding might only gain us very little. It would be great if there was a way to sort the data such that only sequences of a similar length are grouped in a batch… Maybe there is something in the notebook?
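One way to achieve such a grouping is bucketing via `bucket_by_sequence_length` (which also comes up in the `tf.function` notes at the end). A minimal sketch, assuming a `dataset` of `(sequence, label)` pairs like the one above, with made-up boundaries and batch sizes; in recent TensorFlow versions this is a `Dataset` method, while older versions have it under `tf.data.experimental`:

```python
# Group sequences into length buckets and batch within each bucket, so that
# padding only goes up to (roughly) the longest sequence in the bucket.
bucketed = dataset.bucket_by_sequence_length(
    element_length_func=lambda seq, label: tf.shape(seq)[0],
    bucket_boundaries=[50, 100, 200],    # hypothetical bucket edges
    bucket_batch_sizes=[64, 32, 16, 8],  # one entry more than the boundaries
)
```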

**Note:** If you truncated sequences to a relatively small value, like 200, bucketing
may provide little benefit. The reason is that there will be so many sequences
at the exact length 200 that the majority of batches will belong to this bucket.
However, if you decide to allow a larger value, say length 500, bucketing should
become more and more effective (noticeable via shorter time spent per batch).

Previously, we represented words by one-hot vectors. This is wasteful in terms
of memory, and also the matrix products in our deep models will be very
inefficient. It turns out that multiplying a matrix with a one-hot vector simply
*extracts the corresponding column from the matrix*.

Keras offers an `Embedding` layer for an efficient implementation. Use this
instead of the one-hot operation! Note that the layer adds additional parameters;
however, it can actually result in *fewer* parameters overall if you choose a small
enough embedding size (recall the lecture discussion on using linear hidden
layers).
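A minimal sketch of the layer, with hypothetical sizes, to illustrate where the parameter trade-off comes from:

```python
import tensorflow as tf

vocab_size, embedding_dim = 20_000, 64  # hypothetical sizes

# Maps integer indices of shape (batch, time) to dense vectors of shape
# (batch, time, embedding_dim); equivalent to a one-hot encoding followed by
# a matrix product, but implemented as an efficient lookup.
embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                      output_dim=embedding_dim)
# This adds vocab_size * embedding_dim = 1,280,000 parameters, but any layer
# on top now only needs weights of size embedding_dim x units instead of
# vocab_size x units, which is where the overall saving can come from.
```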

Keras offers various RNN layers. These layers take an entire 3d batch of inputs
(`batch x time x features`) and return either a complete output sequence, or only
the final output time step. There are two ways to use RNNs:

- The more general one is to define a *cell* which implements the per-step computations, i.e. how to compute a new state given a previous state and the current input. There are pre-built cells for simple RNNs, GRUs and LSTMs (`LSTMCell` etc.). The cells are then put into the `RNN` layer, which wraps them in a loop over time.
- There are also complete classes like `LSTM` which already wrap the corresponding cell.

While the first approach gives more flexibility (we could define our own cells),
it is *highly* recommended that you stick with the second approach, as this
provides highly optimized implementations for common usage scenarios. Check the
docs for the conditions under which these implementations can be used!

Once you have an RNN layer, you can use it just like other layers, e.g. in
a sequential model. Maybe you have an embedding layer, followed by an LSTM,
followed by a dense layer that produces the final output (see the sketch below).
Now, you can easily create stacked RNNs (just put multiple RNN layers one after
the other), use `Bidirectional` RNNs, etc. Also try LSTMs vs GRUs!
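For instance, a sketch of such a model (all sizes are hypothetical), which also demonstrates stacking and `Bidirectional`:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Hypothetical vocabulary and embedding sizes.
    tf.keras.layers.Embedding(input_dim=20_000, output_dim=64),
    # return_sequences=True so the next RNN layer receives the full sequence.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.LSTM(64),  # returns only the final time step
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. binary classification
])
```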

One method to prevent new states from being computed on padded time steps is to
use a *mask*. A mask is a binary tensor with shape `(batch x time)`, with 1s
representing “real” time steps and 0s representing padding. Given such a mask,
the state computation can be “corrected” like this:

`new_state = mask_at_t * new_state + (1 - mask_at_t) * old_state`

Where the mask is 1, the new state will be used. Where it is 0, the old state will be propagated instead!
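If you were implementing the loop over time yourself, the correction could look roughly like this (a sketch only; `cell_step`, `initial_state`, `inputs`, `mask` and `max_time` are all assumed to exist):

```python
# Sketch: `inputs` has shape (batch, time, features), `mask` has shape
# (batch, time) with 1.0 for real steps and 0.0 for padding.
state = initial_state
for t in range(max_time):
    new_state = cell_step(inputs[:, t], state)   # hypothetical per-step function
    m = mask[:, t][:, None]                      # (batch, 1), broadcasts over state
    state = m * new_state + (1 - m) * state      # keep the old state where padded
```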

Masking with Keras is almost too simple: pass the argument `mask_zero=True` to
your embedding layer (the constructor, not the call)! You can read more about
masking here. The
short version is that tensors can carry a mask as an attribute, and Keras
layers can be configured to use and/or modify these masks in some way. Here,
the embedding layer “knows” to create a mask such that 0 inputs (remember that index
0 encodes padding) are masked as `False`, and the RNN layers are implemented to
perform something like the formula above.
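In code, this is a single constructor argument; a sketch with hypothetical sizes:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # mask_zero=True attaches a mask marking index-0 (padding) positions as
    # False; downstream layers such as LSTM pick it up automatically.
    tf.keras.layers.Embedding(input_dim=20_000, output_dim=64, mask_zero=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```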

Add masking to your model! The result should be much faster learning
(in terms of steps needed to reach a particular performance, not time),
in particular with `post` padding (the only kind of padding supported by
`padded_batch`). The effect will be more dramatic the longer your sequences are.

Implement the various improvements outlined in this assignment. Experiment with adding them one by one and judge the impact (on accuracy, training time, convenience…) of each. You can also carry out “ablation” studies where you take the full model with all improvements, and remove them one at a time to judge their impact.

You can also try using larger or smaller vocabulary sizes and maximum sequence lengths and investigate the impact of these parameters!

If for some reason you are not using Keras RNN layers, but rather your own loops
over time, there are a few more things to be aware of when using `tf.function`:

- There seems to be an issue related to data shapes when using `bucket_by_sequence_length` and the final batch in the dataset (which can be smaller than the others). If you receive strange errors about unknown data shapes, you can set `drop_remainder=True`, or use regular `padded_batch` instead of bucketing.
- A `tf.function` is re-compiled every time it receives an input with a different “signature”. This is defined as the shape and data type of the tensor. When every batch has a different sequence length, this causes the training loop to be re-compiled every step. You can fix this by supplying an `input_signature` to `tf.function` (see the sketch below); please check the API docs. You can also pass `experimental_relax_shapes=True` instead, although this seems to be a little less effective.
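For illustration, a sketch of such a signature for a training step that takes integer token batches and labels; the `None` entries leave the batch and time dimensions flexible, and the dtypes here are assumptions that depend on your data:

```python
import tensorflow as tf

@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, None], dtype=tf.int32),  # inputs: batch x time
    tf.TensorSpec(shape=[None], dtype=tf.int32),        # labels: batch
])
def train_step(inputs, labels):
    ...  # forward pass, loss and gradient updates go here
```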