Assignment 7: Attention-based Neural Machine Translation
Deadline: November 29th, 9am
Note: This assignment was changed from last year, moving from an encoder-decoder
with attention to a Transformer architecture. This is the first time we are running
it in this form, so it is a bit experimental.
In this task, you will implement a model for neural machine translation using
a Transformer.
We will follow a TF tutorial
for this purpose.
Do not just run the code and call it a day. To make sure you understand what is going
on, you need to answer some questions, posted further below!
Notes on the Tutorial
READ THIS
- You can copy and paste the code from the tutorial into your own notebook, or
use the “Run in Google Colab” button at the top.
- If you go for the second option, make sure you “Save a copy in Drive” via the File
menu, otherwise your notebook will not be saved. Also, after saving a copy, remove
the tick for “Omit cell outputs when saving this notebook” in the settings!
- Also, please do not submit the full notebook with all the architecture images.
This just bloats the submission unnecessarily! Remove the cells with images before
submission!
- Do not run the code cell with pip install in the Setup section! This might
mess up your environment! The only library you need to install is tensorflow-text,
via !pip install tensorflow-text==2.14.0 (see the example cell right after this list).
- If you don’t install the correct version of tf-text, it will pull in TF 2.15 instead,
which for whatever reason is MUCH slower (we’re talking 30x or so). Training will
take forever!
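For reference, the only install cell you should run could look like the sketch below
(the version pin matches the note above; the import check is just a suggestion):

# Only installation needed for this assignment: tensorflow-text pinned to 2.14.0,
# which keeps the matching TensorFlow 2.14 (TF 2.15 is much slower here).
!pip install tensorflow-text==2.14.0

import tensorflow as tf
import tensorflow_text  # registers the TF Text ops used by the tutorial's tokenizers

print(tf.__version__)  # should report 2.14.x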
Questions for Understanding
Here are a few questions for you to check how well you understood the tutorial.
Please answer them (briefly) in your solution!
- Which parts of the sentence are used as tokens? Characters? Words?
Or something else (if so, what)?
- Do the same tokens in different languages have the same ID? For example, would
the same token index map to the German word “die” and to the English word “die”?
- Why does the Transformer require positional encodings? Optional: What do you
think is the point of using sine waves of different frequencies? (A small sketch
of the sinusoidal encoding is included after this list.)
- Yes or No: At each position, the decoder is attending to all previous positions,
i.e. all encoder positions and the previous decoder predictions.
- The decoder uses teacher forcing. Does this mean the time steps can be computed
in parallel? (The look-ahead mask sketch after this list may be a useful hint.)
- Can the encoder time steps be computed in parallel?
- Why is a mask applied to the loss function? (See the masked-loss sketch after
this list.)
- Think about the loss function and metrics used (cross-entropy & accuracy). What
exactly do these measure? Do you think this is a good measure of translation quality?
If not, why not?
- When translating the same sentence multiple times, do you get the same result?
Why (not)? If not, what changes need to be made to get the same result each time?
- Which components would we need to switch out/change if we wanted to use a different pair
of languages?
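To help with the positional-encoding question, here is a small standalone sketch of
the sinusoidal encoding, following the formulation used in the tutorial (the function
name and the example sizes are just illustrative):

import numpy as np

def positional_encoding(length, depth):
    # Sinusoidal positional encoding as in "Attention Is All You Need":
    # the first half of the channels use sine, the second half cosine,
    # with wavelengths forming a geometric progression up to 10000 * 2*pi.
    depth = depth / 2
    positions = np.arange(length)[:, np.newaxis]       # (length, 1)
    depths = np.arange(depth)[np.newaxis, :] / depth   # (1, depth/2)
    angle_rates = 1 / (10000 ** depths)                # (1, depth/2)
    angle_rads = positions * angle_rates               # (length, depth/2)
    return np.concatenate([np.sin(angle_rads), np.cos(angle_rads)], axis=-1)

# Every position gets a unique pattern of slow and fast waves; nearby positions
# get similar patterns, which gives the attention layers a notion of token order.
pe = positional_encoding(length=2048, depth=512)
print(pe.shape)  # (2048, 512)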
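For the teacher-forcing and parallelism questions, the key ingredient is the causal
(look-ahead) mask. The tutorial obtains it via use_causal_mask=True in the Keras
attention layer; the sketch below only illustrates what that mask looks like:

import tensorflow as tf

def look_ahead_mask(size):
    # Lower-triangular matrix: position i may attend to positions 0..i only.
    return tf.linalg.band_part(tf.ones((size, size)), -1, 0)

print(look_ahead_mask(4).numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]

# Because the ordering is enforced by this mask (and not by computing the steps
# one after another), the whole target sequence can be fed in at once during
# training with teacher forcing.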
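And for the question about masking the loss, here is a sketch of a padding-masked
cross-entropy along the lines of the tutorial's masked_loss (details such as padding
ID 0 are assumptions here; check them against your tokenizer):

import tensorflow as tf

def masked_loss(label, pred):
    # Sparse cross-entropy per token, without averaging yet.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')
    loss = loss_fn(label, pred)               # shape (batch, target_len)

    mask = tf.cast(label != 0, loss.dtype)    # 1 for real tokens, 0 for padding
    loss *= mask                              # padded positions contribute nothing

    # Average over the real tokens only, not over the padding.
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)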
Hand in all of your code, i.e. the working tutorial code along with all changes/additions
you made. Include outputs that document some of your experiments. Also remember
to answer the questions above! Of course, you can also write about other observations
you made.