For an overview of the topic, start with the blog post “Attention? Attention!” by Lilian Weng and the distill.pub article “Attention and Augmented Recurrent Neural Networks” by Olah & Carter.
For a detailed visual explanation of Transformers, visit the blog post “The Illustrated Transformer” by Jay Alammar.
Lilian Weng provides a very comprehensive overview of architectures derived from the original Transformer in the blog post “The Transformer Family”.
Alternatively, you can read Section 11.4 of the book “Deep Learning with Python, Second Edition”.
Optionally, also have a look at Dehghani et al., “Universal Transformers”. This follow-up to the original Transformer paper (see further reading below) is much more accessible and also incorporates the idea of adaptive computation time.
Optionally, if you are more into videos, there is a lecture by Alex Graves on “Attention and Memory in Deep Learning” that covers many of the topics from this week’s reading assignments.
If you would like to dive deeper into this very exciting topic, have a look at the paper Graves et al., “Hybrid computing using a neural network with dynamic external memory”, Nature 538.7626 (2016): pp. 471–476. There is also a video recording of Alex Graves’ talk at NIPS 2016.
There are many exciting directions to explore further: