For an overview of the topic, start with the blog post “Attention? Attention!” by Lilian Weng and the distill.pub article “Attention and Augmented Recurrent Neural Networks” by Olah & Carter.
For a detailed visual explanation of Transformers, visit the blog post “The Illustrated Transformer” by Jay Alammar.
Lilian Weng provides a very comprehensive overview of architectures derived from the original Transformer in the blog post “The Transformer Family”.
Alternatively, you can read Section 11.4 of the book “Deep Learning with Python, Second Edition”.
Optionally, also have a look at Dehghani et al., “Universal Transformers”. This follow-up to the original Transformer paper (see further reading below) is much more accessible and also incorporates the idea of adaptive computation time.
Optionally, if you are more into videos, there is a lecture by Alex Graves on “Attention and Memory in Deep Learning” that covers many of the topics from this week’s reading assignments.
If you would like to dive deeper into this very exciting topic, have a look at the paper Graves et al., “Hybrid computing using a neural network with dynamic external memory”, Nature 538.7626 (2016): pp. 471–476. There is also a video recording of Alex Graves’ talk at NIPS 2016.
There are many exciting directions to explore further: