For an overview of the topic, start with the blog post “Attention and Memory in Deep Learning and NLP” by Denny Britz and the distill.pub article “Attention and Augmented Recurrent Neural Networks” by Olah & Carter.
Continue with the following articles, which provide details on two recently proposed approaches:
The blog post “The Illustrated Transformer” by Jay Alammar.
Dehghani et al., “Universal Transformers”. This follow-up to the original Transformer paper (see further reading below) is much more accessible and also incorporates the idea of adaptive computation time.
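All of these readings build on the same core operation: scaled dot-product attention from the original Transformer paper. As a minimal, self-contained sketch (not taken from any of the linked posts; shapes and names are illustrative), it can be written in a few lines of NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017).

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Each output row is a weighted average of the value rows,
    weighted by the similarity of the query to each key.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarities
    weights = softmax(scores, axis=-1)   # rows sum to 1: attention distribution
    return weights @ V                   # (n_queries, d_v)

# Toy example with random vectors; dimensions are arbitrary.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # 2 queries
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 16))  # 5 values
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 16)
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with vanishing gradients; the readings above explain the full multi-head variant built on this primitive.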