In AI, attention mechanisms based on the transformer architecture are everywhere: from language models to musical composition (listen to OpenAI’s MuseNet) to game-playing systems like AlphaStar.
Although transformers are a big improvement over the previously dominant recurrent neural networks, they still struggle with long-range dependencies. Ideally, you want your system to take into account that a fact from page 1 of a novel is relevant to the whodunnit moment.
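Part of the reason is cost: in a standard transformer every position attends to every other position, so the attention matrix (and therefore memory and compute) grows quadratically with context length. Here is a minimal NumPy sketch of scaled dot-product attention to make that concrete; the names and shapes are illustrative, not taken from any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_model)."""
    d = Q.shape[-1]
    # The score matrix is (seq_len, seq_len): every position attends to
    # every other, which is why doubling the context quadruples the cost.
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

seq_len, d_model = 1024, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))
V = rng.standard_normal((seq_len, d_model))
out = scaled_dot_product_attention(Q, K, V)  # shape (1024, 64)
```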
Now, the context size over which attention can be applied is increasing:
Sparse Transformers: Up to 65,000 tokens (the sparse-attention idea is sketched after this list)
Facebook paper: 30 billion
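Sparse Transformers reach those longer contexts by restricting which positions each token attends to instead of attending to everything. The sketch below is only a rough illustration of that idea, combining a causal mask, a local window, and strided “summary” positions; the window and stride values are arbitrary choices for the example, not numbers from the paper.

```python
import numpy as np

def sparse_attention_mask(seq_len, local_window=128, stride=128):
    """Boolean mask: mask[i, j] is True if position i may attend to position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i                            # no attending to the future
    local = (i - j) < local_window             # recent positions
    strided = (j % stride) == (stride - 1)     # periodic long-range positions
    return causal & (local | strided)

n = 4096
mask = sparse_attention_mask(n)
dense_pairs = n * (n + 1) // 2                 # full causal attention
sparse_pairs = int(mask.sum())
print(sparse_pairs / dense_pairs)              # fraction of the dense cost
```

With a local window plus a stride on the order of the square root of the sequence length, the number of attended pairs grows roughly like n·√n rather than n², which is what makes much longer contexts affordable.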
We’ll see what the actual effects will be. I remember big excitement about the Outrageously Large Neural Networks paper but, to my knowledge, it hasn’t had nearly as much uptake as expected.