In AI, attention mechanisms based on the transformer architecture are everywhere: from language models to musical composition (listen to OpenAI’s MuseNet) to game-playing systems like AlphaStar.

Although transformers are a big improvement over the recurrent neural networks that preceded them, they still struggle with long-range dependencies. Ideally, you want your system to take into account that a fact from page 1 of a novel is relevant to the whodunnit moment.

Now, the context size (in tokens) over which attention can be applied is increasing:

- GPT-2: 1,024

- MuseNet: 4,096

- Sparse Transformers: up to 65,000

- Facebook paper: 30 billion
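To see why context size is hard to grow, it helps to look at the cost of attention itself. Below is a minimal NumPy sketch (my own illustration, not code from any of the papers above): dense self-attention builds an n × n score matrix, so memory grows quadratically in sequence length, while a strided sparsity pattern in the spirit of Sparse Transformers lets each position attend only to a local window plus every stride-th "summary" position. The function names and the specific mask are assumptions for illustration.

```python
import numpy as np

def attention(q, k, v, mask):
    # Scaled dot-product attention over an n x n score matrix.
    # With a dense (all-True) mask, memory grows as O(n^2) in the
    # sequence length n -- the bottleneck that limits context size.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)  # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def strided_mask(n, stride):
    # Illustrative factorized pattern: position i attends to its local
    # window of size `stride` plus every stride-th position, cutting
    # attended pairs from O(n^2) toward O(n * sqrt(n)) when
    # stride ~ sqrt(n). (A simplification of the papers' patterns.)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride
    summary = (j % stride) == (stride - 1)
    return causal & (local | summary)

n, d, stride = 16, 8, 4
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
dense_out = attention(q, k, v, np.tril(np.ones((n, n), dtype=bool)))
sparse_out = attention(q, k, v, strided_mask(n, stride))
print(dense_out.shape, strided_mask(n, stride).sum(), "of", n * n, "pairs")
```

Both variants here still materialize the full score matrix, so the toy code only saves attended pairs, not memory; the real implementations never build the dense matrix at all, which is what makes contexts in the tens of thousands feasible.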

We’ll see what the actual effects will be. I remember big excitement about the Outrageously Large Neural Networks paper, but to my knowledge it hasn’t had nearly as much uptake as expected.


Generative Modeling with Sparse Transformers

Outrageously Large Neural Networks
