Attention is all you need

Attention 101

Let's start by taking a look at what attention is and how to calculate it.

What is attention

Basically, attention is a technique that allows a model to assign importance scores to different elements of its input. This allows the model to focus on the most relevant information for the task.

How it works

Attention works with three key components: Query, Key, and Value.

The output of attention is computed as a weighted sum of the values, where the weight assigned to each value is a function of the query and the corresponding key:

attention = softmax(Query * Key^T) * Value

Basically, the weight is the softmax of the dot product between the Query and the Key. The dot product measures the similarity between the Query and the Key, and the softmax turns the similarity scores into weights between 0 and 1 that sum to 1.
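To make the recipe concrete, here is a minimal NumPy sketch of this basic (unscaled) attention. The toy shapes and random inputs are illustrative assumptions, not part of the paper's example:

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def basic_attention(Q, K, V):
    # Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    scores = Q @ K.T           # dot-product similarity of each query with each key
    weights = softmax(scores)  # each row of weights sums to 1
    return weights @ V         # weighted sum of the values

# Toy usage: 3 tokens with made-up 4-dimensional embeddings, used as Q, K, and V.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
print(basic_attention(X, X, X).shape)  # (3, 4)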

Walk through a toy example

Let's walk through a toy example of how to calculate basic attention with text as input (Fig. 1):

Figure 1. A toy example of calculating basic attention.

Positional encoding & Scaled attention

There are a few improvements that the "Attention is all you need" paper contributed to the attention mechanism. We will talk about positional encoding and scaled attention in this section. Both are highlighted in red in Fig. 2.

Positional encoding

Unlike RNNs and CNNs, which take the order of the sequence into consideration, the transformer architecture has no notion of token position by itself. However, positional information is usually important to the task. To make this information available to the transformer, the authors proposed the "positional encoding" idea: a vector built from sine and cosine functions of different frequencies is added to each token embedding, so that every position gets a distinct, deterministic signature:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
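A minimal NumPy sketch of this sinusoidal encoding; the sequence length and model dimension in the usage example are made-up values:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine, with wavelengths
    # forming a geometric progression controlled by the 10000 constant.
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply added to the token embeddings.
embeddings = np.zeros((10, 16))          # 10 tokens, d_model = 16 (made-up sizes)
inputs = embeddings + sinusoidal_positional_encoding(10, 16)
print(inputs.shape)                      # (10, 16)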

Scaled attention

The authors added a scaling factor of 1 / sqrt(d_k) to the dot-product attention. When d_k is large, the dot products grow large in magnitude and push the softmax into regions with extremely small gradients, so the scores are scaled before the softmax:

attention = softmax(Query * Key^T / sqrt(d_k)) * Value
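Relative to the basic attention sketched earlier, the only change is dividing the dot products by sqrt(d_k) before the softmax. A minimal NumPy sketch, with made-up shapes in the toy usage:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with d_k the query/key dimension.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 3 queries/keys of dimension 64 (made-up sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 64))
print(scaled_dot_product_attention(X, X, X).shape)  # (3, 64)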


Figure 2. Improvements to basic attention: positional encoding and scaled attention.

Multi-head attention

Why multi-head attention

Instead of performing a single attention function, the paper projects the queries, keys, and values multiple times with different learned projections and runs attention on each projection in parallel. This lets the model jointly attend to information from different representation subspaces at different positions, which a single attention head would tend to average away.

Walk through a toy example with 4 heads

Figure 3. A toy example of multi-head attention with positional encoding and scaled attention.
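As a companion to the walkthrough, here is a minimal NumPy sketch of multi-head self-attention with 4 heads, including the 1 / sqrt(d) scaling. The model dimension, sequence length, and random projection matrices are illustrative assumptions, not the values used in the figure:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=4):
    # X: (seq_len, d_model); W_q/W_k/W_v/W_o: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then split the last dimension into n_heads heads.
    Q = (X @ W_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention within each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    heads = softmax(scores) @ V                           # (n_heads, seq, d_head)
    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage: 3 tokens, d_model = 8, 4 heads (all sizes are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))
W_q, W_k, W_v, W_o = [rng.normal(size=(8, 8)) for _ in range(4)]
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=4).shape)  # (3, 8)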

Encoder/Decoder: Attention is not all you need

Encoder

The encoder is composed of a stack of N = 6 identical layers.

Basic block for encoder

Each encoder layer is composed of two sub-layers: a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer. Each sub-layer is wrapped with a residual connection followed by layer normalization.
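A rough sketch of one encoder layer under the description above, assuming a simplified layer normalization without learned parameters, randomly initialized feed-forward weights, and the self-attention passed in as a callable (in practice, the multi-head attention sketched earlier):

import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified layer normalization without learned gain/bias.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: two linear layers with a ReLU.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, self_attention, ffn_params):
    # Sub-layer 1: multi-head self-attention + residual + layer norm.
    x = layer_norm(x + self_attention(x))
    # Sub-layer 2: position-wise feed-forward + residual + layer norm.
    x = layer_norm(x + feed_forward(x, *ffn_params))
    return x

# Toy usage with made-up sizes; the identity lambda is a stand-in for
# the multi-head self-attention sketched in the previous section.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 3
ffn_params = (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
              rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
x = rng.normal(size=(seq_len, d_model))
print(encoder_layer(x, self_attention=lambda h: h, ffn_params=ffn_params).shape)  # (3, 8)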

Decoder

The decoder is also composed of a stack of N = 6 identical layers.

Basic blocks for decoder

Each decoder layer is composed of three sub-layers: a masked multi-head self-attention sub-layer (the mask prevents a position from attending to subsequent positions), an encoder-decoder attention sub-layer whose keys and values come from the encoder output, and a position-wise feed-forward sub-layer. As in the encoder, each sub-layer is wrapped with a residual connection followed by layer normalization.
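The distinctive piece is the mask in the decoder's self-attention. A minimal sketch, in which scores for future positions are set to a large negative number before the softmax so that they receive (near) zero weight; the shapes below are made-up example values:

import numpy as np

def causal_mask(seq_len):
    # Upper-triangular mask: True where a query would attend to a future key.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_scaled_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(Q.shape[0]), -1e9, scores)  # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 4 tokens with 8-dimensional embeddings (made-up sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
print(masked_scaled_attention(X, X, X).shape)  # (4, 8)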

Figure 4. The transformer model architecture.