1. Transformers

1.1 Self-Attention

$$ MultiHeadAttention(Q,K,V)=Concat(head_1,\dots,head_h)W^O, \quad head_i=Attention(QW^Q_i,KW^K_i,VW^V_i) $$

$$ Attention(Q, K, V)=softmax(\frac{Q\cdot K^T}{\sqrt{d_k}})V $$
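
A minimal NumPy sketch of these two formulas. The toy sizes (4 tokens, model dimension 16, 2 heads of dimension 8) and the random projection matrices are illustrative assumptions, not values from these notes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Concat_i( Attention(Q W_q[i], K W_k[i], V W_v[i]) ) W_o"""
    heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(len(W_q))]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy self-attention example: the same input X is used as Q, K, and V.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 4, 16, 2, 8      # assumed toy dimensions
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
print(out.shape)   # (4, 16): one d_model-dimensional output per token
```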

1.2 Positional Encoding

Target: take the order of the words into consideration (together with the meaning of the words).

pos: the position of the word in the sentence;
i: the feature (dimension) index of the positional embedding;

[Figure: heatmap of the positional encodings — x-axis: feature index; y-axis: word position; each row is the positional embedding of the word at position pos.]
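
The notes define pos and i but do not spell out the encoding itself. Below is a minimal NumPy sketch assuming the sinusoidal encoding from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sizes max_len = 50 and d_model = 128 are arbitrary choices for illustration.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding:
       PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]          # word position (y-axis in the figure)
    i = np.arange(0, d_model, 2)[None, :]      # even feature indices (x-axis in the figure)
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)                # odd dimensions get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=128)
print(pe.shape)   # (50, 128): row `pos` is the encoding added to the word at position pos
```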

1.3 Layer Normalization (normalize each sample in a layer across all features)

Reference: Batch and Layer Normalization (Pinecone)

(1) Batch normalization normalizes each feature independently across the mini-batch. Layer normalization normalizes each of the inputs in the batch independently across all features (see the sketch after this list).

(2) As batch normalization is dependent on batch size, it’s not effective for small batch sizes. Layer normalization is independent of the batch size, so it can be applied to batches with smaller sizes as well.

(3) Batch normalization requires different processing at training and inference times. As layer normalization is done along the length of the input to a specific layer, the same set of operations can be used at both training and inference times.
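
A minimal NumPy sketch contrasting the two normalization axes described in (1). The input shape (4, 5) is an arbitrary assumption, and the learnable scale/shift parameters (gamma, beta) are omitted.

```python
import numpy as np

# x: a mini-batch of shape (batch_size, num_features); values are made up for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 5))
eps = 1e-5

# Batch norm: normalize each feature (column) across the batch dimension.
bn = (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)

# Layer norm: normalize each sample (row) across all of its features.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0).round(6))  # ~0 per feature -> batch norm statistics depend on the batch
print(ln.mean(axis=1).round(6))  # ~0 per sample  -> layer norm works even with batch size 1
```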

1.4 Architecture