A well-known explanation of the success of LayerNorm is its re-centering and re-scaling invariance property.

Re-centering enables the model to be insensitive to shift noise on both inputs and weights, while re-scaling keeps the output representations intact when both inputs and weights are randomly scaled.
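To make these two invariances concrete, here is a minimal NumPy sketch (illustrative only; the learned gain and bias are omitted) showing that LayerNorm's output is unchanged when its input is shifted by a constant or multiplied by a positive factor:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each row over the feature dimension (gain/bias omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.random.randn(4, 8)
print(np.allclose(layer_norm(x), layer_norm(x + 3.0), atol=1e-5))  # True: re-centering invariant
print(np.allclose(layer_norm(x), layer_norm(5.0 * x), atol=1e-5))  # True: re-scaling invariant
```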

In this paper, we hypothesize that re-scaling invariance, rather than re-centering invariance, is the reason for the success of LayerNorm.

We propose RMSNorm, which focuses only on re-scaling invariance and regularizes the summed inputs simply according to the root mean square (RMS) statistic.
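As a sketch of the idea (the function below is a simplified, assumed interface, not the paper's reference code), RMSNorm drops the mean subtraction and divides the summed inputs only by their RMS:

```python
import numpy as np

def rms_norm(x, gain=None, eps=1e-6):
    # Divide by the root mean square over the feature dimension; no re-centering.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    out = x / rms
    return out if gain is None else out * gain

x = np.random.randn(4, 8)
print(np.allclose(rms_norm(x), rms_norm(5.0 * x), atol=1e-5))  # True: re-scaling invariant
print(np.allclose(rms_norm(x), rms_norm(x + 3.0), atol=1e-5))  # False: not re-centering invariant
```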

BatchNorm, LayerNorm, and RMSNorm are normalization methods commonly used in deep learning. They both differ from and relate to one another in their purpose, typical use cases, formulas, strengths and weaknesses, and design goals; a short sketch contrasting them follows the list below.

  1. BatchNorm (Batch Normalization)
  2. LayerNorm (Layer Normalization)
  3. RMSNorm (Root Mean Square Normalization)
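Before going through each method in detail, the following minimal NumPy sketch contrasts the three on a toy (batch, features) activation matrix. Learned gain/bias parameters, BatchNorm's running statistics, and training/inference differences are all omitted, so treat it as an illustration of which axes each method normalizes over rather than a faithful implementation:

```python
import numpy as np

x = np.random.randn(32, 64)  # toy activations: (batch, features)
eps = 1e-6

# BatchNorm: per-feature statistics, computed across the batch dimension (axis 0).
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm: per-sample statistics, computed across the feature dimension (axis 1).
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

# RMSNorm: per-sample re-scaling only; no mean subtraction.
rn = x / np.sqrt(np.mean(x ** 2, axis=1, keepdims=True) + eps)
```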