NormFormer
NormFormer is a transformer variant that seeks to improve training stability and performance by adding extra normalization inside each Transformer block. A standard Pre-LN transformer applies layer normalization only at the input of each sublayer. NormFormer adds further operations per block: a layer normalization applied to the output of the multi-head self-attention sublayer, learnable gains that scale each attention head's output, and a layer normalization applied after the first feed-forward activation. The extra normalization and learnable scales are intended to keep activation and gradient magnitudes more uniform across layers during training.
Key design elements include: a post-attention layer normalization, learnable head-wise scaling of attention outputs, and a layer normalization after the first feed-forward activation; the sketch below shows where these sit in the block.
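The following PyTorch sketch illustrates one way to place these operations in a Pre-LN block. The class name NormFormerBlock, the argument names, and the overall layout are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a NormFormer-style block (illustrative, not reference code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormFormerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Standard Pre-LN norms at the input of each sublayer.
        self.pre_attn_ln = nn.LayerNorm(d_model)
        self.pre_ffn_ln = nn.LayerNorm(d_model)
        # Extra NormFormer-style operations: per-head gains, a LayerNorm on the
        # attention output, and a LayerNorm after the first FFN activation.
        self.head_scale = nn.Parameter(torch.ones(n_heads))
        self.post_attn_ln = nn.LayerNorm(d_model)
        self.ffn_ln = nn.LayerNorm(d_ff)
        # Attention and feed-forward projections.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.ffn_in = nn.Linear(d_model, d_ff)
        self.ffn_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # --- Self-attention sublayer ---
        h = self.pre_attn_ln(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = scores.softmax(dim=-1) @ v              # (b, heads, t, d_head)
        # Head-wise scaling: one learnable gain per attention head.
        attn = attn * self.head_scale.view(1, -1, 1, 1)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        # Post-attention LayerNorm before the residual addition.
        x = x + self.post_attn_ln(self.out_proj(attn))
        # --- Feed-forward sublayer ---
        h = self.pre_ffn_ln(x)
        h = self.ffn_ln(F.gelu(self.ffn_in(h)))        # norm after the first activation
        return x + self.ffn_out(h)
```

In this sketch the per-head gains are applied to each head's output before the output projection, and the post-attention LayerNorm is applied to the projected attention output before it is added back to the residual stream; causal masking, dropout, and other training details are omitted for brevity.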
Empirical results reported in the literature suggest that NormFormer can achieve faster convergence and competitive or improved performance relative to comparable Pre-LN baselines.