NormFormer

NormFormer is a family of transformer variants that seeks to improve training stability and performance by rethinking the role of normalization in Transformer blocks. Conventional transformers apply layer normalization only at fixed points in each block, typically at the entrance or exit of each sublayer (the pre-LN and post-LN configurations). NormFormer introduces a more extensive normalization strategy that operates at additional points in the block, including within the multi-head self-attention sublayer and the feed-forward sublayer, and it couples normalization with learnable gain parameters. The approach may also involve normalizing attention logits or residual pathways to reduce internal covariate shift during training.

Key design elements include: per-subblock normalization, learnable scaling after normalization, and optional normalization of attention logits.

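The sketch below illustrates these elements in PyTorch, assuming a pre-LN backbone: it adds a LayerNorm to the attention output, another LayerNorm after the first feed-forward projection, and a learnable per-head gain applied after normalization. The class and parameter names (NormFormerBlock, head_scale, and the toy dimensions) are illustrative assumptions rather than a reference implementation, and the optional attention-logit normalization is omitted.

```python
import torch
import torch.nn as nn


class NormFormerBlock(nn.Module):
    """Illustrative NormFormer-style block: pre-LN backbone plus extra
    normalization inside each sublayer and learnable scaling after it."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.n_heads = n_heads
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)

        # Standard pre-LN norms at the entrance of each sublayer.
        self.pre_attn_norm = nn.LayerNorm(d_model)
        self.pre_ffn_norm = nn.LayerNorm(d_model)

        # Extra per-subblock norms: one on the attention output,
        # one after the first feed-forward projection.
        self.post_attn_norm = nn.LayerNorm(d_model)
        self.mid_ffn_norm = nn.LayerNorm(d_ff)

        # Learnable per-head gain applied after normalization (illustrative placement).
        self.head_scale = nn.Parameter(torch.ones(n_heads))

        self.ffn_in = nn.Linear(d_model, d_ff)
        self.ffn_out = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sublayer: pre-norm, attend, re-normalize the output,
        # then scale each head-sized chunk by its learned gain.
        h = self.pre_attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        attn_out = self.post_attn_norm(attn_out)
        b, t, d = attn_out.shape
        attn_out = (attn_out.view(b, t, self.n_heads, d // self.n_heads)
                    * self.head_scale.view(1, 1, -1, 1)).reshape(b, t, d)
        x = x + self.dropout(attn_out)

        # Feed-forward sublayer: pre-norm, project up, activate,
        # normalize again, then project back down.
        h = self.pre_ffn_norm(x)
        h = self.mid_ffn_norm(self.act(self.ffn_in(h)))
        x = x + self.dropout(self.ffn_out(h))
        return x


# Toy usage: (batch, sequence, d_model) in, same shape out.
block = NormFormerBlock(d_model=256, n_heads=8, d_ff=1024)
y = block(torch.randn(2, 16, 256))
print(y.shape)  # torch.Size([2, 16, 256])
```
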
These changes aim to stabilize gradient flow, mitigate training fragility at large scale, and allow for faster convergence without sacrificing representational capacity. In practice, NormFormer can be integrated with standard pre- or post-LN configurations and is compatible with common optimization setups.

Empirical results reported in the literature suggest that NormFormer can achieve faster convergence and competitive or improved accuracy on a range of NLP and vision tasks compared with baseline transformers. The method may incur modest computational overhead due to additional normalization components and slightly increased parameter count, but is designed to be broadly compatible with existing training pipelines.
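
As a rough illustration of the parameter overhead, the snippet below counts the extra parameters implied by the sketch above: two additional LayerNorms per layer (a gain and a bias each, over d_model and d_ff respectively) plus one scale per head. The dimensions are illustrative assumptions, not taken from any reported configuration.

```python
# Back-of-the-envelope estimate of the extra parameters per layer,
# based on the illustrative sketch above (not a reported configuration).
d_model, d_ff, n_heads = 1024, 4096, 16

extra = 2 * d_model + 2 * d_ff + n_heads                # extra LayerNorm gains/biases + head scales
baseline = 4 * d_model * d_model + 2 * d_model * d_ff   # attention and FFN weight matrices only

print(f"extra parameters per layer: {extra:,}")         # 10,256
print(f"relative increase: {extra / baseline:.3%}")     # ~0.082%
```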