GradScaler

GradScaler is a utility used in automatic mixed precision (AMP) training, which improves performance and reduces memory usage on modern GPUs. It mitigates numeric underflow when using float16 by dynamically scaling the loss, and therefore the gradients, during backpropagation, helping preserve small gradient values that might otherwise vanish.

The core idea is to multiply the loss by a scale factor before backpropagation. After backward, gradients are unscaled before the optimizer update to ensure the update uses the correct magnitude. If any gradient contains Inf or NaN, the step is skipped and the scale factor is reduced; if training proceeds without overflow for a stretch of steps, the scale factor is increased. This dynamic loss scaling is designed to be transparent to the user while maintaining numerical stability.
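
The same mechanics can be written out by hand, which makes the behavior easier to see. The following sketch performs dynamic loss scaling without GradScaler; the function arguments and the growth and backoff constants are illustrative placeholders, not the library's exact defaults.

    # Hand-written dynamic loss scaling, shown without GradScaler so the
    # mechanics are visible. Argument names and constants are placeholders.
    import torch

    def train_with_manual_loss_scaling(model, optimizer, loss_fn, data_loader,
                                       init_scale=2.0 ** 16, growth_factor=2.0,
                                       backoff_factor=0.5, growth_interval=2000):
        scale = init_scale
        clean_steps = 0
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)

            # Scale the loss so small float16 gradients do not underflow to zero.
            (loss * scale).backward()

            # Unscale gradients in place so the update uses the true magnitudes.
            found_inf = False
            for p in model.parameters():
                if p.grad is not None:
                    p.grad.div_(scale)
                    if not torch.isfinite(p.grad).all():
                        found_inf = True

            if found_inf:
                # Overflow: skip this step and reduce the scale factor.
                scale *= backoff_factor
                clean_steps = 0
            else:
                optimizer.step()
                clean_steps += 1
                if clean_steps % growth_interval == 0:
                    # A stretch of overflow-free steps: grow the scale factor.
                    scale *= growth_factor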

In PyTorch, GradScaler is available under torch.cuda.amp. A typical workflow involves creating a GradScaler instance, performing the forward pass inside autocast, then calling scaler.scale(loss).backward(), followed by scaler.step(optimizer) and scaler.update(). Users may optionally unscale gradients before applying clipping. The scaler handles the detection of non-finite gradients and adjusts the scale accordingly, enabling more stable training with mixed precision.
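
Concretely, that workflow might look like the sketch below, assuming a CUDA-capable setup and placeholder names for the model, optimizer, loss function and data loader; the gradient-clipping step illustrates the optional unscaling mentioned above.

    # A sketch of the GradScaler workflow; model, optimizer, loss_fn and
    # data_loader are placeholder names supplied by the caller.
    import torch
    from torch.cuda.amp import GradScaler, autocast

    def train_with_gradscaler(model, optimizer, loss_fn, data_loader, clip_norm=1.0):
        scaler = GradScaler()
        for inputs, targets in data_loader:
            optimizer.zero_grad()

            # Forward pass runs in mixed precision inside autocast.
            with autocast():
                loss = loss_fn(model(inputs), targets)

            # Backward pass on the scaled loss produces scaled gradients.
            scaler.scale(loss).backward()

            # Optional: unscale first so clipping sees the true gradient norms.
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)

            # step() skips the update if non-finite gradients are found;
            # update() then adjusts the scale factor for the next iteration.
            scaler.step(optimizer)
            scaler.update()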

Benefits include improved memory efficiency and potential speedups on compatible GPUs, along with reduced risk of gradient underflow when training large models. Limitations include the need for hardware and operator support for AMP, occasional overhead from scaling logic, and potential complications with certain custom operations. GradScaler is widely used in conjunction with automatic mixed precision to facilitate efficient neural network training.
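
Because AMP needs hardware and operator support, one common pattern is to make both the scaler and autocast conditional, so the same training loop falls back to full precision when AMP is unavailable. The use_amp flag below is an assumed example, not part of GradScaler itself.

    # Falling back to plain float32 when AMP is not available; use_amp is an
    # assumed flag and the other names are placeholders supplied by the caller.
    import torch
    from torch.cuda.amp import GradScaler, autocast

    def train_maybe_amp(model, optimizer, loss_fn, data_loader, use_amp=None):
        if use_amp is None:
            use_amp = torch.cuda.is_available()  # e.g. disable AMP on CPU-only machines

        # A disabled scaler simply passes losses and optimizer steps through unchanged.
        scaler = GradScaler(enabled=use_amp)
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            with autocast(enabled=use_amp):      # plain float32 when disabled
                loss = loss_fn(model(inputs), targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()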
