tf.distribute

tf.distribute is a TensorFlow module that provides distributed training APIs to scale machine learning workloads across multiple devices and machines. It defines a family of distribution strategies that enable data parallelism and, in some cases, model parallelism, while handling variable synchronization and input distribution in a unified way.

Key strategies include OneDeviceStrategy, MirroredStrategy, MultiWorkerMirroredStrategy, and TPUStrategy. OneDeviceStrategy runs on a single device and is useful for development or debugging. MirroredStrategy replicates the model on multiple GPUs on one machine and performs synchronous gradient updates. MultiWorkerMirroredStrategy extends this approach across multiple machines. TPUStrategy targets training on Tensor Processing Units. Depending on the TensorFlow version, additional strategies such as CentralStorageStrategy or ParameterServerStrategy may be available for specialized setups.
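
As a minimal sketch, constructing a strategy is a one-liner; the choice of MirroredStrategy and the "/cpu:0" device string below are illustrative, not required:

```python
import tensorflow as tf

# Mirror variables across all GPUs visible on this machine; with one GPU
# (or CPU only) this still works and simply runs a single replica.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# For development or debugging, pin everything to a single device instead.
debug_strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
```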

Usage typically involves creating a strategy and entering strategy.scope() to build your model and related assets. A strategy defines a scope within which variables and computations are placed on the appropriate devices. For Keras, training can be performed inside the scope using the standard compile and fit workflow, as sketched below.
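
A hedged sketch of that workflow; the two-layer model, input shape, and random training data are invented for illustration:

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Variables created inside the scope are mirrored across replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# fit() splits each batch across the replicas automatically.
x = np.random.rand(256, 10).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(x, y, batch_size=32, epochs=2)
```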

For custom training loops, per-replica computations are executed via strategy.run, and results are aggregated with strategy.reduce. The API also provides strategy.experimental_distribute_dataset to distribute input pipelines across replicas.
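
A minimal custom-loop sketch along those lines; the one-layer model, synthetic data, and GLOBAL_BATCH_SIZE constant are assumptions made for the example:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 64

# Each replica receives a slice of every global batch.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([512, 10]), tf.random.normal([512, 1]))
).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.Input(shape=(10,)),
                                 tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def train_step(inputs):
    x, y = inputs
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        # Divide by the global batch size so that summing the per-replica
        # losses yields the correct overall mean loss.
        loss = tf.reduce_sum(tf.square(y - pred)) / GLOBAL_BATCH_SIZE
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_step(inputs):
    per_replica_loss = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_loss, axis=None)

for batch in dist_dataset:
    loss = distributed_step(batch)
```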

tf.distribute is designed to work with the TensorFlow 2.x eager execution model and integrates with Keras, tf.data, and custom training code. Effective scaling requires appropriate batch sizes per replica and careful tuning of learning rates and data pipelines. The availability and behavior of certain strategies depend on hardware, TensorFlow version, and cluster configuration.
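
One common heuristic, sketched below under the assumption of synchronous data parallelism, is to fix the per-replica batch size and scale the global batch size with the replica count; the linear learning-rate scaling shown is a rule of thumb, not a guarantee:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Keep the per-replica batch size constant; the global batch grows with
# the number of replicas, so each device sees a same-sized slice.
PER_REPLICA_BATCH_SIZE = 32
global_batch_size = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

# Linear learning-rate scaling is a common starting point when the
# effective batch size grows (an assumption, not a universal rule).
learning_rate = 1e-3 * strategy.num_replicas_in_sync

dataset = (tf.data.Dataset.range(10_000)
           .batch(global_batch_size)
           .prefetch(tf.data.AUTOTUNE))
```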

Historically, tf.distribute was introduced to unify distributed training in TensorFlow 2.x, building on earlier experimental APIs and TF 1.x concepts, and is maintained as part of the core TensorFlow library.