tdistributed
tdistributed is a distributed training framework designed to coordinate scalable machine learning workloads across multi-node, multi-GPU clusters. It provides mechanisms for data parallelism, model parallelism, and pipeline parallelism, and supports multiple communication backends including NCCL, MPI, and Gloo. The framework offers both synchronous and asynchronous execution models, along with elastic scaling and fault tolerance to accommodate dynamic clusters.
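Since tdistributed's own API is not shown here, the following minimal sketch illustrates the underlying concept of synchronous data parallelism over the NCCL backend using PyTorch Distributed (listed under "See also"); it is an analogous example, not tdistributed's interface. It assumes a launch via torchrun so that the rank and world size are provided through the environment.

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; the NCCL backend handles cross-GPU collectives.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    # Wrapping the model in DDP makes backward() all-reduce gradients
    # across all workers, keeping replicas in sync (synchronous data parallelism).
    model = DDP(torch.nn.Linear(128, 10).to(device), device_ids=[device.index])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 128, device=device)      # stand-in for a data shard
        y = torch.randint(0, 10, (32,), device=device)
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()       # gradient all-reduce happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

A run of this script would typically be started with torchrun --nproc_per_node=<gpus per node> on each node.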
Architecture and operation: A central scheduler or controller assigns work to a pool of worker processes that execute training steps on their assigned devices and synchronize results over the configured communication backend.
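The controller/worker pattern described above can be sketched as follows. This is a purely illustrative, self-contained Python example of a scheduler handing shard assignments to worker processes; the WorkItem structure and round-robin assignment policy are assumptions, not tdistributed's actual scheduling logic.

from dataclasses import dataclass
from multiprocessing import Process, Queue

@dataclass
class WorkItem:
    worker_rank: int   # which worker should run this step
    shard_id: int      # which data shard to train on

def worker(rank: int, tasks: Queue) -> None:
    while True:
        item = tasks.get()
        if item is None:   # sentinel: controller tells the worker to stop
            break
        # A real worker would run a training step on its assigned devices here.
        print(f"worker {rank} training on shard {item.shard_id}")

def controller(num_workers: int, num_shards: int) -> None:
    queues = [Queue() for _ in range(num_workers)]
    procs = [Process(target=worker, args=(r, queues[r])) for r in range(num_workers)]
    for p in procs:
        p.start()
    # Round-robin assignment of data shards to workers.
    for shard in range(num_shards):
        queues[shard % num_workers].put(WorkItem(shard % num_workers, shard))
    for q in queues:
        q.put(None)        # shut each worker down
    for p in procs:
        p.join()

if __name__ == "__main__":
    controller(num_workers=2, num_shards=6)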
Features: Elastic resource management that permits nodes to join or leave the training run; automatic device placement across available GPUs; a choice of synchronous or asynchronous execution; and fault tolerance for recovering from worker failures (a device-placement sketch follows).
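As a small illustration of the automatic device placement idea, the sketch below maps a local worker rank to a GPU and falls back to CPU when no GPU is visible. The LOCAL_RANK environment variable follows the torchrun convention and is an assumption, not tdistributed's documented interface.

import os
import torch

def pick_device() -> torch.device:
    # Each co-located worker gets its own GPU based on its local rank.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if torch.cuda.is_available():
        return torch.device(f"cuda:{local_rank % torch.cuda.device_count()}")
    return torch.device("cpu")

print(pick_device())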
Usage and ecosystem: Typical usage involves defining a cluster specification, choosing a parallelism strategy, and launching worker processes across the nodes of the cluster, as sketched below.
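The following is a purely hypothetical sketch of that workflow; the ClusterSpec fields and the launch() helper are illustrative assumptions rather than tdistributed's real API, and the launcher body only shows the call shape instead of spawning remote processes.

from dataclasses import dataclass, field

@dataclass
class ClusterSpec:
    hosts: list[str] = field(default_factory=lambda: ["node0:29500", "node1:29500"])
    gpus_per_host: int = 8
    backend: str = "nccl"              # or "mpi" / "gloo"
    strategy: str = "data_parallel"    # or "model_parallel" / "pipeline_parallel"

def launch(spec: ClusterSpec, train_fn) -> None:
    world_size = len(spec.hosts) * spec.gpus_per_host
    print(f"launching {world_size} workers with {spec.strategy} over {spec.backend}")
    # A real launcher would start one process per GPU on every host and wire them
    # into a communication group; here we call train_fn per rank to show the shape.
    for rank in range(world_size):
        train_fn(rank)

def train_step(rank: int) -> None:
    pass  # per-worker training loop goes here

launch(ClusterSpec(), train_step)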
See also: Horovod, Ray, TensorFlow tf.distribute, PyTorch Distributed, Dask.