Horovod
Horovod is an open-source framework for distributed training of deep learning models. Developed at Uber and open-sourced in 2017, it enables scalable data-parallel training across many GPUs and machines with minimal code changes. The core idea is to average gradients across all participating processes using a ring-allreduce algorithm, which overlaps computation with communication to improve efficiency. Horovod supports multiple communication backends, including NVIDIA NCCL for GPU-to-GPU communication, MPI, and Facebook's Gloo, allowing deployment on a single machine with multiple GPUs or on clusters of many machines.
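The following sketch, using Horovod's PyTorch API, illustrates the kind of changes involved; the model, learning rate, and data handling are illustrative placeholders rather than part of Horovod itself.

```python
import torch
import horovod.torch as hvd

# Initialize Horovod; one training process is typically launched per GPU.
hvd.init()

# Pin each process to a single GPU based on its local rank on the host.
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Placeholder model and optimizer; the learning rate is often scaled by the
# number of workers because the effective batch size grows with hvd.size().
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so that gradients are averaged across all processes
# with allreduce before each optimizer step.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state from rank 0 so that all
# workers start from the same model state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

Such a script is usually launched with the horovodrun wrapper (for example, horovodrun -np 4 python train.py, where train.py is the training script), which starts one process per GPU.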
Horovod integrates with major deep learning frameworks such as TensorFlow (including Keras), PyTorch, and Apache MXNet.
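A comparable sketch with the Keras API in TensorFlow; the model architecture and synthetic data below are placeholders used only to keep the example self-contained.

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to one GPU, selected by its local rank.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder model; any Keras model is wrapped the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Wrap the Keras optimizer so gradients are averaged with allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

# Broadcast initial variables from rank 0 so all workers start in sync;
# only rank 0 prints progress to avoid duplicated logs.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
x = np.random.rand(1024, 32).astype('float32')
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=32, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```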
The project is released under the Apache 2.0 license and is maintained by an open-source community under the LF AI & Data Foundation, with contributions from multiple organizations.
Beyond core gradient averaging, Horovod provides optional features such as gradient compression and compatibility with mixed-precision training.
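As one example of such an option, gradient compression is enabled through the same optimizer wrapper; the sketch below uses the PyTorch API with a placeholder model and requests 16-bit compression of the gradient exchange.

```python
import torch
import horovod.torch as hvd

hvd.init()

# Placeholder model and optimizer for illustration.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Cast gradients to 16-bit floats for the allreduce exchange to reduce
# network traffic; hvd.Compression.none leaves gradients uncompressed.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    compression=hvd.Compression.fp16)
```

Mixed-precision training itself is typically provided by the frameworks' own facilities, such as torch.cuda.amp or the Keras mixed-precision API, which can be used together with the wrapped optimizer.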