arXiv:2502.06728

DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes

Published on Feb 10, 2025

AI-generated summary

FlexDeMo, a hybrid sharded data parallel training strategy, reduces inter-node communication by synchronizing only the fast-moving components of the gradients, matching the validation loss of full gradient synchronization while training substantially faster.

Abstract

Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to exchange only the fast-moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, DeMo assumes that models fit on a single accelerator. We relax this assumption and introduce FlexDeMo, whereby nodes fully shard model parameters locally between different accelerators, while inter-node communication is reduced by synchronizing only fast-moving components instead of the full gradients, resulting in a hybrid sharded data parallel training strategy. We further introduce a framework, denoted DeToNATION, that generalizes DeMo, FlexDeMo, and other popular distributed training schemes such as DiLoCo, introducing new variations of replication schemes and challenging choices made in DeMo. Our results across language and vision domains show that FlexDeMo attains similar validation loss to hybrid sharded data parallel training employing AdamW and full gradient synchronization, while being substantially faster. FlexDeMo is thus a promising distributed training scheme for the largest machine learning models.
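
As a rough illustration of the fast-component synchronization the abstract describes, here is a minimal sketch in PyTorch. It is not the authors' implementation: magnitude top-k stands in for DeMo's frequency-domain (DCT-based) extraction, and the function name, the k_frac parameter, and the signed update are illustrative assumptions. Intra-node full sharding (e.g., FSDP across a node's accelerators) is assumed to happen outside this function.

import torch
import torch.distributed as dist

def demo_style_step(grad, momentum, beta=0.9, k_frac=0.01):
    # Accumulate momentum locally; the full momentum is never sent
    # over the network.
    momentum.mul_(beta).add_(grad)

    # Stand-in for DeMo's fast-component extraction: keep only the
    # largest-magnitude entries (DeMo itself extracts fast-moving
    # components in a DCT domain).
    flat = momentum.view(-1)
    k = max(1, int(k_frac * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)
    fast = torch.zeros_like(flat)
    fast[idx] = flat[idx]

    # Remove the transmitted part from local momentum, so only the
    # slow-moving residue keeps accumulating locally, as in DeMo.
    flat[idx] = 0.0

    # Inter-node communication: average only the sparse fast components.
    # A real implementation would send just (idx, values) pairs to
    # actually save bandwidth.
    if dist.is_initialized():
        dist.all_reduce(fast, op=dist.ReduceOp.AVG)

    return fast.view_as(momentum)

A hypothetical single-tensor usage, with the learning rate and shapes chosen arbitrarily:

p = torch.randn(1024, 1024)
m = torch.zeros_like(p)
g = torch.randn_like(p)          # gradient from a backward pass
update = demo_style_step(g, m)
p -= 1e-3 * torch.sign(update)   # DeMo applies a Lion-style signed update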

