Both DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel) are strategies for training models across multiple GPUs, but they differ in how each GPU stores the model's parameters, gradients, and optimizer state.
DDP — Distributed Data Parallel
- Each GPU holds a full copy of the model.
- The input batch is split across GPUs.
- Each GPU computes gradients on its slice of the batch, and the gradients are averaged across all GPUs (via all-reduce) at the end of each step so the replicas stay in sync.
- Simple and efficient, but memory-bound: every device must hold the full parameters, gradients, and optimizer state (see the sketch below).
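For context, here is a minimal sketch of a DDP training loop in PyTorch, assuming a recent PyTorch build with NCCL and a launch via torchrun; the toy model, batch size, and hyperparameters are placeholders, not recommendations.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every rank builds the *full* model; DDP keeps one replica per GPU.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        # Each rank trains on its own slice of the data (random stand-in here).
        inputs = torch.randn(32, 1024, device=local_rank)
        loss = model(inputs).sum()
        loss.backward()          # gradients are averaged across ranks (all-reduce)
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> ddp_example.py
```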
FSDP — Fully Sharded Data Parallel
- The model’s parameters, gradients, and optimizer states are sharded (split) across GPUs.
- Each GPU only stores a portion of the model at any time.
- During forward and backward passes, the required shards are temporarily gathered and then released.
- Much more memory-efficient: it allows training very large models that would not fit in memory under DDP (see the sketch below).
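Here is the equivalent sketch with FSDP, again assuming PyTorch 2.x launched via torchrun; the auto-wrap policy, layer sizes, and optimizer are illustrative choices rather than required settings.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).cuda(local_rank)

    # Wrap each Linear layer as its own FSDP unit: parameters, gradients, and
    # optimizer state are sharded across ranks, so each GPU keeps only its shard.
    model = FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({torch.nn.Linear}),
        device_id=local_rank,
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        inputs = torch.randn(8, 4096, device=local_rank)
        loss = model(inputs).sum()  # shards are all-gathered per layer, used, then freed
        loss.backward()             # gradients are reduce-scattered; each rank keeps its shard
        optimizer.step()            # the optimizer updates only the local shard
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> fsdp_example.py
```

Structurally the loop is identical to the DDP version; the difference is that full parameters only exist transiently during each wrapped layer's forward and backward, which is where the memory savings come from.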