Sharded distributed sampler for cached dataloading in DDP #195

Open · ziw-liu wants to merge 20 commits into main
Conversation

@ziw-liu (Collaborator) commented Oct 21, 2024

Add a distributed sampler that only permutes indices within each rank, improving the cache hit rate in DDP.

See viscy/scripts/shared_dict.py for usage.
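
Below is a minimal sketch of the idea (an assumption about the approach, not necessarily this PR's exact implementation): each rank gets a contiguous shard of the dataset, and shuffling happens only inside that shard, so a per-rank cache keeps seeing the same indices every epoch.

```python
import torch
from torch.utils.data import DistributedSampler


class ShardedDistributedSampler(DistributedSampler):
    """Shuffle indices only within each rank's contiguous shard (sketch)."""

    def __iter__(self):
        # Pad the index list so every rank gets the same number of samples,
        # mirroring what torch's DistributedSampler does with drop_last=False.
        indices = list(range(len(self.dataset)))
        if self.total_size > len(indices):
            indices += indices[: self.total_size - len(indices)]
        # Contiguous shard for this rank instead of the default strided split.
        start = self.rank * self.num_samples
        shard = indices[start : start + self.num_samples]
        if self.shuffle:
            # Permute within the shard only; the seed changes via set_epoch().
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            shard = [shard[i] for i in torch.randperm(len(shard), generator=g).tolist()]
        return iter(shard)
```

For contrast, the stock DistributedSampler permutes the whole dataset and then takes a strided slice per rank, so each rank sees a different subset every epoch and a per-rank cache is repeatedly invalidated; keeping the shard fixed and permuting only inside it preserves shuffling while keeping the cache warm.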

@ziw-liu marked this pull request as ready for review October 21, 2024 21:17
@ziw-liu added the enhancement (New feature or request) and translation (Image translation (VS)) labels Oct 21, 2024
@ziw-liu (Collaborator, Author) commented Oct 21, 2024

Example output:

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/hpc/mydata/ziwen.liu/anaconda/2022.05/x86_64/envs/viscy/lib/python3.11/site-packages/lightning/pytorch/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------

=== Initializing cache pool for rank 0 ===
=== Initializing cache pool for rank 1 ===
=== Initializing cache pool for rank 2 ===

  | Name  | Type   | Params | Mode
--------------------------------------
0 | layer | Linear | 2      | train
--------------------------------------
2         Trainable params
0         Non-trainable params
2         Total params
0.000     Total estimated model params size (MB)
1         Modules in train mode
0         Modules in eval mode

Adding 31 to cache dict on rank 1
Adding 32 to cache dict on rank 2
Adding 38 to cache dict on rank 2
Adding 42 to cache dict on rank 0
Adding 30 to cache dict on rank 0
Adding 36 to cache dict on rank 0
Adding 37 to cache dict on rank 1
Adding 43 to cache dict on rank 1
Adding 48 to cache dict on rank 0
Adding 34 to cache dict on rank 1
Adding 49 to cache dict on rank 1
Adding 30 to cache dict on rank 2
Adding 44 to cache dict on rank 2
Adding 41 to cache dict on rank 2
Adding 35 to cache dict on rank 2
Adding 40 to cache dict on rank 1
Adding 46 to cache dict on rank 1
Adding 39 to cache dict on rank 0
Adding 33 to cache dict on rank 0
Adding 47 to cache dict on rank 2
Adding 45 to cache dict on rank 0
Adding 24 to cache dict on rank 2
Adding 13 to cache dict on rank 1
Adding 0 to cache dict on rank 0
Adding 20 to cache dict on rank 2
Adding 4 to cache dict on rank 0
Adding 29 to cache dict on rank 2
Adding 19 to cache dict on rank 1
Adding 26 to cache dict on rank 2
Adding 28 to cache dict on rank 2
=== Starting training ===
=== Starting training epoch 0 ===
Adding 8 to cache dict on rank 0
Adding 15 to cache dict on rank 1
Adding 3 to cache dict on rank 0
Adding 21 to cache dict on rank 2
Adding 11 to cache dict on rank 1
Adding 7 to cache dict on rank 0
Adding 23 to cache dict on rank 2
Adding 27 to cache dict on rank 2
Adding 22 to cache dict on rank 2
Adding 1 to cache dict on rank 0
Adding 9 to cache dict on rank 0
Adding 5 to cache dict on rank 0
Adding 17 to cache dict on rank 1
Adding 6 to cache dict on rank 0
Adding 18 to cache dict on rank 1
Adding 16 to cache dict on rank 1
Adding 14 to cache dict on rank 1
Adding 10 to cache dict on rank 1
Adding 25 to cache dict on rank 2
Adding 2 to cache dict on rank 0
Adding 12 to cache dict on rank 1
=== Starting training epoch 1 ===
=== Starting training epoch 2 ===
=== Starting training epoch 3 ===
=== Starting training epoch 4 ===
Trainer.fit stopped: max_epochs=5 reached.
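
For reference, a hypothetical launch sketch (not the actual viscy/scripts/shared_dict.py) of a run shaped like the log above: three CPU DDP processes on the gloo backend, five epochs, a toy module with a single Linear layer (matching the 2 trainable params in the summary), and a DataLoader built around the sharded sampler sketched in the PR description. The dataset here is a plain TensorDataset standing in for the cached dataset.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch import LightningModule, Trainer


class ToyModel(LightningModule):
    """Single Linear layer -> the 2 trainable params shown in the summary."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(1, 1)
        # Hypothetical stand-in for the cached dataset (50 toy samples).
        self.dataset = TensorDataset(torch.arange(50, dtype=torch.float32).unsqueeze(1))

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        # Built here so rank/world size are available after DDP initialization.
        sampler = ShardedDistributedSampler(  # sketch from the PR description above
            self.dataset,
            num_replicas=self.trainer.world_size,
            rank=self.trainer.global_rank,
            shuffle=True,
        )
        return DataLoader(self.dataset, batch_size=4, sampler=sampler, shuffle=False)


if __name__ == "__main__":
    trainer = Trainer(
        accelerator="cpu",              # GPU present but unused, as in the log
        devices=3,                      # 3 processes -> gloo backend
        strategy="ddp",
        max_epochs=5,
        use_distributed_sampler=False,  # keep Lightning from replacing the custom sampler
        logger=False,
        enable_checkpointing=False,
    )
    trainer.fit(ToyModel())
```

With use_distributed_sampler=False, Lightning leaves the custom sampler in place, so the per-rank sharding is what produces the "Adding N to cache dict on rank R" pattern above.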

@ziw-liu changed the base branch from ram_dataloader to main October 21, 2024 23:28
persistent_workers=bool(self.num_workers),
pin_memory=True,
shuffle=False,
timeout=self.timeout,
@ziw-liu (Collaborator, Author):
@edyoshikun why is this needed?

@edyoshikun (Contributor):
At the beginning I had to add this timeout because caching could take a long time. I don't think we need it anymore; in fact, it works fine when set to 0.
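
For context, torch's DataLoader already defaults to timeout=0, which means it waits indefinitely for workers rather than raising after a fixed interval, so dropping the argument (or passing 0 explicitly) avoids spurious worker timeouts while the cache is being filled. A small self-contained illustration with placeholder data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8, dtype=torch.float32))  # placeholder data
loader = DataLoader(
    dataset,
    num_workers=2,
    persistent_workers=True,
    pin_memory=True,
    shuffle=False,
    timeout=0,  # 0 (the default) disables the per-batch worker timeout
)
```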
