PyTorch Distributed Test on High-Flyer AIHPC

We test different implementations of PyTorch distributed training and compare their performance.

We recommend using Apex for distributed training on High-Flyer AIHPC.
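
Below is a minimal sketch of what such an Apex run looks like. It is an illustrative assumption rather than the exact benchmark script of this repository; the `O1` opt level, the SGD hyper-parameters, and `train_loader` are placeholders.

```python
# Sketch of Apex mixed-precision distributed training on one node.
# Assumes the script is started by torchrun / torch.distributed.launch,
# which set the env:// rendezvous variables and LOCAL_RANK.
import os
import torch
import torch.distributed as dist
import torchvision
from apex import amp
from apex.parallel import DistributedDataParallel

def train(train_loader):
    dist.init_process_group(backend='nccl', init_method='env://')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = torchvision.models.resnet50().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # O1 mixed precision is the common Apex setting; assumed here
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
    model = DistributedDataParallel(model)

    criterion = torch.nn.CrossEntropyLoss().cuda()
    for images, labels in train_loader:
        images = images.cuda(non_blocking=True)
        labels = labels.cuda(non_blocking=True)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
```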

Dataset

ImageNet. We use ffrecord to aggregate the scattered files on High-Flyer AIHPC.

train_data = '/public_dataset/1/ImageNet/train.ffr'
val_data = '/public_dataset/1/ImageNet/val.ffr'
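
A hedged sketch of wrapping these .ffr files in a standard PyTorch dataset, assuming the ffrecord FileReader interface (`FileReader(path)`, `reader.n`, `reader.read(list_of_indices)`); the pickled `(image, label)` record layout is an assumption for illustration only.

```python
# Sketch: wrap an aggregated .ffr file in a standard PyTorch Dataset.
# The record encoding (pickled (image, label) pairs) is an assumption.
import pickle
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from ffrecord import FileReader

class FFRecordImageNet(Dataset):
    def __init__(self, fname, transform=None):
        self.reader = FileReader(fname, check_data=True)
        self.transform = transform

    def __len__(self):
        return self.reader.n                      # number of samples in the file

    def __getitem__(self, idx):
        raw = self.reader.read([idx])[0]          # read() takes a list of indices
        img, label = pickle.loads(raw)            # assumed record layout
        if self.transform is not None:
            img = self.transform(img)
        return img, label

train_set = FFRecordImageNet('/public_dataset/1/ImageNet/train.ffr')
sampler = DistributedSampler(train_set)           # one shard per GPU process
train_loader = DataLoader(train_set, batch_size=400, sampler=sampler,
                          num_workers=8, pin_memory=True)
```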

Test Model

ResNet

torchvision.models.resnet50()

Parameters

  • batch_size: 400
  • num_nodes: 1
  • gpus: 8
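
The sketch below shows one way these parameters map onto a single-node launch using torch.multiprocessing.spawn; the MASTER_ADDR/MASTER_PORT values are placeholders, and an equivalent launch is `python -m torch.distributed.launch --nproc_per_node=8 train.py`.

```python
# Sketch: spawn 8 worker processes on a single node (num_nodes=1, gpus=8).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size):
    # Placeholder rendezvous settings for a single-node run
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # ... build the ffrecord dataset, the ResNet-50 model, and run the
    # training loop (see the sketches above) ...
    dist.destroy_process_group()

if __name__ == '__main__':
    num_gpus = 8                          # gpus per node, per the parameters above
    mp.spawn(worker, args=(num_gpus,), nprocs=num_gpus)
```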

Results

Summary

  1. For now, Apex is the most effective implementation for PyTorch distributed training.
  2. The speedup scales roughly linearly with the number of GPUs.
  3. The higher the degree of parallelism, the lower the per-GPU utilization.