PyTorch Distributed Test on High-Flyer AIHPC

We test different implementations of PyTorch distributed training and compare their performance.

We recommend using Apex for distributed training on High-Flyer AIHPC.
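
Below is a minimal sketch of what such an Apex run looks like. It is an illustrative assumption rather than the exact benchmark script of this repository; the `O1` opt level, the SGD hyper-parameters, and `train_loader` are placeholders.

```python
# Sketch of Apex mixed-precision distributed training on one node.
# Assumes the script is started by torchrun / torch.distributed.launch,
# which set the env:// rendezvous variables and LOCAL_RANK.
import os
import torch
import torch.distributed as dist
import torchvision
from apex import amp
from apex.parallel import DistributedDataParallel

def train(train_loader):
    dist.init_process_group(backend='nccl', init_method='env://')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = torchvision.models.resnet50().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # O1 mixed precision is the common Apex setting; assumed here
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
    model = DistributedDataParallel(model)

    criterion = torch.nn.CrossEntropyLoss().cuda()
    for images, labels in train_loader:
        images = images.cuda(non_blocking=True)
        labels = labels.cuda(non_blocking=True)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
```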

Dataset

ImageNet. We use ffrecord to aggregate the scattered files on High-Flyer AIHPC.

train_data = '/public_dataset/1/ImageNet/train.ffr'
val_data = '/public_dataset/1/ImageNet/val.ffr'
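
A hedged sketch of wrapping these .ffr files in a standard PyTorch dataset, assuming the ffrecord FileReader interface (`FileReader(path)`, `reader.n`, `reader.read(list_of_indices)`); the pickled `(image, label)` record layout is an assumption for illustration only.

```python
# Sketch: wrap an aggregated .ffr file in a standard PyTorch Dataset.
# The record encoding (pickled (image, label) pairs) is an assumption.
import pickle
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from ffrecord import FileReader

class FFRecordImageNet(Dataset):
    def __init__(self, fname, transform=None):
        self.reader = FileReader(fname, check_data=True)
        self.transform = transform

    def __len__(self):
        return self.reader.n                      # number of samples in the file

    def __getitem__(self, idx):
        raw = self.reader.read([idx])[0]          # read() takes a list of indices
        img, label = pickle.loads(raw)            # assumed record layout
        if self.transform is not None:
            img = self.transform(img)
        return img, label

train_set = FFRecordImageNet('/public_dataset/1/ImageNet/train.ffr')
sampler = DistributedSampler(train_set)           # one shard per GPU process
train_loader = DataLoader(train_set, batch_size=400, sampler=sampler,
                          num_workers=8, pin_memory=True)
```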

Test Model

ResNet

torchvision.models.resnet50()

Parameters

  • batch_size: 400
  • num_nodes: 1
  • gpus: 8
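
The sketch below shows one way these parameters map onto a single-node launch using torch.multiprocessing.spawn; the MASTER_ADDR/MASTER_PORT values are placeholders, and an equivalent launch is `python -m torch.distributed.launch --nproc_per_node=8 train.py`.

```python
# Sketch: spawn 8 worker processes on a single node (num_nodes=1, gpus=8).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size):
    # Placeholder rendezvous settings for a single-node run
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # ... build the ffrecord dataset, the ResNet-50 model, and run the
    # training loop (see the sketches above) ...
    dist.destroy_process_group()

if __name__ == '__main__':
    num_gpus = 8                          # gpus per node, per the parameters above
    mp.spawn(worker, args=(num_gpus,), nprocs=num_gpus)
```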

Results

Summary

  1. For now, Apex is the most effective implementation for PyTorch distributed training.
  2. The speedup scales roughly linearly with the number of GPUs.
  3. The higher the degree of parallelism, the lower the per-GPU utilization.