Cloud Parallelism Project for TCSS 562 Fall 2024
Objective: Evaluate and optimize GPU hardware and pricing configurations for efficient, cost-effective training of large-scale Automatic Speech Recognition (ASR) models.
- Designed and implemented a distributed training pipeline for Wav2Vec2 on the 100-hour LibriSpeech dataset, leveraging AWS SageMaker's smdistributed module for multi-GPU parallelism.
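A minimal sketch of how such a training job might be launched with the SageMaker Python SDK; the entry-point script name, hyperparameter values, and S3 path are illustrative assumptions, not the project's actual settings:

```python
# Enabling SageMaker's smdistributed data-parallel library is done via the
# estimator's `distribution` argument.
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

hyperparameters = {
    "epochs": 3,                  # illustrative values only
    "per_device_batch_size": 8,
    "dataset": "librispeech_clean_100",
}

# Launch sketch (requires AWS credentials; shown commented out):
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="train_wav2vec2.py",   # hypothetical script name
#     role=role,
#     instance_type="ml.g4dn.12xlarge",  # 4 NVIDIA T4 GPUs per node
#     instance_count=1,
#     framework_version="2.0",
#     py_version="py310",
#     distribution=distribution,
#     hyperparameters=hyperparameters,
# )
# estimator.fit({"train": "s3://my-bucket/librispeech/train-clean-100"})
```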
- Conducted a comparative analysis of single-GPU (ml.g4dn.2xlarge) vs. multi-GPU (ml.g4dn.12xlarge) setups, isolating the impact of hardware distribution on training throughput, GPU utilization, and network latency.
- Experimentally determined that multi-GPU setups reduced per-epoch training time by up to 50% and nearly doubled throughput (from 30.8 to 61.3 samples/second on 4 GPUs), while lowering inter-node communication latency by 7.6%.
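The throughput figures above imply the following speedup and per-GPU scaling efficiency (a quick back-of-the-envelope check using only the numbers reported in this summary):

```python
single_gpu = 30.8   # samples/s on ml.g4dn.2xlarge (1 GPU)
four_gpu = 61.3     # samples/s on ml.g4dn.12xlarge (4 GPUs)

speedup = four_gpu / single_gpu   # ~1.99x, i.e. "nearly doubled"
efficiency = speedup / 4          # ~0.50: each GPU runs at ~50% of ideal
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```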
- Analyzed cost-performance trade-offs between on-demand and spot-instance clusters, showing that spot instances achieved up to 35% cost savings with minimal performance trade-offs, reducing total training cost by $7.96 for the 12-GPU cluster.
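The two reported figures (35% savings, $7.96 saved) jointly imply the run's approximate totals, assuming the 35% discount applied to the whole 12-GPU run:

```python
savings = 7.96           # $ saved on the 12-GPU cluster (reported)
savings_fraction = 0.35  # spot vs. on-demand discount (reported)

# Implied totals, derived only from the two reported numbers:
implied_on_demand_total = savings / savings_fraction     # ~$22.74
implied_spot_total = implied_on_demand_total - savings   # ~$14.78
```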
- Identified diminishing returns when scaling GPU clusters beyond 8 GPUs due to data-pipeline and synchronization bottlenecks, highlighting critical areas for optimizing distributed training.
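The diminishing returns beyond 8 GPUs are the behavior Amdahl's law predicts when part of each step (data loading, gradient all-reduce) does not parallelize. A minimal illustration, with the 10% serial fraction being an assumed value for demonstration, not a measured one:

```python
def amdahl_speedup(n_gpus: int, serial_fraction: float) -> float:
    """Ideal speedup on n_gpus when serial_fraction of the work
    (e.g. data pipeline + synchronization) cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_gpus)

# With an assumed 10% serial fraction, marginal gains shrink quickly:
for n in (1, 4, 8, 12):
    print(n, round(amdahl_speedup(n, 0.10), 2))  # 8 GPUs -> ~4.7x
```

The gain from 8 to 12 GPUs is noticeably smaller than from 4 to 8, matching the observed plateau.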