Cleaner argument handling & nlp/common/ folder (#16)
Moving the arguments into their own dataclasses (available since Python 3.7) lets us group related arguments, such as ModelArguments and SageMakerArguments. This consolidates the SageMaker scripts into a single file and makes the arguments simpler to pass between functions.

Moves several files to common/. Users will need to set PYTHONPATH=/path/to/deep-learning-models/models/nlp. Also sets PYTHONPATH to /opt/ml/... in the SageMaker container, so those jobs should run.

Also adds support to log hyperparameters in TensorBoard.
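
The grouping idea described above can be sketched with the standard library alone. This is a minimal illustration, not the code in this commit: the commit uses `HfArgumentParser` from `transformers`, and the field names and defaults below are assumptions rather than the real contents of `common/arguments.py`.

```python
import argparse
import dataclasses
from dataclasses import dataclass


@dataclass
class ModelArguments:
    # Hypothetical fields, chosen only for illustration.
    model_type: str = "albert"
    model_size: str = "base"


@dataclass
class SageMakerArguments:
    # Hypothetical fields, chosen only for illustration.
    instance_type: str = "ml.p3dn.24xlarge"
    instance_count: int = 1


def parse_into_dataclasses(dataclass_types, argv):
    """Register one --flag per dataclass field, then split the result by group."""
    parser = argparse.ArgumentParser()
    for dc in dataclass_types:
        for field in dataclasses.fields(dc):
            parser.add_argument(f"--{field.name}", type=field.type, default=field.default)
    values = vars(parser.parse_args(argv))
    # Rebuild one instance per dataclass from the flat namespace.
    return tuple(
        dc(**{field.name: values[field.name] for field in dataclasses.fields(dc)})
        for dc in dataclass_types
    )


model_args, sm_args = parse_into_dataclasses(
    (ModelArguments, SageMakerArguments),
    ["--model_size=large", "--instance_count=8"],
)
print(model_args)
print(sm_args)
```

Each group can then be passed to a function as a single object instead of a long list of keyword arguments.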
jarednielsen authored May 23, 2020
1 parent baae6c0 commit f6f74fb
Showing 17 changed files with 459 additions and 362 deletions.
37 changes: 35 additions & 2 deletions models/nlp/albert/README.md
@@ -20,7 +20,9 @@ Language models help AWS customers to improve search results, text classificatio
3. Create an Amazon Elastic Container Registry (ECR) repository. Then build a Docker image from `docker/ngc_sagemaker.Dockerfile` and push it to ECR.

```bash
-export IMAGE=${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/${REPO}:ngc_tf21_sagemaker
export ACCOUNT_ID=
export REPO=
export IMAGE=${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/${REPO}:ngc_tf210_sagemaker
docker build -t ${IMAGE} -f docker/ngc_sagemaker.Dockerfile .
$(aws ecr get-login --no-include-email)
docker push ${IMAGE}
@@ -39,8 +41,13 @@ export SAGEMAKER_SECURITY_GROUP_IDS=sg-123,sg-456
5. Launch the SageMaker job.

```bash
-python sagemaker_pretraining.py \
# Add the main folder to your PYTHONPATH
export PYTHONPATH=$PYTHONPATH:/path/to/deep-learning-models/models/nlp

python launch_sagemaker.py \
--source_dir=. \
--entry_point=run_pretraining.py \
--sm_job_name=albert-pretrain \
--instance_type=ml.p3dn.24xlarge \
--instance_count=1 \
--load_from=scratch \
@@ -52,9 +59,35 @@ python sagemaker_pretraining.py \
--total_steps=125000 \
--learning_rate=0.00176 \
--optimizer=lamb \
--log_frequency=10 \
--name=myfirstjob
```

6. Launch a SageMaker finetuning job.

```bash
python launch_sagemaker.py \
--source_dir=. \
--entry_point=run_squad.py \
--sm_job_name=albert-squad \
--instance_type=ml.p3dn.24xlarge \
--instance_count=1 \
--load_from=scratch \
--model_type=albert \
--model_size=base \
--batch_size=6 \
--total_steps=8144 \
--warmup_steps=814 \
--learning_rate=3e-5 \
--task_name=squadv2
```

7. Enter the Docker container to debug and edit code.

```bash
docker run -it -v=/fsx:/fsx --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm ${IMAGE} /bin/bash
```
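
Step 5 asks for the `models/nlp` folder on `PYTHONPATH` so that imports such as `from common.arguments import ...` resolve. The sketch below shows the mechanism under stated assumptions: it builds a throwaway package named `common_demo` (a hypothetical stand-in for the repository's `common/` folder) and puts its parent directory on `sys.path`, which has a similar effect to exporting `PYTHONPATH` before starting Python.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_demo_package(root: Path) -> None:
    """Create root/common_demo/arguments.py as a stand-in for common/arguments.py."""
    package_dir = root / "common_demo"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "arguments.py").write_text("DEFAULT_MODEL_TYPE = 'albert'\n")


def import_from_root(root: Path, module_name: str):
    """Import module_name after making root a search location for imports."""
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_demo_package(root)
    arguments = import_from_root(root, "common_demo.arguments")
    print(arguments.DEFAULT_MODEL_TYPE)
    # Undo the path change so the demo leaves no stale entry behind.
    sys.path.remove(str(root))
```

Without the parent directory on the search path, the same import raises `ModuleNotFoundError`, which is the failure the `PYTHONPATH` export in step 5 avoids.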

<!-- ### Training results
These will be posted shortly. -->
123 changes: 0 additions & 123 deletions models/nlp/albert/arguments.py

This file was deleted.

54 changes: 54 additions & 0 deletions models/nlp/albert/launch_sagemaker.py
@@ -0,0 +1,54 @@
import argparse
import dataclasses

from transformers import HfArgumentParser

from common.arguments import (
    DataTrainingArguments,
    LoggingArguments,
    ModelArguments,
    SageMakerArguments,
    TrainingArguments,
)
from common.sagemaker_utils import launch_sagemaker_job

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser = HfArgumentParser(
        (
            ModelArguments,
            DataTrainingArguments,
            TrainingArguments,
            LoggingArguments,
            SageMakerArguments,
        )
    )
    model_args, data_args, train_args, log_args, sm_args = parser.parse_args_into_dataclasses()

    hyperparameters = dict()
    for args in [model_args, data_args, train_args, log_args]:
        for key, value in dataclasses.asdict(args).items():
            if value is not None:
                hyperparameters[key] = value
    hyperparameters["fsx_prefix"] = "/opt/ml/input/data/training"

    instance_abbr = {
        "ml.p3dn.24xlarge": "p3dn",
        "ml.p3.16xlarge": "p316",
        "ml.g4dn.12xlarge": "g4dn",
    }[sm_args.instance_type]
    job_name = f"{sm_args.sm_job_name}-{sm_args.instance_count}x{instance_abbr}"

    launch_sagemaker_job(
        hyperparameters=hyperparameters,
        job_name=job_name,
        source_dir=sm_args.source_dir,
        entry_point=sm_args.entry_point,
        instance_type=sm_args.instance_type,
        instance_count=sm_args.instance_count,
        role=sm_args.role,
        image_name=sm_args.image_name,
        fsx_id=sm_args.fsx_id,
        subnet_ids=sm_args.subnet_ids,
        security_group_ids=sm_args.security_group_ids,
    )
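
The two data-shaping steps in `launch_sagemaker.py` (merging the argument groups into one hyperparameter dict, and deriving the job name) can be tried in isolation. This is a sketch under assumptions: the dataclass fields are hypothetical stand-ins for those in `common/arguments.py`, and `merge_hyperparameters` and `build_job_name` are illustrative helper names that do not exist in the commit. The explicit `ValueError` is also an addition; the script above raises a plain `KeyError` for an unlisted instance type.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArguments:
    # Hypothetical fields; the real ones live in common/arguments.py.
    model_type: str = "albert"
    model_size: str = "base"
    checkpoint_path: Optional[str] = None


@dataclass
class TrainingArguments:
    # Hypothetical fields; the real ones live in common/arguments.py.
    batch_size: int = 32
    learning_rate: float = 0.00176


def merge_hyperparameters(*arg_groups) -> dict:
    """Flatten several dataclass instances into one dict, skipping unset (None) values."""
    hyperparameters = {}
    for args in arg_groups:
        for key, value in dataclasses.asdict(args).items():
            if value is not None:
                hyperparameters[key] = value
    return hyperparameters


# Same mapping as in launch_sagemaker.py.
INSTANCE_ABBREVIATIONS = {
    "ml.p3dn.24xlarge": "p3dn",
    "ml.p3.16xlarge": "p316",
    "ml.g4dn.12xlarge": "g4dn",
}


def build_job_name(sm_job_name: str, instance_type: str, instance_count: int) -> str:
    """Append '<count>x<abbreviation>' to the job name, as the launcher does."""
    try:
        instance_abbr = INSTANCE_ABBREVIATIONS[instance_type]
    except KeyError:
        raise ValueError(f"No abbreviation defined for {instance_type!r}") from None
    return f"{sm_job_name}-{instance_count}x{instance_abbr}"


hyperparameters = merge_hyperparameters(ModelArguments(), TrainingArguments(batch_size=6))
job_name = build_job_name("albert-squad", "ml.p3dn.24xlarge", 1)
print(hyperparameters)
print(job_name)
```

Dropping `None` values keeps unset options out of the hyperparameters passed to the job, so the entry point falls back to its own defaults for them.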