Tutorial 4: Train and test with existing models

MMSegmentation supports training and testing models on a variety of setups: single GPU, multiple GPUs and multiple machines, and Slurm-managed clusters, each described below. Through this tutorial, you will learn how to train and test using the scripts provided by MMSegmentation.

Training and testing on a single GPU

Training on a single GPU

We provide tools/train.py to launch training jobs on a single GPU. The basic usage is as follows.

python tools/train.py  ${CONFIG_FILE} [optional arguments]

This tool accepts several optional arguments, including:

  • --work-dir ${WORK_DIR}: Override the working directory.
  • --amp: Use auto mixed precision training.
  • --resume: Resume from the latest checkpoint in the work_dir automatically.
  • --cfg-options ${OVERRIDE_CONFIGS}: Override some settings in the used config; key-value pairs in xxx=yyy format will be merged into the config file. For example, '--cfg-options model.encoder.in_channels=6'. Please see this guide for more details.

Below are the optional arguments for launching a distributed job:

  • --launcher: The launcher for distributed job initialization. Allowed choices are none, pytorch, slurm, mpi. In particular, if set to none, the job runs in non-distributed mode.
  • --local_rank: ID of the local rank. If not specified, it is set to 0.
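
For example, combining several of the training options above on a single GPU (the batch size override assumes your config defines train_dataloader.batch_size; adjust it to your own config):

python tools/train.py configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py \
    --amp \
    --resume \
    --cfg-options train_dataloader.batch_size=2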

Note: Difference between the argument --resume and the field load_from in the config file:

--resume only determines whether to resume from the latest checkpoint in the work_dir. It is usually used to resume a training process that was interrupted accidentally.

load_from specifies the checkpoint to be loaded, and training starts from iteration 0. It is usually used for fine-tuning.

If you would like to resume training from a specific checkpoint, you can use:

python tools/train.py ${CONFIG_FILE} --resume --cfg-options load_from=${CHECKPOINT}
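
Conversely, for fine-tuning you can set load_from alone (without --resume), so that the checkpoint weights are loaded but training starts from iteration 0:

python tools/train.py ${CONFIG_FILE} --cfg-options load_from=${CHECKPOINT}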

Training on CPU: Training on the CPU is the same as single-GPU training if the machine has no GPU. If the machine has GPUs but you do not want to use them, simply disable them before training:

export CUDA_VISIBLE_DEVICES=-1

And then run the script above.
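
Equivalently, you can disable GPUs for a single command by prefixing the environment variable:

CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [optional arguments]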

Testing on a single GPU

We provide tools/test.py to launch testing jobs on a single GPU. The basic usage is as follows.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

This tool accepts several optional arguments, including:

  • --work-dir: If specified, results will be saved in this directory. If not specified, the results will be automatically saved to work_dirs/{CONFIG_NAME}.
  • --show: Show prediction results at runtime; available when --show-dir is not specified.
  • --show-dir: Directory where painted images will be saved. If specified, the visualized segmentation masks will be saved under work_dir/timestamp/show_dir.
  • --wait-time: The display interval in seconds; takes effect when --show is activated. Defaults to 2.
  • --cfg-options: If specified, key-value pairs in xxx=yyy format will be merged into the config file.
  • --tta: Enable test-time augmentation.
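
For example, a sketch that saves outputs and visualizations to custom directories while enabling test-time augmentation (the directory names are placeholders):

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} \
    --work-dir work_dirs/my_test \
    --show-dir vis_results \
    --tta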

Testing on CPU: Testing on the CPU is the same as single-GPU testing if the machine has no GPU. If the machine has GPUs but you do not want to use them, simply disable them before testing:

export CUDA_VISIBLE_DEVICES=-1

Then run the script above.

Training and testing on multiple GPUs and multiple machines

Training on multiple GPUs

OpenMMLab 2.0 implements distributed training with MMDistributedDataParallel. We provide tools/dist_train.sh to launch training on multiple GPUs.

The basic usage is as follows:

sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]

Optional arguments remain the same as stated above, with an additional positional argument that specifies the number of GPUs.

An example:

# checkpoints and logs saved in WORK_DIR=work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512/
# If work_dir is not set, it will be generated automatically.
sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py 8 --work-dir work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512

Note: During training, checkpoints and logs are saved in the same folder structure as the config file under work_dirs/. A custom work directory is not recommended since evaluation scripts infer work directories from the config file name. If you want to save your weights somewhere else, please use a symlink, for example:

ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs

Testing on multiple GPUs

We provide tools/dist_test.sh to launch testing on multiple GPUs. The basic usage is as follows.

sh tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]

Optional arguments remain the same as stated above, with an additional positional argument that specifies the number of GPUs.

An example:

./tools/dist_test.sh configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth 4

Launch multiple jobs on a single machine

If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflict. Otherwise, there will be an error message saying RuntimeError: Address already in use. If you use dist_train.sh to launch training jobs, you can set the port in commands with the environment variable PORT.

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4

Training with multiple machines

MMSegmentation relies on the torch.distributed package for distributed training. Thus, as a basic usage, one can launch distributed training via PyTorch's launch utility.

If your machines are simply connected via Ethernet, you can run the following commands.

On the first machine:

NNODES=2 NODE_RANK=0 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}

On the second machine:

NNODES=2 NODE_RANK=1 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}

Usually, it is slow if you do not have high-speed networking like InfiniBand.

Manage jobs with Slurm

Slurm is a good job scheduling system for computing clusters.

Training on a cluster with Slurm

On a cluster managed by Slurm, you can use slurm_train.sh to spawn training jobs. It supports both single-node and multi-node training.

The basic usage is as follows:

[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} [optional arguments]

Below is an example of using 4 GPUs to train PSPNet on a Slurm partition named dev, with the work-dir set to a shared file system.

GPUS=4 sh tools/slurm_train.sh dev pspnet configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dir/pspnet

You can check the source code to review full arguments and environment variables.
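
Besides GPUS, the script typically also reads environment variables such as GPUS_PER_NODE and CPUS_PER_TASK; a sketch for a 16-GPU job spread over two nodes (verify the exact variable names against tools/slurm_train.sh in your checkout):

GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=4 sh tools/slurm_train.sh dev pspnet configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dir/pspnet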

Testing on a cluster with Slurm

Similar to the training task, MMSegmentation provides slurm_test.sh to launch testing jobs.

The basic usage is as follows:

[GPUS=${GPUS}] sh tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

You can check the source code to review full arguments and environment variables.
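
For example, to evaluate the PSPNet checkpoint from the multi-GPU test example above on a Slurm partition named dev (the partition and job names here are illustrative):

GPUS=4 sh tools/slurm_test.sh dev pspnet_test configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth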

Note: When using Slurm, the port option needs to be set in one of the following ways:

  1. Set the port through --cfg-options. This is recommended since it does not change the original configs.

    GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29500
    GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29501
  2. Modify the config files to set different communication ports. In config1.py:

    env_cfg = dict(dist_cfg=dict(backend='nccl', port=29500))

    In config2.py:

    env_cfg = dict(dist_cfg=dict(backend='nccl', port=29501))

    Then you can launch two jobs with config1.py and config2.py.

    CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
    CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
  3. Set the port in the command using the environment variable 'MASTER_PORT':

CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}

Testing and saving segment files

Basic Usage

When you want to save the results, you can use --out to specify the output directory.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --out ${OUTPUT_DIR}

Here is an example of saving the predicted results of the model fcn_r50-d8_4xb4-80k_ade20k-512x512 on the ADE20K validation dataset.

python tools/test.py configs/fcn/fcn_r50-d8_4xb4-80k_ade20k-512x512.py ckpt/fcn_r50-d8_512x512_80k_ade20k_20200614_144016-f8ac5082.pth --out work_dirs/format_results

You can also modify the config file to define output_dir. Again taking fcn_r50-d8_4xb4-80k_ade20k-512x512 as an example, just add test_evaluator to configs/fcn/fcn_r50-d8_4xb4-80k_ade20k-512x512.py:

test_evaluator = dict(type='IoUMetric', iou_metrics=['mIoU'], output_dir='work_dirs/format_results')

Then run the command without --out:

python tools/test.py configs/fcn/fcn_r50-d8_4xb4-80k_ade20k-512x512.py ckpt/fcn_r50-d8_512x512_80k_ade20k_20200614_144016-f8ac5082.pth

If you would like to only save the predicted results without evaluation (for example, when the official dataset does not release test annotations), set format_only=True and modify test_dataloader. Since the dataset has no annotations, we remove dict(type='LoadAnnotations') from the test pipeline. Here is an example configuration:

test_evaluator = dict(
    type='IoUMetric',
    iou_metrics=['mIoU'],
    format_only=True,
    output_dir='work_dirs/format_results')
test_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='ADE20KDataset',
        data_root='data/ade/release_test',
        data_prefix=dict(img_path='testing'),
        # we don't load annotation in test transform pipeline.
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='Resize', scale=(2048, 512), keep_ratio=True),
            dict(type='PackSegInputs')
        ]))

Then run the test command:

python tools/test.py configs/fcn/fcn_r50-d8_4xb4-80k_ade20k-512x512.py ckpt/fcn_r50-d8_512x512_80k_ade20k_20200614_144016-f8ac5082.pth

Testing the Cityscapes dataset and saving predicted segment files

We recommend CityscapesMetric, a wrapper around the Cityscapes SDK, when you want to save the predicted results of the Cityscapes test dataset for submission to the Cityscapes test server. Here is an example configuration:

test_evaluator = dict(
    type='CityscapesMetric',
    format_only=True,
    keep_results=True,
    output_dir='work_dirs/format_results')
test_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='CityscapesDataset',
        data_root='data/cityscapes/',
        data_prefix=dict(img_path='leftImg8bit/test'),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='Resize', scale=(2048, 1024), keep_ratio=True),
            dict(type='PackSegInputs')
        ]))

Then run the test command, for example:

python tools/test.py configs/fcn/fcn_r18-d8_4xb2-80k_cityscapes-512x1024.py ckpt/fcn_r18-d8_512x1024_80k_cityscapes_20201225_021327-6c50f8b4.pth
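
Assuming output_dir is set as above, the painted prediction files are written to work_dirs/format_results. A minimal packaging sketch for submission (check the Cityscapes benchmark instructions for the exact format they expect):

cd work_dirs/format_results
zip -r ../cityscapes_results.zip .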