This example shows how to use Arena to submit a distributed PyTorch job that mounts an NFS data volume. The sample downloads its source code from a git URL.

  1. Set up an NFS server (refer to: https://www.cnblogs.com/weifeng1463/p/10037803.html).
    # install nfs server
    ➜ yum install nfs-utils -y
    # Create local directory of NFS server
    ➜ mkdir -p /root/nfs/data
    # Configure nfs server
    ➜ cat /etc/exports
    /root/nfs/data *(rw,no_root_squash)
    # Start nfs server
    ➜ systemctl start nfs;  systemctl start rpcbind
    ➜ systemctl enable nfs
    Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service.
  2. Download training data to shared directory of NFS.
    # Get information of NFS server by showmount, 172.16.0.200 is the host ip of NFS server
    ➜ showmount -e 172.16.0.200
    Export list for 172.16.0.200:
    /root/nfs/data *
    # Enter the shared directory
    ➜ cd /root/nfs/data
    # Prepare the training data in the shared directory
    ➜ pwd
    /root/nfs/data
    # MNIST -> That's the training data we need
    ➜ ll
    total 8.0K
    drwxr-xr-x 4  502 games 4.0K Jun 17 16:05 data
    drwxr-xr-x 4 root root  4.0K Jun 23 15:17 MNIST
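    If the MNIST data has not been prepared yet, one way to populate the shared directory is to let torchvision download it. This is only a minimal sketch, assuming Python and torchvision are available on the NFS host; copying the data in by any other means works just as well:

    # download_mnist.py -- hypothetical helper, not part of the sample repository
    from torchvision import datasets

    # Downloading into the NFS export creates the MNIST/ directory shown in the listing above
    datasets.MNIST(root="/root/nfs/data", train=True, download=True)
    datasets.MNIST(root="/root/nfs/data", train=False, download=True)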
  3. Create PV.
    # Note: Typesetting may cause yaml indentation problems
    ➜ cat nfs-pv.yaml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pytorchdata
      labels:
        pytorchdata: nas-mnist
    spec:
      persistentVolumeReclaimPolicy: Retain
      capacity:
        storage: 10Gi
      accessModes:
      - ReadWriteMany
      nfs:
        server: 172.16.0.200
        path: "/root/nfs/data"
    
    ➜ kubectl create -f nfs-pv.yaml
    persistentvolume/pytorchdata created
    ➜ kubectl get pv | grep pytorchdata
    pytorchdata   10Gi       RWX            Retain           Bound    default/pytorchdata                           7m38s
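    Note that a PV only reports Bound once a matching PVC exists; immediately after kubectl create -f nfs-pv.yaml its status is Available, and the output above was captured after the PVC from the next step had been created.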
  4. Create PVC.
    ➜ cat nfs-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pytorchdata
      annotations:
        description: "this is the mnist demo"
        owner: Tom
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
      selector:
        matchLabels:
          pytorchdata: nas-mnist

    ➜ kubectl create -f nfs-pvc.yaml
    persistentvolumeclaim/pytorchdata created
    ➜ kubectl get pvc | grep pytorchdata
    pytorchdata   Bound    pytorchdata   10Gi       RWX                           2m3s
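    Although the claim only requests 5Gi, it binds to the 10Gi PV selected by the pytorchdata: nas-mnist label, which is why the PVC reports a capacity of 10Gi.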
  5. Check the data volume.
    ➜ arena data list
    NAME         ACCESSMODE     DESCRIPTION             OWNER  AGE
    pytorchdata  ReadWriteMany  this is the mnist demo  Tom    2m
  6. Submit the PyTorch job, mounting the distributed storage volume with `--data pvc_name:container_path`.
    ➜ arena --loglevel info submit pytorch \
            --name=pytorch-data \
            --gpus=1 \
            --workers=2 \
            --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
            --sync-mode=git \
            --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
            --data=pytorchdata:/mnist_data \
            "python /root/code/mnist-pytorch/mnist.py --backend gloo --data /mnist_data/data"
    configmap/pytorch-data-pytorchjob created
    configmap/pytorch-data-pytorchjob labeled
    pytorchjob.kubeflow.org/pytorch-data created
    INFO[0000] The Job pytorch-data has been submitted successfully
    INFO[0000] You can run `arena get pytorch-data --type pytorchjob` to check the job status
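    For reference, the sketch below shows what a training script such as mnist.py typically does with the --backend and --data flags. It is illustrative only; the actual script in the sample repository may differ.

    # illustrative sketch, not the real mnist.py from the sample repository
    import argparse
    import torch.distributed as dist
    from torchvision import datasets, transforms

    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", default="gloo")          # gloo also works without NCCL/GPUs
    parser.add_argument("--data", default="/mnist_data/data") # path mounted via --data pvc_name:/mnist_data
    args = parser.parse_args()

    # The PyTorchJob operator injects MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE into each pod,
    # so env:// initialization needs no extra arguments.
    dist.init_process_group(backend=args.backend, init_method="env://")

    # Read MNIST from the shared NFS volume instead of downloading it in every worker
    train_set = datasets.MNIST(args.data, train=True, download=False,
                               transform=transforms.ToTensor())
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} sees {len(train_set)} samples")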
  7. Get the status of the volume pytorchdata in one of the instances with `kubectl describe`.
    # Get the details of this job
    ➜ arena get pytorch-data
    STATUS: SUCCEEDED
    NAMESPACE: default
    PRIORITY: N/A
    TRAINING DURATION: 56s
    
    NAME          STATUS     TRAINER     AGE  INSTANCE               NODE
    pytorch-data  SUCCEEDED  PYTORCHJOB  1m   pytorch-data-master-0  172.16.0.210
    pytorch-data  SUCCEEDED  PYTORCHJOB  1m   pytorch-data-worker-0  172.16.0.210
    
    # Get status of volume `pytorchdata` from `pytorch-data-master-0`
    ➜ kubectl describe pod pytorch-data-master-0 | grep pytorchdata -C 3
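    The output should include a volume backed by the pytorchdata PVC mounted into the container at /mnist_data, matching the --data pytorchdata:/mnist_data flag used when the job was submitted.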