Setup automated tests on a SLURM cluster #20

Merged · 20 commits · Jul 5, 2024
Changes from all commits

7 changes: 5 additions & 2 deletions .devcontainer/devcontainer.json
@@ -78,7 +78,9 @@
    // note: there's also a SLURM_TMPDIR env variable set to /tmp/slurm_tmpdir in the container.
    // NOTE: this assumes that either $SLURM_TMPDIR is set on the host machine (e.g. a compute node)
    // or that `/tmp/slurm_tmpdir` exists on the host machine.
    "source=${localEnv:SLURM_TMPDIR:/tmp/slurm_tmpdir},target=/tmp,type=bind,consistency=cached"
    "source=${localEnv:SLURM_TMPDIR:/tmp/slurm_tmpdir},target=/tmp,type=bind,consistency=cached",
    // Mount the ssh directory on the host machine to the container.
    "source=${localEnv:HOME}/.ssh,target=/home/vscode/.ssh,type=bind,readonly"
  ],
  "runArgs": [
    "--gpus",
@@ -89,7 +91,8 @@
  // doesn't fail.
  "initializeCommand": {
    "create pdm install cache": "mkdir -p ${SCRATCH?need the SCRATCH environment variable to be set.}/.cache/pdm", // todo: put this on $SCRATCH on the host (e.g. compute node)
    "create fake SLURM_TMPDIR": "mkdir -p ${SLURM_TMPDIR:-/tmp/slurm_tmpdir}" // this is fine on compute nodes
    "create fake SLURM_TMPDIR": "mkdir -p ${SLURM_TMPDIR:-/tmp/slurm_tmpdir}", // this is fine on compute nodes
    "create ssh cache dir": "mkdir -p ~/.cache/ssh"
  },
  // NOTE: Getting some permission issues with the .cache dir if mounting .cache/pdm to
  // .cache/pdm in the container. Therefore, here I'm making a symlink from ~/.cache/pdm to
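
The read-only ~/.ssh mount above lets the dev container reuse the host's SSH keys and config, so the cluster-related commands used by the workflow below can also be tried by hand from inside the container. A quick sanity check (a sketch only, assuming an SSH host alias named `mila` exists in ~/.ssh/config) could be:

    # From inside the dev container: confirm that the mounted SSH config and keys work.
    ssh mila 'hostname && echo "ssh works from inside the dev container"'
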
64 changes: 62 additions & 2 deletions .github/workflows/build.yml
@@ -4,6 +4,9 @@
name: Python application

on:
  push:
    branches:
      - master
  pull_request:

permissions:
@@ -61,7 +64,7 @@ jobs:
          name: coverage-reports-unit-tests-${{ matrix.platform }}-${{ matrix.python-version }}
          path: ./coverage.xml

  integration_tests:
  local_integration_tests:
    needs: [unit_tests]
    runs-on: self-hosted
    strategy:
@@ -91,9 +94,66 @@
          name: coverage-reports-integration-tests-${{ matrix.python-version }}
          path: ./coverage.xml

  launch-slurm-actions-runner:
    needs: [local_integration_tests]
    runs-on: self-hosted
    strategy:
      max-parallel: 5
      matrix:
        cluster: ['mila'] #, 'narval', 'beluga']
    outputs:
      job_id: ${{ steps.sbatch.outputs.job_id }}
    steps:
      - name: Copy job script to the cluster
        run: "scp actions-runner-job.sh ${{ matrix.cluster }}:actions-runner-job.sh"

      - name: Launch Slurm Actions Runner
        id: sbatch
        # TODO: for DRAC clusters, the account needs to be set somehow (and obviously not be hard-coded here).
        # Output the job ID (via $GITHUB_OUTPUT) so that the next job can use it.
        # NOTE: Could also use the --wait flag to wait for the job to finish (and have this run at the same time as the other step).
        run: |
          job_id=`ssh ${{ matrix.cluster }} 'cd $SCRATCH && sbatch --parsable $HOME/actions-runner-job.sh'`
          echo "Submitted job $job_id on the ${{ matrix.cluster }} cluster!"
          echo "job_id=$job_id" >> "$GITHUB_OUTPUT"

  # This job runs on a self-hosted GitHub Actions runner that is started inside a SLURM job on a compute node of the cluster.
  slurm_integration_tests:
    name: Run integration tests on the ${{ matrix.cluster }} cluster in job ${{ needs.launch-slurm-actions-runner.outputs.job_id }}
    needs: [launch-slurm-actions-runner]
    runs-on: ${{ matrix.cluster }}
    strategy:
      max-parallel: 5
      matrix:
        # TODO: this should be tied to the same setting in the `launch-slurm-actions-runner` job (see the sketch after this job).
        # cluster: ${{ needs.launch-slurm-actions-runner.strategy.matrix.cluster }}
        cluster: ['mila']
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: 3.12
      - run: pip install pdm
      - name: Install dependencies
        run: pdm install

      - name: Test with pytest
        run: pdm run pytest -v --cov=project --cov-report=xml --cov-append

      # TODO: Re-enable this later
      # - name: Test with pytest (only slow tests)
      #   run: pdm run pytest -v -m slow --slow --cov=project --cov-report=xml --cov-append

      - name: Store coverage report as an artifact
        uses: actions/upload-artifact@v4
        with:
          name: coverage-reports-slurm-integration-tests-${{ matrix.cluster }}
          path: ./coverage.xml
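
A possible way to address the TODO about keeping the two `cluster` matrices in sync (a sketch only; the `define-clusters` job and its output are hypothetical and not part of this PR): define the list once in a small setup job and read it back with fromJSON in both places.

  # Sketch: single source of truth for the cluster list (hypothetical job).
  define-clusters:
    runs-on: ubuntu-latest
    outputs:
      clusters: ${{ steps.set.outputs.clusters }}
    steps:
      - id: set
        run: echo 'clusters=["mila"]' >> "$GITHUB_OUTPUT"

  # Both the launch job and the test job would then add `define-clusters` to their
  # `needs:` and build their matrix from the same JSON:
  #   strategy:
  #     matrix:
  #       cluster: ${{ fromJSON(needs.define-clusters.outputs.clusters) }}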

  # https://about.codecov.io/blog/uploading-code-coverage-in-a-separate-job-on-github-actions/
  upload-coverage-codecov:
    needs: [integration_tests]
    needs: [local_integration_tests, slurm_integration_tests]
    runs-on: ubuntu-latest
    name: Upload coverage reports to Codecov
    steps:
82 changes: 82 additions & 0 deletions actions-runner-job.sh
@@ -0,0 +1,82 @@
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16G
#SBATCH --gpus=rtx8000:1
#SBATCH --time=00:30:00
#SBATCH --dependency=singleton
#SBATCH --output=logs/runner_%j.out


set -euo pipefail

# module --quiet purge
# module load cuda/12.2.2



archive="actions-runner-linux-x64-2.317.0.tar.gz"
# Look for the actions-runner archive on $SCRATCH first. Download it if it doesn't exist.
if [ ! -f "$SCRATCH/$archive" ]; then
    curl -o $SCRATCH/$archive \
        -L "https://github.com/actions/runner/releases/download/v2.317.0/$archive"
fi
# Make a symbolic link in SLURM_TMPDIR.
ln --symbolic --force $SCRATCH/$archive $SLURM_TMPDIR/$archive

cd $SLURM_TMPDIR

echo "9e883d210df8c6028aff475475a457d380353f9d01877d51cc01a17b2a91161d $archive" | shasum -a 256 -c

# Extract the installer
tar xzf ./actions-runner-linux-x64-2.317.0.tar.gz

# NOTE: Could use this to get a token programmatically!
# https://docs.github.com/en/rest/actions/self-hosted-runners?apiVersion=2022-11-28#create-a-registration-token-for-an-organization

# cluster=${SLURM_CLUSTER_NAME:-local}
cluster=${SLURM_CLUSTER_NAME:-`hostname`}

# https://docs.github.com/en/rest/actions/self-hosted-runners?apiVersion=2022-11-28#create-a-registration-token-for-a-repository
# curl -L \
# -X POST \
# -H "Accept: application/vnd.github+json" \
# -H "Authorization: Bearer <YOUR-TOKEN>" \
# -H "X-GitHub-Api-Version: 2022-11-28" \
# https://api.github.com/repos/OWNER/REPO/actions/runners/registration-token

# Example output:
# {
# "token": "XXXXX",
# "expires_at": "2020-01-22T12:13:35.123-08:00"
# }


if ! command -v jq &> /dev/null; then
    echo "the jq command doesn't seem to be installed."

    if ! test -f ~/.local/bin/jq; then
        echo "jq is not found at ~/.local/bin/jq, downloading it."
        # TODO: this assumes that ~/.local/bin is in $PATH, I'm not 100% sure that this is standard.
        mkdir -p ~/.local/bin
        wget https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 -O ~/.local/bin/jq
        chmod +x ~/.local/bin/jq
    fi
fi

# Load the user's shell setup (expected to define SH_TOKEN, the GitHub token used below).
source ~/.bash_aliases

# Use jq from $PATH if it is installed, otherwise fall back to the copy downloaded above.
JQ="$(command -v jq || echo ~/.local/bin/jq)"

TOKEN=`curl -L \
    -X POST \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer ${SH_TOKEN:?The SH_TOKEN env variable is not set}" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    https://api.github.com/repos/mila-iqia/ResearchTemplate/actions/runners/registration-token | "$JQ" -r .token`

# Create the runner and configure it programmatically
./config.sh --url https://github.com/mila-iqia/ResearchTemplate --token "$TOKEN" \
    --unattended --replace --name $cluster --labels $cluster,$SLURM_JOB_ID --ephemeral

# Launch the actions runner.
./run.sh
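
For debugging, the runner job can also be submitted and followed by hand from a login node (a sketch, assuming the script has been copied to $HOME/actions-runner-job.sh as the workflow does; the workflow submits from $SCRATCH):

    # The logs/ directory must exist for the #SBATCH --output path to be writable.
    mkdir -p logs
    job_id=$(sbatch --parsable $HOME/actions-runner-job.sh)

    # Check the job's state and follow the runner's output.
    sacct -j "$job_id" --format=JobID,State,Elapsed
    tail -f "logs/runner_${job_id}.out"
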
24 changes: 14 additions & 10 deletions pdm.lock

Some generated files are not rendered by default.

6 changes: 3 additions & 3 deletions pyproject.toml
@@ -21,12 +21,12 @@ dependencies = [
"matplotlib>=3.8.3",
"moviepy>=1.0.3",
"pygame==2.5.2",
"jax[cuda12]",
"jax[cuda12]>=0.4.28",
"brax>=0.10.3",
"tensorboard>=2.16.2",
"gymnax>=0.0.8",
"torch-jax-interop @ git+https://www.github.com/lebrice/torch_jax_interop",
"tensor-regression @ git+https://www.github.com/lebrice/tensor_regression",
"torch-jax-interop>=0.0.6",
"tensor-regression>=0.0.4",
"simple-parsing>=0.1.5",
"pydantic==2.7.4",
"milatools>=0.0.18",