fix: training-integration uat by updating image #68

Merged: 2 commits, Jun 17, 2024

orfeas-k

As the updated notebook description mentions, this PR makes two changes:

  • Update the PyTorchJob image to match the upstream E2E tests. The only difference is that we pin the v1-3a360ba tag (the most recently pushed image) instead of latest, since a floating tag can result in inconsistent test runs.
  • Update the registry from which the PaddleJob image is pulled, again following the upstream E2E tests.
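The reasoning behind avoiding latest can be expressed as a small check. The following is a minimal sketch, not code from this PR: the helper names and the example.registry/pytorch-job image name are hypothetical, and only the v1-3a360ba tag comes from the description above.

```python
def image_tag(image: str) -> str:
    """Return the tag of a container image reference, or "latest" if no tag is given."""
    # Drop a digest suffix such as "@sha256:..." if present.
    name = image.split("@", 1)[0]
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port (e.g. "localhost:5000/img").
    last = name.rsplit("/", 1)[-1]
    if ":" in last:
        return last.rsplit(":", 1)[1]
    return "latest"


def is_pinned(image: str) -> bool:
    """True if the image is referenced by digest or by a tag other than "latest"."""
    return "@" in image or image_tag(image) != "latest"


# Hypothetical image names, used only for illustration.
print(is_pinned("example.registry/pytorch-job:v1-3a360ba"))
print(is_pinned("example.registry/pytorch-job:latest"))
```

A check like this could run over the image fields in a notebook before a job is submitted, so that an unpinned image is caught before it causes a flaky run.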

Closes canonical/bundle-kubeflow#894, canonical/bundle-kubeflow#910

Test the PR

  1. Deploy CKF in any environment. The issue reproduced on MicroK8s, AKS, and EKS alike, in every case with the juju 3.5.0 agent.
  2. Run the training integration UAT, either in a notebook or by checking out this branch in the uats repo and running tox -e kubeflow-remote -- --filter="training"
  3. Confirm that it succeeds.
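For scripted runs, step 2 can be wrapped in a few lines of Python. This is a sketch under assumptions: the function names and the repo_dir parameter are hypothetical, and only the tox environment name and the filter value are taken from the command above.

```python
import subprocess
from typing import List


def uat_command(test_filter: str = "training") -> List[str]:
    """Build the tox invocation from step 2 as an argument list."""
    # Everything after "--" is forwarded by tox to the underlying test command.
    return ["tox", "-e", "kubeflow-remote", "--", f"--filter={test_filter}"]


def run_uat(repo_dir: str, test_filter: str = "training") -> int:
    """Run the UAT from a checkout of the uats repo and return tox's exit code."""
    # repo_dir is a hypothetical path to a local checkout of this branch.
    result = subprocess.run(uat_command(test_filter), cwd=repo_dir, check=False)
    return result.returncode
```

Calling run_uat with the path to a local checkout returns 0 when tox reports success. Passing the arguments as a list avoids shell quoting issues around the filter value.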


@DnPlas DnPlas left a comment

Two tiny comments, other than that LGTM.

tests/notebooks/training/training-integration.ipynb (outdated review thread, resolved)

DnPlas commented Jun 12, 2024

I can confirm the tests pass now:

============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.8
cachedir: .pytest_cache
rootdir: /tests/.worktrees/bd99a83476a34b1ea6c6a7d9ca4ceff6191eab3d/tests
configfile: pytest.ini
plugins: anyio-3.6.2
collecting ... collected 9 items / 8 deselected / 1 selected

test_notebooks.py::test_notebook[training-integration]
-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
PASSED                                                                   [100%]

================= 1 passed, 8 deselected in 917.94s (0:15:17) ==================
PASSED
------------------------------------------------------------------------------- live log teardown --------------------------------------------------------------------------------
INFO     test_kubeflow_workloads:test_kubeflow_workloads.py:82 Deleting Profile test-kubeflow...
INFO     httpx:_client.py:1013 HTTP Request: DELETE https://172.31.15.25:16443/apis/kubeflow.org/v1/profiles/test-kubeflow "HTTP/1.1 200 OK"
INFO     test_kubeflow_workloads:test_kubeflow_workloads.py:141 Deleting Job test-kubeflow/test-kubeflow...
INFO     httpx:_client.py:1013 HTTP Request: DELETE https://172.31.15.25:16443/apis/batch/v1/namespaces/test-kubeflow/jobs/test-kubeflow "HTTP/1.1 200 OK"


========================================================================= 2 passed in 1154.92s (0:19:14) =========================================================================
  kubeflow-remote: OK (1172.59=setup[16.52]+cmd[1156.07] seconds)
  congratulations :) (1172.67 seconds)

orfeas-k merged commit ad0922d into main on Jun 17, 2024
1 check passed
orfeas-k deleted the kf-5650-fix-training-operator branch on June 17, 2024, 10:58
Successfully merging this pull request may close these issues.

ci(aks): Training-operator UAT fails on AKS k8s 1.28
3 participants