Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: carefully replaced minio_url with parquet_url #1280

Merged

Conversation

varshneydevansh
Copy link
Contributor

Reference Issue

What does this PR implement/fix? Explain your changes.

The minio_url field in the dataset description URL is deprecated and will be fully replaced by parquet_url in the future (in the mean time, both are available).

How should this PR be tested?

This PR could be tested by checking the dataset sent from the server, which is the xml response as it was with the minio_url parameter and the relevant parsing for data download is happening correctly or not.

Any other comments?

It's my first PR so might make mistakes and I haven't tested this on my local machine. I am still setting-up the testing environment.

@PGijsbers
Copy link
Collaborator

Thanks @varshneydevansh! I left a minor comment. Otherwise at first glance it looks good to me. However, tests should pass and there seem to be some issues. It is possible that this is not due to your changes, but due to the state of the (test) server. I am not currently up-to-date on that as I am on vacation. I requested a review from @LennartPurucker, hopefully he has time to assist you where necessary and help merge this PR :)

@LennartPurucker if you don't have time, you can remove yourself from the PR. I am on vacation and I knew Matthias is very busy, so I was hoping that you had time for this. Otherwise we'll just have to wait until people are available.

@varshneydevansh
Copy link
Contributor Author

@LennartPurucker if you don't have time, you can remove yourself from the PR. I am on vacation and I knew Matthias is very busy, so I was hoping that you had time for this. Otherwise we'll just have to wait until people are available.

That is why the last commit in this project was made last week. Now I get it. Have your holidays. Meanwhile, I am doing contributions to other projects. =)

@LennartPurucker
Copy link
Contributor

Heyho @varshneydevansh, I can help you. Sorry for the delay, I was busy over the weekend. I read the issue and PR and hope I understood the task. If my feedback sound strange, feel free to correct me as I might not have the full picture. I am now starting the review.

Copy link
Contributor

@LennartPurucker LennartPurucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest looks good to me. I can approve the PR and run the workflows once we updated the documentation. I also double checked the code and think you found all relevant occurrences of the URL.

openml/datasets/dataset.py Outdated Show resolved Hide resolved
Copy link
Contributor

@LennartPurucker LennartPurucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you.

@LennartPurucker
Copy link
Contributor

As a general comment: I am not entirely sure which version we would like this change to be in. This is technically breaking the API. So we might want to go through a deprecation cycle. @PGijsbers, when you are back, could you share your goals for this?

Once we know how to handle this, we can update the progress.rst and merge it.

@varshneydevansh Thank you very much for your help!

@codecov-commenter
Copy link

Codecov Report

Patch coverage: 100.00% and project coverage change: -2.35% ⚠️

Comparison is base (1d7fbda) 85.21% compared to head (f0fa658) 82.87%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1280      +/-   ##
===========================================
- Coverage    85.21%   82.87%   -2.35%     
===========================================
  Files           38       38              
  Lines         5108     5108              
===========================================
- Hits          4353     4233     -120     
- Misses         755      875     +120     
Files Changed Coverage Δ
openml/datasets/dataset.py 87.20% <100.00%> (-0.40%) ⬇️
openml/datasets/functions.py 90.74% <100.00%> (+0.25%) ⬆️

... and 11 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@PGijsbers
Copy link
Collaborator

This is technically breaking the API.

Not sure what you mean with this. The intended way to retrieve the dataset from OpenML is through the parquet_url, the minio_url field is currently only provided for backwards compatibility with packages (such as this) that have not updated yet.

@LennartPurucker
Copy link
Contributor

Ahh okay, I thought minio_url was used more frequently or that users could have used it. But as it was a private property anyway, it does not break the API. Sorry for the confusion.

I used a `next` header instead of a specific version since we don't know if it will be 0.15.0 or 0.14.2. We can change it before the next release.
@PGijsbers PGijsbers merged commit e1ecfe9 into openml:develop Aug 17, 2023
9 of 19 checks passed
@PGijsbers
Copy link
Collaborator

I added the progress note so we could go ahead and merge this 🎉 thanks @varshneydevansh (and Lennart)!

@varshneydevansh varshneydevansh deleted the change-minio_url-to-parquet_url-1142 branch August 17, 2023 09:07
eddiebergman pushed a commit that referenced this pull request Jan 18, 2024
* carefully replaced minio with parquet

* fix: corrected some mistakes

* fix: restored the instances of minio

* fix: updated the documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add #1280

I used a `next` header instead of a specific version since we don't know if it will be 0.15.0 or 0.14.2. We can change it before the next release.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pieter Gijsbers <[email protected]>
Co-authored-by: Lennart Purucker <[email protected]>
LennartPurucker added a commit that referenced this pull request Jan 18, 2024
* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/psf/black: 23.3.0 → 23.7.0](psf/black@23.3.0...23.7.0)
- [github.com/pycqa/flake8: 6.0.0 → 6.1.0](PyCQA/flake8@6.0.0...6.1.0)

* Raise correct TypeError and improve type check

* Type check with isinstance instead of type() ==

* Docker enhancement #1277 (#1278)

* Add multiple build platforms

* Automatically update Dockerhub description

* Launch Python instead of Bash by default

* Change `omlp` directory name to less cryptic `openml`

* Change directory to `openml` for running purpose of running script

For mounted scripts, instructions say to mount them to `/openml`,
so we have to `cd` before invoking `python`.

* Update readme to reflect updates (python by default, rename dirs)

* Add branch/code for doc and test examples as they are required

* Ship docker images with readme

* Only update readme on release, also try build docker on PR

* Update the toc descriptions

* fix: carefully replaced minio_url with parquet_url (#1280)

* carefully replaced minio with parquet

* fix: corrected some mistakes

* fix: restored the instances of minio

* fix: updated the documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add #1280

I used a `next` header instead of a specific version since we don't know if it will be 0.15.0 or 0.14.2. We can change it before the next release.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pieter Gijsbers <[email protected]>
Co-authored-by: Lennart Purucker <[email protected]>

* Pytest/utils (#1269)

* Extract mocked_perform_api_call because its independent of object

* Remove _multiprocess_can_split_ as it is a nose directive

and we use pytest

* Convert test list all

* Add markers and refactor test_list_all_for_tasks for pytest

* Add cache marker

* Converted remainder of tests to pytest

* Documented remaining Attributes of classes and functions (#1283)

Add documentation and type hints for the remaining attributes of classes and functions.

---------

Co-authored-by: Lennart Purucker <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [pre-commit.ci] pre-commit autoupdate (#1281)

updates:
- [github.com/psf/black: 23.7.0 → 23.11.0](psf/black@23.7.0...23.11.0)
- [github.com/pre-commit/mirrors-mypy: v1.4.1 → v1.7.0](pre-commit/mirrors-mypy@v1.4.1...v1.7.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [pre-commit.ci] pre-commit autoupdate (#1291)

updates:
- [github.com/psf/black: 23.11.0 → 23.12.1](psf/black@23.11.0...23.12.1)
- [github.com/pre-commit/mirrors-mypy: v1.7.0 → v1.8.0](pre-commit/mirrors-mypy@v1.7.0...v1.8.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Bump actions/checkout from 3 to 4 (#1284)

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v3...v4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/login-action from 2 to 3 (#1285)

Bumps [docker/login-action](https://github.com/docker/login-action) from 2 to 3.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](docker/login-action@v2...v3)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/metadata-action from 4 to 5 (#1286)

Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 4 to 5.
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Upgrade guide](https://github.com/docker/metadata-action/blob/master/UPGRADE.md)
- [Commits](docker/metadata-action@v4...v5)

---
updated-dependencies:
- dependency-name: docker/metadata-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/setup-qemu-action from 2 to 3 (#1287)

Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 2 to 3.
- [Release notes](https://github.com/docker/setup-qemu-action/releases)
- [Commits](docker/setup-qemu-action@v2...v3)

---
updated-dependencies:
- dependency-name: docker/setup-qemu-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/build-push-action from 4 to 5 (#1288)

Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 4 to 5.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v4...v5)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add more type annotations (#1261)

* Add more type annotations

* add to progress rst

---------

Co-authored-by: Lennart Purucker <[email protected]>

* Bump docker/setup-buildx-action from 2 to 3 (#1292)

Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 2 to 3.
- [Release notes](https://github.com/docker/setup-buildx-action/releases)
- [Commits](docker/setup-buildx-action@v2...v3)

---
updated-dependencies:
- dependency-name: docker/setup-buildx-action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump actions/setup-python from 4 to 5 (#1293)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 4 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Rework Tagging Tests for New Server Specification  (#1294)

* rework tagging test adjusted for new server specification

* update progress.rst

* ci: Update tooling (#1298)

* ci: Migrate everything to pyproject.toml

* style: Apply ruff fixes

* style: Apply (more) ruff fixes

* style(ruff): Add some file specific ignores

* ci: Fix build with setuptools

* ci(ruff): Allow prints in cli.py

* fix: Circular import

* test: Use raises(..., match=)

* fix(cli): re-add missing print statements

* fix: Fix some over ruff-ized code

* add make check to documentation

* ci: Change to `build` for sdist

* ci: Add ruff to `"test"` deps

---------

Co-authored-by: Lennart Purucker <[email protected]>

* ci: Disable 3.6 tests (#1302)

* fix: Chipping away at ruff lints (#1303)

* fix: Chipping away at ruff lints

* fix: return lockfile path

* Update openml/config.py

* Update openml/runs/functions.py

* Update openml/tasks/functions.py

* Update openml/tasks/split.py

* Update openml/utils.py

* Update openml/utils.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update openml/config.py

* Update openml/testing.py

* Update openml/utils.py

* Update openml/config.py

* Update openml/utils.py

* Update openml/utils.py

* add concurrency to workflow calls

* adjust docstring

* adjust docstring

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Lennart Purucker <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Lennart Purucker <[email protected]>

* Tagging constraints (#1305)

* update tagging constraints

* openml python tests

* small fix

* [pre-commit.ci] pre-commit autoupdate (#1306)

updates:
- https://github.com/charliermarsh/ruff-pre-commithttps://github.com/astral-sh/ruff-pre-commit
- [github.com/astral-sh/ruff-pre-commit: v0.1.5 → v0.1.11](astral-sh/ruff-pre-commit@v0.1.5...v0.1.11)
- [github.com/python-jsonschema/check-jsonschema: 0.27.1 → 0.27.3](python-jsonschema/check-jsonschema@0.27.1...0.27.3)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Linting Everything - Fix All mypy and ruff Errors (#1307)

* style: Fix linting split.py

* typing: Fix mypy errors split.py

* typing: data_feature

* typing: trace

* more linting fixes

* typing: finish up trace

* typing: config.py

* typing: More fixes on config.py

* typing: setup.py

* finalize runs linting

* typing: evaluation.py

* typing: setup

* ruff fixes across different files and mypy fixes for run files

* typing: _api_calls

* adjust setup files' linting and minor ruff changes

* typing: utils

* late night push

* typing: utils.py

* typing: tip tap tippity

* typing: mypy 78, ruff ~200

* refactor output format name and minor linting stuff

* other: midway merge

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* typing: I'm runnign out of good messages

* typing: datasets

* leinting for flows and some ruff changes

* no more mypy errors

* ruff runs and setups

* typing: Finish off mypy and ruff errors

Co-authored-by: Bilgecelik <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* style: File wide ignores of PLR0913

This is because the automated pre-commit.ci bot which made automatic
commits and pushes would think the `noqa` on the individualy overloaded
functions was not needed. After removing the `noqa`, the linter then
raised the issue

---------

Co-authored-by: eddiebergman <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Bilgecelik <[email protected]>

* ci: Remove Python 3.6/7 (#1308)

* ci: remove 3.7 patch (#1309)

* Make Test Work Again After Ruff and Linter Changes (#1310)

* mark production tests

* make production test run

* fix test bug -1/N

* add retry raise again after refactor

* fix str dict representation

* test: Fix non-writable home mocks

* testing: not not a change

* testing: trigger CI

* typing: Update typing

* ci: Update testing matrix

* testing: Fixup run flow error check

* ci: Manual dispatch, disable double testing

* ci: Prevent further ci duplication

* ci: Add concurrency checks to all

* ci: Remove the max-parallel on test ci

There are a lot less now and they cancel previous
puhes in the same pr now so it shouldn't be a problem anymore

* testing: Fix windows path generation

* add pytest for server state

* add assert cache state

* some formatting

* fix with cache fixture

* finally remove th finally

* doc: Fix link

* update test matrix

* doc: Update to just point to contributing

* add linkcheck ignore for test server

---------

Co-authored-by: eddiebergman <[email protected]>

* Add Feature Descriptions Rebase Clean (#1316)

Co-authored-by: Jan van Rijn <[email protected]>

* Make Class Label Retrieval More Lenient (#1315)

* mark production tests

* make production test run

* fix test bug -1/N

* add retry raise again after refactor

* fix str dict representation

* test: Fix non-writable home mocks

* testing: not not a change

* testing: trigger CI

* typing: Update typing

* ci: Update testing matrix

* testing: Fixup run flow error check

* ci: Manual dispatch, disable double testing

* ci: Prevent further ci duplication

* ci: Add concurrency checks to all

* ci: Remove the max-parallel on test ci

There are a lot less now and they cancel previous
puhes in the same pr now so it shouldn't be a problem anymore

* testing: Fix windows path generation

* add pytest for server state

* add assert cache state

* some formatting

* fix with cache fixture

* finally remove th finally

* doc: Fix link

* update test matrix

* doc: Update to just point to contributing

* add linkcheck ignore for test server

* add special case for class labels that are dtype string

* fix bug and add test

* formatting

---------

Co-authored-by: eddiebergman <[email protected]>

* [pre-commit.ci] pre-commit autoupdate (#1318)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/astral-sh/ruff-pre-commit: v0.1.11 → v0.1.13](astral-sh/ruff-pre-commit@v0.1.11...v0.1.13)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix: update fetching a bucket from MinIO (#1314)

* Update fetching a bucket from MinIO

Previously, each dataset had their own bucket:
  https://openml1.win.tue.nl/datasets61/dataset_61.pq

But we were advised to reduce the amount of buckets and
favor hosting many objects in hierarchical structure, so
we now have instead some prefixes to divide up the
dataset objects into separate subdirectories:

  https://openml1.win.tue.nl/datasets/0000/0061/dataset_61.pq

This commit has bypassed pre-commit. Tests should be
updated too.

* ci: Trigger ci

* ci: Add some files to .gitignore

---------

Co-authored-by: PGijsbers <[email protected]>

* Update progress.rst for minor release (#1319)

* Update progress.rst for minor release

* update version number

* fix typo

* Prepare Develop for Merge with Main (#1321)

Co-authored-by: Pieter Gijsbers <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Pieter Gijsbers <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Devansh Varshney (देवांश वार्ष्णेय) <[email protected]>
Co-authored-by: Lennart Purucker <[email protected]>
Co-authored-by: Vishal Parmar <[email protected]>
Co-authored-by: Lennart Purucker <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matthias Feurer <[email protected]>
Co-authored-by: Lennart Purucker <[email protected]>
Co-authored-by: janvanrijn <[email protected]>
Co-authored-by: Lennart Purucker <[email protected]>
Co-authored-by: Bilgecelik <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update download dataset parser for change of minio_url to parquet_url
4 participants