OpenML API parquet migration - Phase 2 #1154

Open
3 tasks
prabhant opened this issue Jun 20, 2022 · 1 comment
Labels
CoreSystem (all issues related to the API specification and database schema), priority: high

Comments

@prabhant
Contributor

We already support dataset download via parquet and MinIO. The next phase is uploading these datasets.

We need to allow parquet upload directly to MinIO. This requires changes to three components:

  • OpenML client APIs (Python/R/Java): To convert dataset directly from dataframe to parquet and send an upload request.
  • OpenML API: Assign the uploaded dataset ID and then transfer it to the MinIO. (We already have scripts for the transfer.)
  • OpenML Evaluation engine: To process the parquet datasets.
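
As a rough illustration of the client-side step in the first bullet, here is a minimal Python sketch, not the actual openml-python implementation. The endpoint URL, the function names, and the raw-bytes request format are all assumptions made for illustration; the real upload endpoint is not specified in this issue.

```python
import io
import urllib.request

# Hypothetical endpoint: this issue does not name the upload URL.
UPLOAD_URL = "https://example.org/api/datasets/parquet"


def dataframe_to_parquet_bytes(df) -> bytes:
    """Serialize a pandas DataFrame to parquet in memory.

    Needs pandas plus a parquet engine such as pyarrow at call time.
    """
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)
    return buffer.getvalue()


def build_upload_request(
    parquet_bytes: bytes, url: str = UPLOAD_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the parquet payload.

    Sending the file as a raw octet stream is an assumption; the server
    may well expect a multipart form with metadata instead.
    """
    return urllib.request.Request(
        url,
        data=parquet_bytes,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )
```

With pandas and pyarrow installed, `build_upload_request(dataframe_to_parquet_bytes(df))` would produce a request that `urllib.request.urlopen` could send once a real endpoint exists.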

@PGijsbers @joaquinvanschoren @janvanrijn

@prabhant added the priority: high and CoreSystem labels on Jun 20, 2022
@PGijsbers
Contributor

I created openml/openml-python#1141.
Can you elaborate on the new sequence of communication for uploading the dataset from a client API?
Are the new endpoints already available?

"Assign the uploaded dataset ID and then transfer it to the MinIO."

This suggests that the server will put the dataset in the MinIO bucket, while

"To convert dataset directly from dataframe to parquet and send an upload request."

makes it sound as though the client is expected to upload directly to the MinIO server.
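
To make the two readings concrete, here is a small in-memory Python sketch of both sequences. Everything in it is hypothetical: `FakeMinio` and `FakeServer` are stand-ins rather than the real MinIO client or OpenML API, and the object key layout is invented. Neither flow is confirmed as the intended design.

```python
import itertools
from typing import Tuple


class FakeMinio:
    """In-memory stand-in for a MinIO bucket (not the real client)."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


class FakeServer:
    """In-memory stand-in for the OpenML API server."""

    def __init__(self, bucket: FakeMinio) -> None:
        self.bucket = bucket
        self._ids = itertools.count(1)

    @staticmethod
    def object_key(dataset_id: int) -> str:
        # Hypothetical key layout, chosen only for this sketch.
        return f"datasets/{dataset_id}.parquet"

    def upload(self, data: bytes) -> int:
        """Reading A: the server receives the file, assigns an ID,
        and transfers the file to MinIO itself."""
        dataset_id = next(self._ids)
        self.bucket.put(self.object_key(dataset_id), data)
        return dataset_id

    def reserve(self) -> Tuple[int, str]:
        """Reading B: the server only assigns an ID and tells the
        client which object key to write to."""
        dataset_id = next(self._ids)
        return dataset_id, self.object_key(dataset_id)


def client_upload_via_server(server: FakeServer, data: bytes) -> int:
    # One request: the client never talks to MinIO.
    return server.upload(data)


def client_upload_direct(server: FakeServer, bucket: FakeMinio, data: bytes) -> int:
    # Two steps: get an ID and target from the server, then write to MinIO.
    dataset_id, key = server.reserve()
    bucket.put(key, data)
    return dataset_id
```

In the first flow the client needs no MinIO credentials; in the second the server would have to hand out some form of write access, which is part of what the question above is asking about.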


2 participants