Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented progress bar in dataset download #1336

Closed
wants to merge 2 commits into from

Conversation

megansu13
Copy link

  • Modified datasets/dataset.py to add progress bar using tqdm
  • Added new script run_openml.py for example usage of OpenML with progress bar

Reference Issue

What does this PR implement/fix? Explain your changes.

Implemented progress bar using tqdm. Within datasets/dataset.py, I imported requests and tqdm,
modified _download_data() and _parse_data_from_arff() methods, as well as added run_openml.py.

Here is what run_openml.py does:

  1. Fetching Dataset: It loads a dataset from OpenML with a specific ID (in your example, it's ID 1471, which is often the Iris dataset). The get_dataset function has the ability to download the dataset, its associated qualities, and feature metadata if not already present.
  2. Extracting Data: It retrieves the features X and the target variable y from the dataset. The target is specified by the dataset's default target attribute.
  3. Data Exploration: It prints out the first few entries of the feature set X using X.head(), gives a statistical summary of X with X.describe(), and shows the distribution of different classes/targets in y with y.value_counts().
  4. Data Splitting: It splits the data into training and testing sets with an 80-20 ratio using train_test_split. The random_state parameter ensures that the split is reproducible.
  5. Model Training: It initializes a Random Forest classifier, trains it on the training set, and then predicts on the testing set.
  6. Evaluation: It evaluates the predictions by computing the accuracy score, comparing the predicted target values y_pred against the true values y_test.
  7. Progress Bar: It demonstrates a simple progress bar using tqdm, which iterates through a range of 10, pausing for 0.5 seconds at each step.

How should this PR be tested?

This PR should be tested by running:
cd openml
python3 run_openml.py

Any other comments?

Let me know if there is anything else I should provide, I am happy to help!

megansu13 and others added 2 commits April 23, 2024 16:05
- Modified datasets/dataset.py to add progress bar using tqdm
- Added new script run_openml.py for example usage of OpenML with progress bar
@PGijsbers
Copy link
Collaborator

PGijsbers commented Apr 24, 2024

Hi! 👋 First off, thanks so much for taking the effort to contribute to OpenML! It's always a big step to make a first contribution, so we really appreciate it.

Unfortunately, it seems that our efforts got crossed a little bit. A progress bar for MinIO downloads has been added in #1335. However, this PR does also create a progress bar when downloading ARFF files, which is nice (at least for the larger arff files).

There are also a few things about the PR that would need changing regardless (I don't advise you take this step until we see how we handle the multiple PRs, but I wanted to include it for any future work):

  • The PR contains commented out code - we don't want that in our code base. Comments are used to explain why certain code exists and, in rare cases, to explain what code does. Commented out code just makes the source code harder to navigate.
  • The behavior of the _download_data function has been changed. self.data_file and self.parquet_file should be given a value in this function that correspond to the paths on disk. It is likely other parts of the API break if this is not done.
  • The functions _get_dataset_arff and _get_dataset_parquet don't just download the files (like your new function to replace them does), they also take care of caching and calculating checksums to verify the downloads were successful. These functionalities are crucial for our users, so we cannot simply replace them by this new function.
  • The new function which loads arff does not take into account the metadata when deciding the types of data in the dataframe. This will almost certainly lead to datatypes being inferred incorrectly (or at least, inconsistently).
  • A usage example to demonstrate the PR can be provided directly on Github as a code listing. Or, if appropriate, should be added to the documentation. We do not want extra example scripts in other parts of our project structure.

To avoid this from happening again, if you want to contribute, the best way to go about this is to either comment on an open issue to let us know you are interested in it, or open a new issue indicating you want to set up a PR for this. That way, we can first check whether we are (still) interested in the feature, ensure that no one else has been working on it, and provide some pointers on how we think it should best be implemented.

I will go ahead and close this PR, as it would need major rewrites. This does not mean we don't appreciate the work and are not open to work in this direction! It is just for our own organisation, so we can see the state of PRs.

We would be open to having a progress bar for (large) arff file downloads to accompany #1335, but the changes should be much less invasive. I think a good place to start would be to look at this function and the functions it calls to actually download the file. If you are still interested in this, I would propose that you have a look at that, and maybe share your findings and make a suggestion in the related issue #1333 before working on a new PR. Looking forward to hearing from you :)

@PGijsbers PGijsbers closed this Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants