OCR0077: Freeze all test datasets from the training-data Hugging Face repos into a benchmark HF repo. #8

Open
6 tasks done
10kalden opened this issue Jan 2, 2025 · 4 comments
@10kalden
Contributor

10kalden commented Jan 2, 2025

Description:
We need to consolidate the test datasets from the data distributions of all the models in the OpenPecha Hugging Face (HF) datasets into a single benchmark HF repo.

Implementation:

  1. Extract all the test data from each repo, download the images, and create a CSV with metadata such as id, image_url, image_label, print_method, script, etc. (upload the required images to S3 and create the URLs).
  2. Convert the CSV to Parquet.
  3. Upload all the test datasets to a single repo under the OpenPecha HF datasets.
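
Steps 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not the scripts in the project repo: the function names `write_metadata_csv` and `csv_to_parquet` are hypothetical, the column order is an assumption based on the fields listed above, and the Parquet step assumes pandas with pyarrow or fastparquet is installed.

```python
import csv
from pathlib import Path

# Columns named in the card; the exact names and their order are assumptions.
COLUMNS = ["id", "image_url", "image_label", "print_method", "script"]


def write_metadata_csv(rows, csv_path):
    """Write one CSV row per test image, failing early on missing fields."""
    csv_path = Path(csv_path)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:
            missing = [c for c in COLUMNS if c not in row]
            if missing:
                raise ValueError(f"row {row.get('id')!r} is missing {missing}")
            # Keep only the agreed columns so every dataset has the same shape.
            writer.writerow({c: row[c] for c in COLUMNS})
    return csv_path


def csv_to_parquet(csv_path, parquet_path):
    """Convert the CSV to Parquet (needs pandas plus pyarrow or fastparquet)."""
    import pandas as pd  # imported lazily so the CSV step has no extra dependencies

    pd.read_csv(csv_path).to_parquet(parquet_path, index=False)
    return Path(parquet_path)
```

Raising on a missing field, rather than writing an empty cell, is a design choice here: it surfaces gaps in a source dataset before anything is frozen into the benchmark.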

Subtask:

  • Extract all the test datasets
  • Download the images from S3 and zip them
  • Create a uniform data structure
  • Convert to Parquet and upload to HF
  • Write proper documentation for the dataset in the HF repo
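
The zip and upload subtasks might look something like the sketch below. It assumes the images have already been downloaded from S3 into a local directory; `zip_images` and `upload_to_hf` are hypothetical helper names, the list of image suffixes is a guess, and the upload step assumes `huggingface_hub` is installed and a write token is configured.

```python
import zipfile
from pathlib import Path

# Assumed image types; adjust to whatever the source datasets actually contain.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def zip_images(image_dir, zip_path):
    """Zip every image under image_dir, storing paths relative to image_dir."""
    image_dir = Path(image_dir)
    images = sorted(
        p for p in image_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    # ZIP_STORED skips recompression, since JPEG and PNG are already compressed.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for p in images:
            zf.write(p, arcname=p.relative_to(image_dir).as_posix())
    return len(images)


def upload_to_hf(local_path, repo_id, path_in_repo):
    """Upload one file (zip or Parquet) to an existing HF dataset repo."""
    from huggingface_hub import HfApi  # lazy import: only needed for the upload

    HfApi().upload_file(
        path_or_fileobj=str(local_path),
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
```

`zip_images` returns the number of files archived, which can be compared against the row count of the metadata file as a quick consistency check before uploading.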

Completion Criteria:
An HF dataset repo is created that contains all the test datasets.

Card Reviewer:

@10kalden 10kalden self-assigned this Jan 2, 2025
@ta4tsering
Contributor

Use this repo for the work: https://github.com/OpenPecha/OCR_data_distribution. It already has scripts for some of the things you need to do. Don't create a new repo every time there is a new card.

@10kalden 10kalden transferred this issue from OpenPecha/hf-line-segmentation Jan 6, 2025
@ta4tsering
Contributor

You need to upload the zip file with all the images to the HF repo as well. You also need to include more metadata than you have listed above: the name of the image batch group (such as Lhasa Kanjur, Google Books, or Norbuketaka), the work ID of the BDRC book if it is available, the writing style, and the print method.
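
One way to pin down the extended metadata is a single record type shared by every source dataset. The sketch below is only an illustration: the class name `BenchmarkRecord`, the helper `to_row`, and the field names `batch_group`, `bdrc_work_id`, and `writing_style` are assumptions, not names taken from the existing scripts.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkRecord:
    """One test image plus the metadata requested in this thread."""
    id: str
    image_url: str
    image_label: str
    script: str
    print_method: str
    writing_style: str
    batch_group: str                    # e.g. "Lhasa Kanjur", "Google Books"
    bdrc_work_id: Optional[str] = None  # only set when a BDRC work ID exists


def to_row(record):
    """Flatten a record into a dict suitable for a CSV or Parquet row."""
    row = asdict(record)
    # Write a missing work ID as an empty string so the column stays string-typed.
    row["bdrc_work_id"] = row["bdrc_work_id"] or ""
    return row
```

Keeping `bdrc_work_id` optional reflects the "if it is available" condition, while every other field is required so that incomplete rows fail at construction time.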

@10kalden
Contributor Author

10kalden commented Jan 8, 2025

@10kalden 10kalden moved this from IN PROGRESS to TESTING in OCR Dev Jan 10, 2025
@ta4tsering ta4tsering moved this from TESTING to IN PROGRESS in OCR Dev Jan 13, 2025
@ta4tsering
Contributor

Add two more datasets to the repo: NorbuektakaNumbers and KhentseWangpo.

@kaldan007 kaldan007 moved this from IN PROGRESS to DONE in OCR Dev Jan 14, 2025
Projects
Status: DONE