Description:
We need to consolidate all the test datasets from the data distribution of all the models in the OpenPecha hf dataset into a benchmark hf repo.
Implementation:
Extract all the test data from the repo, download the images, and create a CSV with metadata such as id, image_url, image_label, print_method, script, etc. (upload the required images to S3 and create URLs).
Convert into Parquet.
Upload all the test datasets to a single repo in the OpenPecha HF dataset.
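The metadata-building step above could be sketched roughly as follows. The bucket name, key prefix, URL scheme, and field names are assumptions for illustration, not the project's actual conventions:

```python
import csv

# Hypothetical S3 bucket and key prefix -- adjust to the actual layout.
S3_BUCKET = "example-ocr-benchmark"
S3_PREFIX = "test-images"

def s3_url(image_id: str) -> str:
    """Build a public S3 URL for an uploaded image (assumed URL scheme)."""
    return f"https://{S3_BUCKET}.s3.amazonaws.com/{S3_PREFIX}/{image_id}.png"

def build_metadata_rows(records):
    """Flatten raw test-set records into metadata rows for the CSV."""
    rows = []
    for rec in records:
        rows.append({
            "id": rec["id"],
            "image_url": s3_url(rec["id"]),
            "image_label": rec["label"],
            "print_method": rec.get("print_method", ""),
            "script": rec.get("script", ""),
        })
    return rows

def write_csv(rows, path="metadata.csv"):
    """Write the metadata rows to a CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

The actual image download/upload (e.g. with `boto3`) is omitted here; only the row-building logic is shown.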
Subtask:
Extract all the test datasets
Download the images from S3 and zip them
Create a uniform data structure
Convert to Parquet and upload to HF
Write proper documentation for the dataset in the HF repo
Completion Criteria:
Create an HF dataset repo with all the test datasets.
Use this repo for the work: https://github.com/OpenPecha/OCR_data_distribution. Scripts for some of the needed steps are already written there. Don't create a new repo every time there is a new card.
10kalden transferred this issue from OpenPecha/hf-line-segmentation on Jan 6, 2025
You need to upload the zip file with all the images to the HF repo as well, and you also need to include more metadata than mentioned above: the name of the image batch group (such as Lhasa Kanjur, Google Books, or Norbuketaka), the work ID of the BDRC book if that is available, and also the writing style and print method.
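The extended metadata requested in this comment could be modeled as a row schema like the following. The field names, the example BDRC work ID, and the example values are illustrative assumptions, not the repo's actual column names:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class LineImageMetadata:
    """One row of the benchmark metadata (illustrative field names)."""
    id: str
    image_url: str
    image_label: str
    batch_group: str             # e.g. "lhasa_kanjur", "google_books", "norbuketaka"
    bdrc_work_id: Optional[str]  # BDRC work ID of the source book, if available
    writing_style: str
    print_method: str

row = LineImageMetadata(
    id="img001",
    image_url="https://example.s3.amazonaws.com/img001.png",  # placeholder URL
    image_label="om mani",
    batch_group="lhasa_kanjur",
    bdrc_work_id="W0000000",  # hypothetical BDRC work ID
    writing_style="uchen",
    print_method="woodblock",
)
```

Keeping the schema in one dataclass makes it easy to validate every batch against the same column set before the Parquet conversion.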