OCR0077: Freeze all the test dataset from all the training data hugging face repo into a benchmark HF repo. #8

10kalden · 2025-01-02T06:25:56Z

Description:
We need to consolidate the test datasets from the data distributions of all the models in the OpenPecha Hugging Face (HF) datasets into a single benchmark HF repo.

Implementation:

Extract all the test data from the repos, download the images, and create a CSV with metadata such as id, image_url, image_label, print_method, script, etc. (upload the required images to S3 and create the URLs).
Convert the CSV into Parquet.
Upload all the test datasets to a single repo under the OpenPecha HF datasets.
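
A minimal sketch of the CSV and Parquet steps above, not the scripts in the actual repo. The column names follow the card, but their exact spelling, the function names, and the file paths are assumptions for illustration. The Parquet step assumes pandas with pyarrow or fastparquet installed.

```python
import csv
from pathlib import Path

# Column names taken from the card; their exact spelling in the real CSV is an assumption.
FIELDS = ["id", "image_url", "image_label", "print_method", "script"]


def write_metadata_csv(rows, csv_path):
    """Write one CSV row per test image, keeping only the known columns."""
    csv_path = Path(csv_path)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for row in rows:
            # Missing values become empty strings so every row has the same shape;
            # keys outside FIELDS are dropped.
            writer.writerow({name: row.get(name, "") for name in FIELDS})
    return csv_path


def csv_to_parquet(csv_path, parquet_path):
    """Convert the CSV to Parquet (requires pandas plus pyarrow or fastparquet)."""
    import pandas as pd  # imported lazily so the CSV step works without pandas

    pd.read_csv(csv_path).to_parquet(parquet_path, index=False)
    return Path(parquet_path)
```

Keeping the column list in one place means every source dataset ends up with the same header, which is what makes the later merge into a single repo straightforward.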

Subtask:

Extract all the test datasets
Download the images from S3 and zip them
Create a uniform data structure
Convert to Parquet and upload to HF
Write proper documentation for the dataset in the HF repo
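
The zip and upload subtasks could look roughly like the sketch below. It assumes the images have already been downloaded to a local folder. The helper names, the list of image suffixes, and the arguments are hypothetical. The upload helper uses `HfApi.upload_file` from `huggingface_hub` and needs a token with write access.

```python
import zipfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}  # assumed image types


def zip_images(image_dir, zip_path):
    """Zip every image under image_dir and return the archived names."""
    image_dir = Path(image_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(image_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
                # Store paths relative to image_dir so the archive is portable.
                arcname = path.relative_to(image_dir).as_posix()
                zf.write(path, arcname=arcname)
                names.append(arcname)
    return names


def upload_to_hub(local_path, repo_id, path_in_repo):
    """Upload one file to a Hugging Face dataset repo (needs huggingface_hub and a token)."""
    from huggingface_hub import HfApi  # lazy import: only needed for the upload step

    HfApi().upload_file(
        path_or_fileobj=str(local_path),
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
```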

Completion Criteria:
Create an HF dataset repo containing all the test datasets.

Card Reviewer:

@ta4tsering

ta4tsering · 2025-01-06T05:10:35Z

Use this repo to work on: https://github.com/OpenPecha/OCR_data_distribution. There are already scripts written for some of the things you need to do. Don't create a new repo every time there is a new card.

ta4tsering · 2025-01-06T06:16:45Z

You also need to upload the zip file containing all the images to the HF repo, and you need to include more metadata than you mentioned above: the name of the image batch group (such as lhasa kanjur, google books, or norbuketaka), the work ID of the BDRC book if that is available, and the writing style and print method.
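
One possible shape for a uniform record carrying the extra metadata requested here, as a sketch only. The field names `batch_group`, `writing_style`, and `bdrc_work_id` are assumptions, not the columns actually used in the benchmark repo.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BenchmarkRecord:
    id: str
    image_url: str
    image_label: str
    batch_group: str                    # e.g. "lhasa kanjur", "google books", "norbuketaka"
    writing_style: str = ""
    print_method: str = ""
    bdrc_work_id: Optional[str] = None  # set only when a BDRC work ID is available


def to_record(row, batch_group):
    """Normalize one source row (a dict) into the shared record shape."""
    # Treat missing, empty, or whitespace-only work IDs as "not available".
    work_id = (row.get("bdrc_work_id") or "").strip() or None
    return BenchmarkRecord(
        id=str(row["id"]),
        image_url=row["image_url"],
        image_label=row["image_label"],
        batch_group=batch_group,
        writing_style=row.get("writing_style", ""),
        print_method=row.get("print_method", ""),
        bdrc_work_id=work_id,
    )
```

`asdict(record)` turns a record back into a plain dict, which is convenient when writing the rows out to CSV or Parquet.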

10kalden · 2025-01-08T06:21:52Z

ocr-benchmark-dataset
https://huggingface.co/datasets/openpecha/OCR-Tibetan_line_to_text_benchmark

ta4tsering · 2025-01-13T06:33:56Z

Add two more datasets to the repo:
NorbuektakaNumbers and KhentseWangpo

10kalden self-assigned this Jan 2, 2025

10kalden transferred this issue from OpenPecha/hf-line-segmentation Jan 6, 2025

10kalden moved this from IN PROGRESS to TESTING in OCR Dev Jan 10, 2025

ta4tsering moved this from TESTING to IN PROGRESS in OCR Dev Jan 13, 2025

kaldan007 moved this from IN PROGRESS to DONE in OCR Dev Jan 14, 2025
