OCR0065: Create repo on Hugging face for all the datasets we have of OCR #4

ta4tsering · 2024-11-27T05:28:14Z

Description:
We currently have a lot of annotated OCR data. The images are stored on S3, each image as a single object, and the CSV files are on S3 as well. To make the datasets easy to use and access, I will create zip files containing all the images and upload them to the Hugging Face repo together with the transcriptions, using the same data split (data distribution) that Eric used.
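A minimal sketch of the packaging step described above, not the script actually used for this card. It assumes the images have already been downloaded from S3 into a local folder; the function names, the image suffixes, and the flat repo layout are hypothetical. The upload helper relies on `huggingface_hub` (`HfApi.create_repo` and `HfApi.upload_file`) and needs network access and a write token.

```python
import zipfile
from pathlib import Path


def zip_images(image_dir: Path, zip_path: Path, suffixes=(".jpg", ".png")) -> list[str]:
    """Write every image under image_dir into one zip and return the archived names."""
    names = []
    # ZIP_STORED: JPEG/PNG files are already compressed, so deflating gains little.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(image_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() in suffixes:
                arcname = path.relative_to(image_dir).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names


def upload_to_hub(repo_id: str, files: list[Path]) -> None:
    """Upload local files (zip, transcription CSVs) to a Hugging Face dataset repo."""
    # Imported lazily so the zip step above runs without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    for f in files:
        api.upload_file(
            path_or_fileobj=str(f),
            path_in_repo=f.name,  # assumption: files sit at the repo root
            repo_id=repo_id,
            repo_type="dataset",
        )
```

A call would look like `zip_images(Path("images"), Path("images.zip"))` followed by `upload_to_hub("<org>/<dataset>", [Path("images.zip"), Path("train.csv")])`, where the file names are placeholders.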

Completion Criteria:

All the Tibetan OCR data is uploaded to OpenPecha on Hugging Face.

Subtasks:

Note:
For Norbuketaka and Google Books, we already have a Hugging Face repo, but without the data distributions. I am using that existing repo to create the new repo under OpenPecha on Hugging Face, with the data distribution but without the zipped image file.

Card Reviewer:

@10kalden

ta4tsering · 2024-11-28T06:33:16Z

https://huggingface.co/datasets/openpecha/OCR-Lhasakanjur

kaldan007 · 2024-11-29T06:39:16Z

@ta4tsering kindly reach out to @gangagyatso4364 regarding how to combine the Uchen datasets into one Hugging Face dataset.

kaldan007 · 2024-12-05T06:38:37Z

Kindly add the URL of the dataset.

ta4tsering · 2024-12-11T05:30:13Z

https://huggingface.co/datasets/openpecha/OCR-Durtsa
https://huggingface.co/datasets/openpecha/OCR-Betsug
https://huggingface.co/datasets/openpecha/OCR-Lithangkanjur
https://huggingface.co/datasets/openpecha/OCR-Google_Books

ta4tsering · 2024-12-11T09:29:53Z

https://huggingface.co/datasets/openpecha/OCR-Norbuketaka
Above is the Norbuketaka Hugging Face repo; below is the list of images missing from the data:
norbuketaka_missing.txt
Below that is the data distribution used by Eric:
norbuketaka data distribution
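For reference, a list of missing images like the one above can be produced by comparing the transcription CSV against the contents of the image zip. This is a sketch under assumptions, not the script used here: the column name `filename` and the function name are hypothetical, and images are matched by base name only.

```python
import csv
import zipfile
from pathlib import Path


def find_missing_images(csv_path, zip_path, filename_column="filename"):
    """Return image names listed in the CSV that are absent from the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        # Compare on base names so a folder prefix inside the zip does not matter.
        present = {Path(name).name for name in zf.namelist()}
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            name = row[filename_column]
            if name not in present:
                missing.append(name)
    return missing
```

Writing the result out with `Path("missing.txt").write_text("\n".join(missing))` gives one missing file name per line.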

ta4tsering · 2024-12-11T09:30:17Z

I wasn't able to do this for the Derge Tenjur. It is taking far too long to fix the issue: the images aren't present in the zip, because the zip gets corrupted when downloaded from Hugging Face, and since the data is not on S3, I can't use the URLs either.
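One way to confirm whether a downloaded zip is damaged is to check it with Python's standard `zipfile` module before trying to extract anything. This is only a diagnostic sketch (the function name is hypothetical) and does not fix a damaged archive.

```python
import zipfile


def zip_is_intact(zip_path) -> bool:
    """Return True if the file is a readable zip and every member passes its CRC check."""
    if not zipfile.is_zipfile(zip_path):
        return False
    try:
        with zipfile.ZipFile(zip_path) as zf:
            # testzip() returns the name of the first bad member, or None if all are fine.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False
```

If this returns `False` for a freshly downloaded file, downloading it again or comparing file sizes with the copy on the hub can help tell a failed transfer from an archive that was already broken at upload time.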

ta4tsering added this to OCR Dev Nov 27, 2024

ta4tsering self-assigned this Nov 27, 2024

ta4tsering converted this from a draft issue Nov 27, 2024

kaldan007 moved this from IN PROGRESS to DONE in OCR Dev Dec 12, 2024
