OCR0065: Create repo on Hugging face for all the datasets we have of OCR #4

Open
8 of 9 tasks
ta4tsering opened this issue Nov 27, 2024 · 6 comments

ta4tsering commented Nov 27, 2024

Description:
We currently have a lot of annotated OCR data. All of the images are on S3, each stored as a single object, and the CSV files are on S3 as well. To make the datasets easier to use and access, I will create zip files containing all the images and upload them to the Hugging Face repo together with the transcriptions, using the same data split (data distribution) that Eric used.
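
The zip-and-upload step described above could look roughly like the sketch below. It is not the script actually used for this task. The list of image suffixes and the helper names `zip_images` and `upload_to_hub` are assumptions made for illustration. The upload helper relies on `HfApi.create_repo` and `HfApi.upload_file` from the `huggingface_hub` package and needs a valid access token.

```python
import zipfile
from pathlib import Path

# Assumed image types; adjust to whatever the S3 objects actually are.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def zip_images(image_dir, zip_path):
    """Bundle every image under image_dir into one zip and return the count."""
    image_dir = Path(image_dir)
    count = 0
    # ZIP_STORED skips recompression, since formats like JPEG are already compressed.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path in sorted(image_dir.rglob("*")):
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
                # Store paths relative to image_dir so the archive is portable.
                archive.write(path, arcname=path.relative_to(image_dir).as_posix())
                count += 1
    return count


def upload_to_hub(local_path, repo_id, path_in_repo):
    """Upload one file to a Hugging Face dataset repo (hypothetical helper)."""
    # Imported lazily so that zipping works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=str(local_path),
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
```

A typical call would be `zip_images("images/", "images.zip")` followed by `upload_to_hub("images.zip", repo_id, "images.zip")`, where `repo_id` is the target dataset repo.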

Completion Criteria:

All the Tibetan OCR data is uploaded to Openpecha Hugging Face.

Subtasks:

  • Lhasa Kanjur
  • Lithang Kanjur
  • Derge Tenjur
  • Norbuketaka
  • Google Books
  • Betsug data
  • Durtsa data
  • Update the script for the special case of the Google Books and Norbuketaka data

Note:
For Norbuketaka and Google Books, we already have a Hugging Face repo, but it does not include the data distribution. I am using that existing repo to create a new repo on Openpecha Hugging Face that includes the data distribution but not the zipped image file.
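
Attaching a data distribution to existing transcriptions can be sketched as follows. This is only an illustration under stated assumptions: the column names `filename` and `label`, the split names, and the helpers `group_by_split` and `write_split_csvs` are hypothetical. The real distribution file may be structured differently.

```python
import csv
from collections import defaultdict
from pathlib import Path


def group_by_split(rows, split_of, default="unassigned"):
    """Group transcription rows by the split assigned to each filename."""
    grouped = defaultdict(list)
    for row in rows:
        # Rows without an entry in the distribution go to the default bucket.
        grouped[split_of.get(row["filename"], default)].append(row)
    return dict(grouped)


def write_split_csvs(grouped, out_dir, fieldnames=("filename", "label")):
    """Write one CSV per split and return the written paths, sorted by split."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for split, rows in sorted(grouped.items()):
        path = out_dir / f"{split}.csv"
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=list(fieldnames), extrasaction="ignore"
            )
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written
```

Keeping an explicit `unassigned` bucket makes it easy to spot rows that the distribution does not cover, rather than silently dropping them.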

Card Reviewer:

@ta4tsering ta4tsering self-assigned this Nov 27, 2024
@ta4tsering ta4tsering converted this from a draft issue Nov 27, 2024

@kaldan007

@ta4tsering kindly reach out to @gangagyatso4364 regarding how to combine the Uchen datasets into one Hugging Face dataset.

@kaldan007

Kindly add the URL of the dataset.

@ta4tsering

https://huggingface.co/datasets/openpecha/OCR-Norbuketaka
Above is the Norbuketaka Hugging Face repo, and below is the list of images missing from the data.
norbuketaka_missing.txt
Below is the data distribution used by Eric.
norbuketaka data distribution
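
A list of missing images like the one attached can be produced by comparing the filenames referenced in the transcriptions with the images actually available. The sketch below assumes both sides are plain filename lists. `find_missing` and `write_missing_report` are hypothetical helpers, not the script used to generate the attachment.

```python
from pathlib import Path


def find_missing(expected_names, available_names):
    """Return the expected image names that are not available, sorted."""
    return sorted(set(expected_names) - set(available_names))


def write_missing_report(missing, report_path):
    """Write one missing filename per line and return the number written."""
    text = "".join(f"{name}\n" for name in missing)
    Path(report_path).write_text(text, encoding="utf-8")
    return len(missing)
```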


ta4tsering commented Dec 11, 2024

I wasn't able to finish the Derge Tenjur. Fixing the issue is taking far too long: the images aren't present in the zip because the archive is corrupted when downloaded from Hugging Face, and since the data is not on S3, I can't use the image URLs either.
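
A quick integrity check on a downloaded archive can help separate a broken download from an archive that was already incomplete when uploaded. The sketch below uses only the standard-library `zipfile` module. The helper name `check_zip` is an assumption, and the check does not diagnose the actual cause of the Derge Tenjur problem.

```python
import zipfile


def check_zip(zip_path):
    """Return (ok, detail) describing whether the archive looks intact."""
    # is_zipfile is False for truncated files and for non-zip content.
    if not zipfile.is_zipfile(zip_path):
        return False, "not a zip file"
    try:
        with zipfile.ZipFile(zip_path) as archive:
            # testzip reads every member and returns the first one that
            # fails its CRC or header check, or None if all are fine.
            bad_member = archive.testzip()
            if bad_member is not None:
                return False, f"corrupt member: {bad_member}"
            return True, f"{len(archive.namelist())} members"
    except zipfile.BadZipFile as error:
        return False, f"bad zip file: {error}"
```

Running the same check on the archive before upload and after download shows at which step the data is lost.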

@kaldan007 kaldan007 moved this from IN PROGRESS to DONE in OCR Dev Dec 12, 2024