Add initial rules and blocklist for Sakha language #180

Open
wants to merge 1 commit into
base: main


@gaydmi gaydmi commented Jan 9, 2023

This is an adaptation of the Belarusian rules to Sakha.

    How many sentences did you get at the end?

14786 sentences.

    How did you create the blocklist file?

As the dataset is quite small, the frequency threshold is set to 1. Mostly, the filtering was done
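For reviewers who want to reproduce a frequency-based blocklist, here is a minimal sketch. It is not the script used for this PR: the function name `build_blocklist`, the tokenization by `\w+`, and the reading of "threshold" as "block words seen at most this many times" are all assumptions made for illustration.

```python
import re
from collections import Counter


def build_blocklist(sentences, threshold=1):
    """Return the words whose corpus frequency is at most `threshold`.

    Hypothetical helper: the comparison (<= threshold) is an assumption,
    the actual tooling may define the cut-off differently.
    """
    counts = Counter()
    for sentence in sentences:
        # \w+ matches runs of Unicode word characters, so Cyrillic-based
        # Sakha words are kept whole instead of being split on letters.
        counts.update(re.findall(r"\w+", sentence.lower()))
    return sorted(word for word, n in counts.items() if n <= threshold)
```

With a threshold of 1, every word that appears only once in the corpus ends up in the blocklist, which is a strict setting and fits a small dataset where rare words are more likely to be typos or foreign words.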

    Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences and estimate the average error ratio and comment (or link their comment) in the PR.

I've sampled 300 sentences randomly and split them into 3 samples of 100 sentences each.
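The sampling step can be sketched roughly as follows. This is an illustration only: `split_review_samples` and the fixed `seed` are hypothetical, and the actual samples may have been drawn differently.

```python
import random


def split_review_samples(sentences, n_samples=3, sample_size=100, seed=0):
    """Draw n_samples * sample_size sentences without replacement and
    split them into equally sized, non-overlapping review samples."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    picked = rng.sample(sentences, n_samples * sample_size)
    return [
        picked[i * sample_size:(i + 1) * sample_size]
        for i in range(n_samples)
    ]
```

Drawing all 300 sentences in one call and slicing afterwards guarantees that no sentence lands in two reviewers' samples.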

As I'm not a native speaker of Sakha myself, I contacted some members of the Common Voice Sakha community.
The results can be found here:
Sample 1
Sample 2
Sample 3

So, the error rate is less than 5%.
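For reference, the average error ratio over the reviewed samples can be computed as in the sketch below. The helper `average_error_ratio` is hypothetical, and the per-sample error counts must come from the linked review sheets; none are reproduced here.

```python
def average_error_ratio(errors_per_sample, sample_size=100):
    """Average the per-sample error ratios.

    errors_per_sample: number of sentences flagged as wrong in each
    reviewed sample (to be read off the reviewers' sheets).
    """
    ratios = [errors / sample_size for errors in errors_per_sample]
    return sum(ratios) / len(ratios)
```

A result below 0.05 corresponds to the "less than 5%" figure stated above.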
