
Added Hindi language toml and wiki sample #89

Draft: wants to merge 10 commits into main

Conversation

@karthiksibm (Author)

How many sentences did you get at the end?
4500 lines of output.

How did you create the blacklist file?
Removed all characters from the English language.

Review
For review, please use the sample file wiki.hi.txt.

@MichaelKohler (Member) left a comment

Thanks. A few comments and suggestions:

  • Are there any Hindi script specific symbols we might not want? (I have no idea about Hindi)
  • Are there Hindi specific abbreviation patterns?
  • Did you check if some of the newer rules might be helpful such as even_symbols or replacements?
  • Did you run the blacklist generation script as referenced in the README? For other languages, disallowing less frequently used words greatly increased the quality, as we could remove rarely used foreign words and foreign names.
  • How many sentences did you get in total? I assume 4500 is just for the review?

Happy to help as much as I can, probably mostly on the technical side as I don't know Hindi at all.

@MichaelKohler (Member)

Also, can you remove the sample file and host it somewhere online? We ultimately do not want it as part of the source code here.

@karthiksibm (Author)

Thanks, Michael. Responses to your questions below.

  • Are there any Hindi script specific symbols we might not want? (I have no idea about Hindi)
    None. We want all Hindi symbols included.

  • Are there Hindi specific abbreviation patterns?
    Nothing different for Hindi.

  • Did you check if some of the newer rules might be helpful such as even_symbols or replacements?
    Thanks. Yes. I have now included a replacement rule. It replaces Hindi's sentence-ending "।" symbol (the danda, which marks the end of a Hindi sentence) with the standard period "." symbol. I need this replacer to run before the SentenceTokenizer in extractor.rs so that each piece of text is broken up into sentences correctly before the rules are checked; that is why there is a slight code change requested in extractor.rs at line 108. Can you please check if this is OK? (A sketch of the rule follows at the end of this comment.)

  • Did you run the blacklist generation script as referenced in the Readme? For other languages not allowing less often used words greatly increased the quality as we could remove less used foreign words and foreign names
    Thanks. Yes, I have included a long list of less frequently occurring words in the disallowed_words/hindi.txt file. There are about 150K words in this list.

  • How many sentences did you get in total? I assume 4500 is just for the review?
    We get around 90K sentences. This is with max_sentences_per_text=3. I also tried with max_sentences_per_text=50, which gives a roughly 10X larger set that also looks good. How do I make this possible via config?

Thanks.
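
As a rough illustration of the replacement rule described above, here is a minimal sketch of what the relevant entry in the Hindi rules file could look like. The field name and the [search, replace] pair format are assumptions based on this discussion, not taken from the PR's diff:

replacements = [
    # Sketch only: replace the Devanagari danda "।" (the Hindi full stop) with "."
    # so that the sentence tokenizer splits the text correctly. The replacement has
    # to run before sentence tokenization, which is why extractor.rs is touched too.
    ["।", "."],
]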

@MichaelKohler (Member)

Thanks for your answers!

Yes, I have included a long list of less frequently occurring words in the disallowed_words/hindi.txt file. There are about 150K words in this list.

Which limit did you choose?

We get around 90K sentences. This is with max_sentences_per_text=3. I also tried with max_sentences_per_text=50, which gives a roughly 10X larger set that also looks good. How do I make this possible via config?

A maximum of 3 sentences per article is a legal requirement; we can't go higher than that.

Can I also ask you to do the following, to make sure you can profit from the automatic sample extraction we just introduced?

  • Update your branch with the latest code from the master branch
  • Rename src/rules/hindi.toml to src/rules/hi.toml
  • Rename src/rules/disallowed_words/hindi.txt to src/rules/disallowed_words/hi.txt

Also note that the local command for extraction will now be:

cargo run -- extract -l hi -d path/to/files

Happy to answer any question you may have and thanks for your efforts!

I'll comment on the change in extractor.rs and some other things separately.

Review threads on src/rules/hindi.toml and src/extractor.rs (outdated, resolved).
@MichaelKohler (Member)

@karthiksibm can you please also have a look at the other comments I've made?

@karthiksibm (Author)

I've made the updates to hi.toml. Thanks for your comments.

@MichaelKohler (Member) left a comment

Thanks for the additions, this starts to look really good! I have two more comments.

Review threads on src/extractor.rs (outdated, resolved) and src/rules/hi.toml (resolved).
@MichaelKohler (Member) left a comment

Thanks for all the changes. This looks good to me from a technical perspective. Let's see if the error rate is good as well.

@MichaelKohler MichaelKohler removed their request for review March 12, 2020 18:21
@MichaelKohler (Member)

These numbers are a bit too high. @nukeador I forgot what the required minimum was; can you remind me?

Can you look at the sentences and see if you can

  • identify common words that could be added to the blacklist?
  • consider decreasing the minimum frequency for the blacklist?
  • find any other common wrong patterns that could be added to the rules? (you can also use the abbreviation pattern for other stuff, check the German (de) one for examples)

Thanks for your efforts!

@MichaelKohler (Member)

The error rate should be between 5% and 7%. Anything lower is of course great, but probably very hard to achieve.

@karthiksibm (Author)

Thanks @MichaelKohler. It looks like the output has too many complicated, long words, which makes the sentences hard to pronounce.

To filter out such long words, is there a parameter to set the max_characters per word or max_trimmed_length, like the opposite of min_characters or min_trimmed_length that we have?

Meanwhile, I will play around to try and catch them with a better set of blacklist words.

@MichaelKohler (Member)

To filter out such long words, is there a parameter to set the max_characters per word or max_trimmed_length, like the opposite of min_characters or min_trimmed_length that we have?

There is currently no such setting, but you could use a regex in the abbreviation_patterns section to filter those out. I'll have a quick look to see if I can come up with a regex.

@MichaelKohler (Member)

@karthiksibm it seems you can use \\w{5,50} in the abbreviation_patterns to exclude any words longer than 4 characters. Adjust the 5 to the shortest word length that should be excluded. It would be great if you could add a comment in the file explaining that; otherwise we might wonder in the future why that is :)

@MichaelKohler (Member)

If you merge the latest master into your branch, you can also use the other_patterns config rule to add that; then it's not so confusing, as that's not really an abbreviation pattern.
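
To illustrate the suggestion in the last two comments, here is a hedged sketch of how that filter and the requested explanatory comment might look in the rules file. The exact shape of other_patterns is an assumption; the regex is the one quoted above:

other_patterns = [
    # Not an abbreviation pattern: this regex matches any run of 5 or more word
    # characters, so sentences containing such long words are excluded.
    # Adjust the 5 to the shortest word length that should be excluded
    # (for example, 10 would exclude only words longer than 9 characters).
    "\\w{5,50}",
]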

@MichaelKohler (Member)

It looks like the output has too many complicated, long words, which makes the sentences hard to pronounce.

...but those words are still appearing with a high frequency? If not, increasing the minimum frequency for the blacklist might also be a way to go.

@karthiksibm (Author)

@MichaelKohler (Member)

Thanks, that looks better. How did you improve the blacklist? Which maximum frequency were you using before, and how much now? Also, does this PR include the latest list? I'm not seeing a recent commit.

I just saw the following sentence:

राज घाट, नई दिल्ली, में गांधी जी के स्मारक पर "देवनागरी में " हे राम " लिखा हुआ है.
(Roughly: at Raj Ghat in New Delhi, "Hey Ram" is written in Devanagari on Gandhi's memorial; note the odd number of quotation marks.)

You might want to have a look at the even_symbols config. With that this should be easy to catch.
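
For reference, a hedged sketch of that idea, assuming even_symbols takes a list of symbols that must appear an even number of times in a sentence:

even_symbols = [
    # Sketch only: reject sentences with an odd number of straight quotation marks,
    # which catches unbalanced quotes like in the example above.
    "\"",
]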

@nukeador commented May 8, 2020

@karthiksibm did you have the chance to check the last question here? Is this still just producing 4500 sentences?

Feel free to join our Matrix chat so we can support you in getting more sentences. My understanding is that the Hindi Wikipedia has 180K articles, so it's weird that you are only getting 90K sentences.

https://chat.mozilla.org/#/room/#common-voice-sentence-extractor:mozilla.org

@MichaelKohler (Member)

@karthiksibm did you have the chance to check the last question here? Is this still just producing 4500 sentences?

Check #89 (comment) where the answer is "We get around 90K sentences."

However, the following questions should still be answered before we proceed here. I'm mostly worried about not having a recent commit for the blacklist change.

How did you improve the blacklist? Which maximum frequency were you using before, and how much now? Also, does this PR include the latest list? I'm not seeing a recent commit.

@karthiksibm (Author)

@MichaelKohler sorry I got busy with other projects. I'll get back quickly with the answer to your question.

@karthiksibm (Author)

@MichaelKohler I checked in the latest rules and the blacklist file. The blacklist was generated with a frequency threshold of 50, and it also includes words longer than 9 characters. That resulted in improved readability.

@@ -0,0 +1,21 @@
min_trimmed_length = 3
min_word_count = 5
Member

Can you elaborate on the change from 1 to 5 here? What do sentences with fewer than 5 words look like? Would some of them be valid?

Author

Having a lower limit will result in a lot of sentences shorter than 5 words, which look like:
यह बड़ा है (this is big)
और कहाँ है (where else is this)
क्या मिलेगा (what will you get)
..and so on

These don't seem to be very useful and will dominate the resulting dataset. What do you think?

@nukeador

Smaller sentences are also OK if they make sense. How many are we losing when this is applied? What's the total with and without these sentences?

@MichaelKohler MichaelKohler marked this pull request as draft September 1, 2020 16:10
@MichaelKohler MichaelKohler changed the base branch from master to main October 27, 2020 17:42
@Oymate commented May 19, 2021

@MichaelKohler

@MichaelKohler (Member) left a comment

This PR will need to be updated to not have merge conflicts, and there are still open questions, such as nukeador's question around losing sentences. Additionally, it's worth investing a bit more time to bring down the error rate (preferably < 5%).
