
Added Hindi language toml and wiki sample #89

Draft: wants to merge 10 commits into main

Conversation

@karthiksibm (Author)

How many sentences did you get at the end?
4500 lines of output.

How did you create the blacklist file?
Removed all characters from the English language.

Review
For review, please use the sample file wiki.hi.txt.

@MichaelKohler (Member) left a comment

Thanks. A few comments and suggestions:

  • Are there any Hindi script specific symbols we might not want? (I have no idea about Hindi)
  • Are there Hindi specific abbreviation patterns?
  • Did you check if some of the newer rules might be helpful such as even_symbols or replacements?
  • Did you run the blacklist generation script as referenced in the README? For other languages, disallowing less frequently used words greatly increased the quality, as we could remove rarely used foreign words and foreign names.
  • How many sentences did you get in total? I assume 4500 is just for the review?

Happy to help as much as I can, probably mostly on the technical side as I don't know Hindi at all.

@MichaelKohler (Member)

Also, can you remove the sample file and host it somewhere online? We ultimately do not want it as part of the source code here.

@karthiksibm (Author)

Thanks, Michael. Responses to your questions below.

  • Are there any Hindi script specific symbols we might not want? (I have no idea about Hindi)
    None. We want all Hindi symbols included.

  • Are there Hindi specific abbreviation patterns?
    Nothing different for Hindi.

  • Did you check if some of the newer rules might be helpful such as even_symbols or replacements?
    Thanks. Yes. I have now included a replacement rule. It replaces Hindi's sentence-ending "।" symbol (the danda, which marks the end of a Hindi sentence) with the standard period "." symbol. I need this replacer to run before the SentenceTokenizer in extractor.rs so that each piece of text is broken up into sentences correctly before the rules are checked; that is why there is a slight code change requested in extractor.rs at line 108. Can you please check if this is OK? (A sketch of the rule follows at the end of this comment.)

  • Did you run the blacklist generation script as referenced in the Readme? For other languages not allowing less often used words greatly increased the quality as we could remove less used foreign words and foreign names
    Thanks. Yes, I have included a long list of less frequently occurring words in the disallowed_words/hindi.txt file. There are about 150K words in this list.

  • How many sentences did you get in total? I assume 4500 is just for the review?
    We get around 90K sentences. This is with max_sentences_per_text=3. I also tried with max_sentences_per_text=50, which gives a roughly 10X larger set that also looks good. How do I make this possible via config?

Thanks.
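
As a rough illustration of the replacement rule described above, here is a minimal sketch of what the relevant entry in the Hindi rules file could look like. The field name and the [search, replace] pair format are assumptions based on this discussion, not taken from the PR's diff:

replacements = [
    # Sketch only: replace the Devanagari danda "।" (the Hindi full stop) with "."
    # so that the sentence tokenizer splits the text correctly. The replacement has
    # to run before sentence tokenization, which is why extractor.rs is touched too.
    ["।", "."],
]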

@MichaelKohler (Member)

Thanks for your answers!

Yes, I have included a long list of less frequently occurring words in the disallowed_words/hindi.txt file. There are about 150K words in this list.

Which limit did you choose?

We get around 90K sentences. This is with max_sentences_per_text=3. I also tried with max_sentences_per_text=50, which gives a roughly 10X larger set that also looks good. How do I make this possible via config?

A maximum of 3 sentences per article is a legal requirement; we can't go higher than that.

Can I also ask you to do the following, to make sure you can profit from the automatic sample extraction we just introduced?

  • Update your branch with the latest code from the master branch
  • Rename src/rules/hindi.toml to src/rules/hi.toml
  • Rename src/rules/disallowed_words/hindi.txt to src/rules/disallowed_words/hi.txt

Also note that the local command for extraction will now be:

cargo run -- extract -l hi -d path/to/files

Happy to answer any question you may have and thanks for your efforts!

I'll comment on the change in extractor.rs and some other things separately.

Review threads on src/rules/hindi.toml and src/extractor.rs (outdated, resolved).
@MichaelKohler (Member)

@karthiksibm can you please also have a look at the other comments I've made?

@karthiksibm (Author)

I've made the updates to hi.toml. Thanks for your comments.

@MichaelKohler (Member) left a comment

Thanks for the additions, this starts to look really good! I have two more comments.

Review threads on src/extractor.rs (outdated, resolved) and src/rules/hi.toml (resolved).
@MichaelKohler (Member) left a comment

Thanks for all the changes. This looks good to me from a technical perspective. Let's see if the error rate is good as well.

@MichaelKohler MichaelKohler removed their request for review March 12, 2020 18:21
@MichaelKohler (Member)

These numbers are a bit too high. @nukeador I forgot what the required minimum was; can you remind me?

Can you look at the sentences and see if you can

  • identify common words that could be added to the blacklist?
  • consider decreasing the minimum frequency for the blacklist?
  • find any other common wrong patterns that could be added to the rules? (you can also use the abbreviation pattern for other stuff, check the German (de) one for examples)

Thanks for your efforts!

@MichaelKohler (Member)

The error rate should be between 5% and 7%. Anything lower is of course great, but probably very hard to achieve.

@karthiksibm (Author)

Thanks @MichaelKohler. It looks like the output has too many complicated, long words, which makes the sentences hard to pronounce.

To filter out such long words, is there a parameter to set the max_characters per word or max_trimmed_length, like the opposite of min_characters or min_trimmed_length that we have?

Meanwhile, I will play around to try and catch them with a better set of blacklist words.

@MichaelKohler (Member)

To filter out such long words, is there a parameter to set the max_characters per word or max_trimmed_length, like the opposite of min_characters or min_trimmed_length that we have?

There is currently no such setting, but you could use a regex in the abbreviation_patterns section to filter those out. I'll have a quick look to see if I can come up with a regex.

@MichaelKohler (Member)

@karthiksibm it seems you can use \\w{5,50} in the abbreviation_patterns to exclude any words longer than 4 characters. Adjust the 5 to the shortest word length that should be excluded. It would be great if you could add a comment in the file explaining that; otherwise we might wonder in the future why that is :)

@MichaelKohler (Member)

If you merge the latest master into your branch, you can also use the other_patterns config rule to add that; then it's not so confusing, as that's not really an abbreviation pattern.
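
To illustrate the suggestion in the last two comments, here is a hedged sketch of how that filter and the requested explanatory comment might look in the rules file. The exact shape of other_patterns is an assumption; the regex is the one quoted above:

other_patterns = [
    # Not an abbreviation pattern: this regex matches any run of 5 or more word
    # characters, so sentences containing such long words are excluded.
    # Adjust the 5 to the shortest word length that should be excluded
    # (for example, 10 would exclude only words longer than 9 characters).
    "\\w{5,50}",
]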

@MichaelKohler (Member)

It looks like the output has too many complicated, long words, which makes the sentences hard to pronounce.

...but those words are still appearing with a high frequency? If not, increasing the minimum frequency for the blacklist might also be a way to go.

@karthiksibm (Author)

@MichaelKohler (Member)

Thanks, that looks better. How did you improve the blacklist? Which maximum frequency were you using before, and how much now? Also, does this PR include the latest list? I'm not seeing a recent commit.

I just saw the following sentence:

राज घाट, नई दिल्ली, में गांधी जी के स्मारक पर "देवनागरी में " हे राम " लिखा हुआ है.
(Roughly: at Raj Ghat in New Delhi, "Hey Ram" is written in Devanagari on Gandhi's memorial; note the odd number of quotation marks.)

You might want to have a look at the even_symbols config. With that this should be easy to catch.
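
For reference, a hedged sketch of that idea, assuming even_symbols takes a list of symbols that must appear an even number of times in a sentence:

even_symbols = [
    # Sketch only: reject sentences with an odd number of straight quotation marks,
    # which catches unbalanced quotes like in the example above.
    "\"",
]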

@nukeador commented May 8, 2020

@karthiksibm did you have the chance to check the last question here? Is this still just producing 4500 sentences?

Feel free to join our Matrix chat so we can support you in getting more sentences. My understanding is that the Hindi Wikipedia has 180K articles, so it's weird that you are only getting 90K sentences.

https://chat.mozilla.org/#/room/#common-voice-sentence-extractor:mozilla.org

@MichaelKohler (Member)

@karthiksibm did you have the chance to check the last question here? Is this still just producing 4500 sentences?

Check #89 (comment) where the answer is "We get around 90K sentences."

However, the following questions should still be answered before we proceed here. I'm mostly worried about not having a recent commit for the blacklist change.

How did you improve the blacklist? Which maximum frequency were you using before, and how much now? Also, does this PR include the latest list? I'm not seeing a recent commit.

@karthiksibm (Author)

@MichaelKohler sorry I got busy with other projects. I'll get back quickly with the answer to your question.

@karthiksibm (Author)

@MichaelKohler I checked in the latest rules and the blacklist file. The blacklist was generated with a frequency threshold of 50, and it also includes words longer than 9 characters. That resulted in improved readability.

@@ -0,0 +1,21 @@
min_trimmed_length = 3
min_word_count = 5
Member

Can you elaborate on the change from 1 to 5 here? What do sentences with fewer than 5 words look like? Would some of them be valid?

Author

Having a lower limit will result in a lot of sentences shorter than 5 words, which look like:
यह बड़ा है (this is big)
और कहाँ है (where else is this)
क्या मिलेगा (what will you get)
..and so on

These don't seem to be very useful and will dominate the resulting dataset. What do you think?

@nukeador

Smaller sentences are also OK if they make sense. How many are we losing when this is applied? What's the total with and without these sentences?

@MichaelKohler MichaelKohler marked this pull request as draft September 1, 2020 16:10
@MichaelKohler MichaelKohler changed the base branch from master to main October 27, 2020 17:42
@Oymate commented May 19, 2021

@MichaelKohler

@MichaelKohler (Member) left a comment

This PR will need to be updated to not have merge conflicts, and there are still open questions, such as nukeador's question around losing sentences. Additionally, it's worth investing a bit more time to bring down the error rate (preferably < 5%).
