rules & disallowed words for Catalan #42

Merged
9 commits merged into common-voice:master on Aug 13, 2019
Conversation

jaumeortola
Contributor

The list of disallowed words has been created with the help of a dictionary.

@jaumeortola
Contributor Author

Results will be better if issue #41 is fixed.

@nukeador

nukeador commented Aug 5, 2019

Great, thanks for this.

  • Can you comment on how many sentences you are getting with these rules?
  • Did you check with at least a couple more Catalan speakers, sharing a few small samples (100-500 sentences), to estimate the error rate?
  • What's the estimated error rate?

Thanks!

@jaumeortola
Contributor Author

I get:

  • 483,756 sentences, including repetitions.
  • 397,817 sentences without repetitions.
  • I would discard around 8% of the sentences, checking with a spell checker and grammar checker. Most wrong sentences are due to incorrect sentence tokenization, proper nouns, minor typographical issues (unclosed quotation marks), and a few typos. Remember that the blacklist is not working as expected because of word tokenization (Disallowed words, boundaries, case #41); see the matching sketch below.

Using LanguageTool, I get 566,792 sentences properly checked, with a very low error rate.
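For reference, here is a minimal Python sketch of the boundary-aware, case-insensitive matching that #41 asks for. It is only an illustration: the file name, helper names, and token pattern are assumptions, not the project's actual code.

```python
import re

# Illustrative only: load a lower-cased blacklist (the file name is a placeholder).
def load_blacklist(path="blacklist.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# \w plus the middle dot covers Catalan words such as "col·legi".
WORD_RE = re.compile(r"[\w·]+")

def contains_disallowed(sentence, blacklist):
    # Compare whole tokens, case-insensitively, so "Casa" matches "casa"
    # but "casament" does not.
    return any(token.lower() in blacklist for token in WORD_RE.findall(sentence))
```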

@nukeador

nukeador commented Aug 5, 2019

  • Proper nouns are OK as long as people can pronounce them (basically, they don't contain letters, or combinations of letters, that you can't pronounce). Examples: William, Gregor, Google, Burger King, Jefferson...
  • Repetitions can be solved once Dupe detection #14 is fixed.
  • Unclosed quotation marks are something we can probably just filter out; what do you think?

Thanks for all the feedback!

@jaumeortola
Contributor Author

If you fix #14 and #41, the results can be acceptable.

If you are not going to fix the word tokenization mentioned in #41, tell me and I will rewrite the blacklist accordingly.

@jaumeortola
Contributor Author

Do not make wrong assumptions. Of the 5 proper nouns you mention (William, Gregor, Google, Burger King, Jefferson), 4 are already in the dictionary I'm using. So we are not talking about that kind of common foreign name; we are talking about unusual and frequently unpronounceable proper nouns.

@nukeador

nukeador commented Aug 6, 2019

Understood now. Can you file an issue to track the possible dictionary support?

That, together with #14 and #41, seems to define a clear path toward an improved script.

Do you want to merge this PR as it is now or do you want to wait until other improvements are applied?

@jaumeortola
Contributor Author

Don't merge this PR. I will wait.

@jaumeortola
Contributor Author

jaumeortola commented Aug 8, 2019

Please merge the pull request now. The rules and the blacklist are OK.

It is still necessary to remove duplicates or, even better, to avoid them during the selection of sentences.

We would like you to run the script and upload the resulting sentences as soon as possible. New sentences are very much needed by the people making recordings.

@nukeador

nukeador commented Aug 8, 2019

OK.

For the sake of the process, since we want to expand this to a lot of languages and we need to ensure peer review, can I ask you to point me to a place where a few samples of the output have been discussed with other native speakers, along with the estimated error rate?

Once we automate this process, we want to have a step where we make sure enough native speakers agree with the samples.

Thanks!

@jaumeortola
Contributor Author

The sentences are here:

  • With the Mozilla script: 473,366 sentences. There are 90,000 duplicates to be removed. I'll ask for opinions.
  • With LanguageTool: 566,792 sentences, without duplicates, extracted and checked with this script. We would be happier if these could be used.

@nukeador

nukeador commented Aug 8, 2019

I know, but as we discussed, in order to support everyone and comply with legal constraints we need to run the tools ourselves; that's why we are asking for support to improve the existing one.

Let me know about the feedback; we are still trying to figure out the best way to verify feedback from other communities in the future.

@jordis

jordis commented Aug 8, 2019

  • With the Mozilla script (https://github.com/Softcatala/ca-text-corpus/blob/master/data/wiki.ca-mozilla_script.txt): 473,366 sentences. There are 90,000 duplicates to be removed. I'll ask for opinions.

I read around a couple of hundred sentences, browsing through the file in no particular order.
Found issues (a rough filter sketch follows the list):

  • Duplicates should not be allowed in the collection process. Removing 90k duplicates from this file means we will end up with 90k fewer sentences than we could have if the script didn't allow duplicates.
  • Sentences ending in a semicolon (";") (31 sentences, duplicates included): in these sentences the semicolon ending seems to be a mistake, and the right ending punctuation should be a colon. I think the script should filter out sentences ending with a semicolon during the collection process. Actually, I would filter out sentences that contain a semicolon anywhere.
  • Sentences containing a colon (":"): there are ~3,000 sentences with a colon in the middle. Some of them are too schematic and do not reflect real speech. Even though they might be grammatically correct, I would filter them out in the collection process and collect other sentences that do not contain colons.
  • Sentences with periods in the middle of the sentence (including the ellipsis "...") should be left out of the collection.
  • Sentences ending in a comma (",") should be left out of the collection.
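Here is the rough filter sketch referenced above, in Python, purely to make the proposed collection-time rules concrete; the file name and function names are hypothetical, and this is not the project's actual extractor.

```python
def keep_sentence(sentence, seen):
    if sentence in seen:                          # refuse duplicates at collection time
        return False
    if ";" in sentence or ":" in sentence:        # semicolons or colons anywhere
        return False
    if sentence.endswith(","):                    # sentences ending in a comma
        return False
    if "…" in sentence or "." in sentence[:-1]:   # mid-sentence periods, incl. ellipses
        return False
    seen.add(sentence)
    return True

seen, kept = set(), []
with open("wiki.ca-mozilla_script.txt", encoding="utf-8") as src:  # placeholder path
    for line in src:
        s = line.strip()
        if s and keep_sentence(s, seen):
            kept.append(s)
```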

@xavivars

xavivars commented Aug 8, 2019

I also looked into the sentences, aiming for a more qualitative than quantitative analysis:

  • There is a significant number of sentences that include scientific terms (antròpiques, biocenosi, nucleòtids, isoprè, ...) that should be easy to pronounce but may be unknown to most speakers.
  • There are very few unknown proper names (Gwalior, Peshawar, Rohingya, Argyle, ...) that speakers may not know how to pronounce.

But in general, the sentences look good enough to import.

@nukeador

nukeador commented Aug 8, 2019

I would say both colons and scientific terms are OK. Most complex terms are usually caught by the blacklist. Where did you put the word-frequency limit when generating the list? For example, in Spanish we used 80 repetitions or fewer.
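As an illustration of that frequency-threshold approach (not the actual generator), a short Python sketch that blacklists every token appearing 80 times or fewer; the corpus path and output file are placeholders, and 80 is simply the threshold mentioned for Spanish.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w·]+")

counts = Counter()
with open("wiki.ca.txt", encoding="utf-8") as corpus:   # placeholder path
    for line in corpus:
        counts.update(token.lower() for token in WORD_RE.findall(line))

# Disallow everything that is too rare to trust (80 occurrences or fewer).
blacklist = sorted(word for word, count in counts.items() if count <= 80)

with open("blacklist.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(blacklist) + "\n")
```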

@jmontane

jmontane commented Aug 8, 2019

I read through a number of sentences and their quality is really good.

  • As pointed out, there are some unknown/foreign proper names, but these sentences can be removed after import.
  • There are some sentences with a wrongly encoded L·L. This issue should be fixed before import. AFAIK, @jaumeortola is working on it.

@nukeador

nukeador commented Aug 8, 2019

@MichaelKohler @Gregoor Once we merge this PR, can one of you own the extraction and the PR to voice-web? I'm going on PTO for a couple of weeks and I don't want to be a blocker.

@MichaelKohler
Member

I can't promise anything. I'm pretty busy next week because I'm moving, but the weekend of August 17th might be possible.

@jaumeortola
Contributor Author

Where did you put the word-frequency limit when generating the list? For example, in Spanish we used 80 repetitions or fewer.

I didn't use this procedure; I built the blacklist using a dictionary. I'm now trying dictionary + word frequencies.

@Gregoor
Contributor

Gregoor commented Aug 9, 2019

@MichaelKohler @Gregoor Once we merge this PR, can one of you own the extraction and the PR to voice-web? I'm going on PTO for a couple of weeks and I don't want to be a blocker.

I should be able to find time for this 👍

@nukeador

nukeador commented Aug 9, 2019

OK @jaumeortola, try the word-frequency approach we describe in the README to see if the results improve (you don't need the script to finish the whole process to generate a few random samples of 500 sentences).
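Drawing a few random review samples of 500 sentences each could look like the following Python sketch; the file names and the number of samples are arbitrary choices for illustration.

```python
import random

with open("extracted.ca.txt", encoding="utf-8") as f:   # placeholder path
    sentences = [line.strip() for line in f if line.strip()]

for i in range(3):                                      # three independent samples
    sample = random.sample(sentences, 500)
    with open(f"sample-{i}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(sample) + "\n")
```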

Once you and a few other native speakers are satisfied, please comment here with the estimated error rate, how many sentences you are getting, and how you generated the blacklist (for reference).

@Gregoor please merge once the above is done, run the scraping based on these rules, and move the sentences to voice-web. Thanks so much! :-)

@jaumeortola
Contributor Author

We are satisfied with the results now. So you can go ahead, @Gregoor.

We get 449,163 sentences without duplicates.
Around 4% of the sentences have a minor grammar error; we'll ask to remove them via a pull request.

The blacklist was generated using a language dictionary. A minimum frequency of 5 was set to disallow some proper nouns that are present in the dictionary but very rare.
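A hedged Python sketch of that dictionary-plus-frequency combination, only to make the rule concrete; `known_words`, `proper_nouns`, the paths, and the function name are stand-ins for whatever the dictionary tooling actually provides.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w·]+")

def build_blacklist(corpus_path, known_words, proper_nouns, min_freq=5):
    # known_words / proper_nouns: sets of lower-cased dictionary entries (assumed).
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            counts.update(token.lower() for token in WORD_RE.findall(line))

    blacklist = set()
    for word, count in counts.items():
        if word not in known_words:
            blacklist.add(word)                    # unknown to the dictionary
        elif word in proper_nouns and count < min_freq:
            blacklist.add(word)                    # rare proper noun, keep it out
    return sorted(blacklist)
```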

@Gregoor merged commit 1ffde0a into common-voice:master on Aug 13, 2019
@Gregoor
Contributor

Gregoor commented Aug 14, 2019

Just an update: it's still running on my old laptop; we really should find out why this is so slow 🙈
There was a syntax error in the config. I'm not sure if you just fixed it locally on your machine, @jaumeortola, or how it ran for you. After the run finishes on my machine, I'll push a commit that fixes it.

@nukeador

Thanks to everyone involved here. I'm happy to see that Catalan now has the wiki sentences merged. This is also really helpful for driving this effort in other languages :D
