rules & disallowed words for Catalan #42

Merged
9 commits merged into common-voice:master on Aug 13, 2019
Conversation

jaumeortola
Contributor

The list of disallowed words has been created with the help of a dictionary.

@jaumeortola
Contributor Author

Results will be better if issue #41 is fixed.

@nukeador

nukeador commented Aug 5, 2019

Great, thanks for this.

  • Can you comment on how many sentences you are getting with these rules?
  • Did you check with at least a couple more Catalan speakers, sharing a few small samples (100-500 sentences), to estimate the error rate?
  • What's the estimated error rate?

Thanks!

@jaumeortola
Contributor Author

I get:

  • 483,756 sentences, including repetitions.
  • 397,817 sentences without repetitions.
  • I would discard around 8% of the sentences, checking with a spell checker and grammar checker. Most wrong sentences are due to incorrect sentence tokenization, proper nouns, minor typographical issues (unclosed quotation marks), and a few typos. Remember that the blacklist is not working as expected because of word tokenization (Disallowed words, boundaries, case #41); see the matching sketch below.

Using LanguageTool, I get 566,792 sentences properly checked, with a very low error rate.
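For reference, here is a minimal Python sketch of the boundary-aware, case-insensitive matching that #41 asks for. It is only an illustration: the file name, helper names, and token pattern are assumptions, not the project's actual code.

```python
import re

# Illustrative only: load a lower-cased blacklist (the file name is a placeholder).
def load_blacklist(path="blacklist.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# \w plus the middle dot covers Catalan words such as "col·legi".
WORD_RE = re.compile(r"[\w·]+")

def contains_disallowed(sentence, blacklist):
    # Compare whole tokens, case-insensitively, so "Casa" matches "casa"
    # but "casament" does not.
    return any(token.lower() in blacklist for token in WORD_RE.findall(sentence))
```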

@nukeador

nukeador commented Aug 5, 2019

  • Proper nouns are OK as long as people can pronounce them (basically, they don't contain letters, or combinations of letters, that you can't pronounce). Examples: William, Gregor, Google, Burger King, Jefferson...
  • Repetitions can be solved once Dupe detection #14 is fixed.
  • Unclosed quotation marks are something we can probably just filter out; what do you think?

Thanks for all the feedback!

@jaumeortola
Contributor Author

If you fix #14 and #41, the results can be acceptable.

If you are not going to fix the word tokenization mentioned in #41, tell me and I will rewrite the blacklist accordingly.

@jaumeortola
Contributor Author

Do not make wrong assumptions. Of the 5 proper nouns you mention (William, Gregor, Google, Burger King, Jefferson), 4 are already in the dictionary I'm using. So we are not talking about that kind of common foreign name; we are talking about unusual and frequently unpronounceable proper nouns.

@nukeador

nukeador commented Aug 6, 2019

Understood now. Can you file an issue to track the possible dictionary support?

That, together with #14 and #41, seems to define a clear path toward an improved script.

Do you want to merge this PR as it is now or do you want to wait until other improvements are applied?

@jaumeortola
Contributor Author

Don't merge this PR. I will wait.

@jaumeortola
Contributor Author

jaumeortola commented Aug 8, 2019

Please merge the pull request now. The rules and the blacklist are OK.

It is still necessary to remove duplicates or, even better, to avoid them during the selection of sentences.

We would like you to run the script and upload the resulting sentences as soon as possible. New sentences are very much needed by the people making recordings.

@nukeador

nukeador commented Aug 8, 2019

OK.

For the sake of the process, since we want to expand this to a lot of languages and we need to ensure peer review, can I ask you to point me to a place where a few samples of the output have been discussed with other native speakers, along with the estimated error rate?

Once we automate this process, we want to have a step where we make sure enough native speakers agree with the samples.

Thanks!

@jaumeortola
Contributor Author

The sentences are here:

  • With the Mozilla script: 473,366 sentences. There are 90,000 duplicates to be removed. I'll ask for opinions.
  • With LanguageTool: 566,792 sentences, without duplicates, extracted and checked with this script. We would be happier if these could be used.

@nukeador

nukeador commented Aug 8, 2019

I know, but as we discussed, in order to support everyone and comply with legal constraints we need to run the tools ourselves; that's why we are asking for support to improve the existing one.

Let me know about the feedback; we are still trying to figure out the best way to verify feedback from other communities in the future.

@jordis

jordis commented Aug 8, 2019

  • With the Mozilla script (https://github.com/Softcatala/ca-text-corpus/blob/master/data/wiki.ca-mozilla_script.txt): 473,366 sentences. There are 90,000 duplicates to be removed. I'll ask for opinions.

I read around a couple of hundred sentences, browsing through the file in no particular order.
Found issues (a rough filter sketch follows the list):

  • Duplicates should not be allowed in the collection process. Removing 90k duplicates from this file means we will end up with 90k fewer sentences than we could have if the script didn't allow duplicates.
  • Sentences ending in a semicolon (";") (31 sentences, duplicates included): in these sentences the semicolon ending seems to be a mistake, and the right ending punctuation should be a colon. I think the script should filter out sentences ending with a semicolon during the collection process. Actually, I would filter out sentences that contain a semicolon anywhere.
  • Sentences containing a colon (":"): there are ~3,000 sentences with a colon in the middle. Some of them are too schematic and do not reflect real speech. Even though they might be grammatically correct, I would filter them out in the collection process and collect other sentences that do not contain colons.
  • Sentences with periods in the middle of the sentence (including the ellipsis "...") should be left out of the collection.
  • Sentences ending in a comma (",") should be left out of the collection.
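Here is the rough filter sketch referenced above, in Python, purely to make the proposed collection-time rules concrete; the file name and function names are hypothetical, and this is not the project's actual extractor.

```python
def keep_sentence(sentence, seen):
    if sentence in seen:                          # refuse duplicates at collection time
        return False
    if ";" in sentence or ":" in sentence:        # semicolons or colons anywhere
        return False
    if sentence.endswith(","):                    # sentences ending in a comma
        return False
    if "…" in sentence or "." in sentence[:-1]:   # mid-sentence periods, incl. ellipses
        return False
    seen.add(sentence)
    return True

seen, kept = set(), []
with open("wiki.ca-mozilla_script.txt", encoding="utf-8") as src:  # placeholder path
    for line in src:
        s = line.strip()
        if s and keep_sentence(s, seen):
            kept.append(s)
```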

@xavivars

xavivars commented Aug 8, 2019

I also looked into the sentences, aiming for a more qualitative than quantitative analysis:

  • There is a significant number of sentences that include scientific terms (antròpiques, biocenosi, nucleòtids, isoprè, ...) that should be easy to pronounce but may be unknown to most speakers.
  • There are very few unknown proper names (Gwalior, Peshawar, Rohingya, Argyle, ...) that speakers may not know how to pronounce.

But in general, the sentences look good enough to import.

@nukeador

nukeador commented Aug 8, 2019

I would say both colons and scientific terms are OK. Most complex terms are usually caught by the blacklist. Where did you put the word-frequency limit when generating the list? For example, in Spanish we used 80 repetitions or fewer.
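As an illustration of that frequency-threshold approach (not the actual generator), a short Python sketch that blacklists every token appearing 80 times or fewer; the corpus path and output file are placeholders, and 80 is simply the threshold mentioned for Spanish.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w·]+")

counts = Counter()
with open("wiki.ca.txt", encoding="utf-8") as corpus:   # placeholder path
    for line in corpus:
        counts.update(token.lower() for token in WORD_RE.findall(line))

# Disallow everything that is too rare to trust (80 occurrences or fewer).
blacklist = sorted(word for word, count in counts.items() if count <= 80)

with open("blacklist.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(blacklist) + "\n")
```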

@jmontane

jmontane commented Aug 8, 2019

I read through a number of sentences and their quality is really good.

  • As pointed out, there are some unknown/foreign proper names, but these sentences can be removed after import.
  • There are some sentences with a wrongly encoded L·L. This issue should be fixed before import. AFAIK, @jaumeortola is working on it.

@nukeador

nukeador commented Aug 8, 2019

@MichaelKohler @Gregoor Once we merge this PR, can one of you own the extraction and the PR to voice-web? I'm going on PTO for a couple of weeks and I don't want to be a blocker.

@MichaelKohler
Member

I can't promise anything. I'm pretty busy next week because I'm moving, but the weekend of August 17th might be possible.

@jaumeortola
Contributor Author

Where did you put the word-frequency limit when generating the list? For example, in Spanish we used 80 repetitions or fewer.

I didn't use this procedure; I built the blacklist using a dictionary. I'm now trying dictionary + word frequencies.

@Gregoor
Contributor

Gregoor commented Aug 9, 2019

@MichaelKohler @Gregoor Once we merge this PR, can one of you own the extraction and the PR to voice-web? I'm going on PTO for a couple of weeks and I don't want to be a blocker.

I should be able to find time for this 👍

@nukeador

nukeador commented Aug 9, 2019

OK @jaumeortola, try the word-frequency approach we describe in the README to see if the results improve (you don't need the script to finish the whole process to generate a few random samples of 500 sentences).
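Drawing a few random review samples of 500 sentences each could look like the following Python sketch; the file names and the number of samples are arbitrary choices for illustration.

```python
import random

with open("extracted.ca.txt", encoding="utf-8") as f:   # placeholder path
    sentences = [line.strip() for line in f if line.strip()]

for i in range(3):                                      # three independent samples
    sample = random.sample(sentences, 500)
    with open(f"sample-{i}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(sample) + "\n")
```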

Once you and a few other native speakers are satisfied, please comment here with the estimated error rate, how many sentences you are getting, and how you generated the blacklist (for reference).

@Gregoor please merge once the above is done, run the scraping based on these rules, and move the sentences to voice-web. Thanks so much! :-)

@jaumeortola
Contributor Author

We are satisfied with the results now. So you can go ahead, @Gregoor.

We get 449,163 sentences without duplicates.
Around 4% of the sentences have a minor grammar error; we'll ask to remove them via a pull request.

The blacklist was generated using a language dictionary. A minimum frequency of 5 was set to disallow some proper nouns that are present in the dictionary but very rare.
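A hedged Python sketch of that dictionary-plus-frequency combination, only to make the rule concrete; `known_words`, `proper_nouns`, the paths, and the function name are stand-ins for whatever the dictionary tooling actually provides.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w·]+")

def build_blacklist(corpus_path, known_words, proper_nouns, min_freq=5):
    # known_words / proper_nouns: sets of lower-cased dictionary entries (assumed).
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            counts.update(token.lower() for token in WORD_RE.findall(line))

    blacklist = set()
    for word, count in counts.items():
        if word not in known_words:
            blacklist.add(word)                    # unknown to the dictionary
        elif word in proper_nouns and count < min_freq:
            blacklist.add(word)                    # rare proper noun, keep it out
    return sorted(blacklist)
```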

@Gregoor merged commit 1ffde0a into common-voice:master on Aug 13, 2019
@Gregoor
Contributor

Gregoor commented Aug 14, 2019

Just an update: it's still running on my old laptop; we really should find out why this is so slow 🙈
There was a syntax error in the config. I'm not sure if you just fixed it locally on your machine, @jaumeortola, or how it ran for you. After the run finishes on my machine, I'll push a commit that fixes it.

@nukeador

Thanks to everyone involved here. I'm happy to see that Catalan now has the wiki sentences merged. This is also really helpful for driving this effort in other languages :D
