Copy cs to sk for prototyping #132

Draft
wants to merge 2 commits into
base: main
from

Conversation

Adrijaned

Mainly to get the extraction running and to get an idea of how much more work will need to be done.

@@ -0,0 +1,17 @@
allowed_symbols_regex="[A-Za-zěščřžýáíéóďťňúůĚŠČŘŽÝÁÍÉÓĎŤŇäöüÚ‚–\\. \"„“]"

allowed_symbols_regex="[A-Za-zěščŕřžýáíéóôďťňúůĺľÁÄĚŠČŔŘŽÝÁÍÉÓÔĎŤŇĹĽäöüÚ‚–\. "„“]"

needs_uppercase_start = true
even_symbols = ["\""]
broken_whitespace = ["  ", " ,", " .", " ?", " !", " ;"]
abbreviation_patterns = ["[A-ZĚŠČŘŽÝÁÍÉĎŤŇÓÚ]+\\.*[a-z]*[A-ZĚŠČŘŽÝÁÍÉĎŤŇÓÚ]+", "atd\\.", "\\baj\\.", "tj\\.", "\\brec\\.", "[nN]apř\\.",

abbreviation_patterns = ["[A-ZĹĽĚŠČŔŘŽÝÁÍÉĎŤŇÓÔÚ]+\.[a-z][A-ZĹĽĚŠČŔŘŽÝÁÍÉĎŤŇÓÔÚ]+", "a i\.", "a pod\.", "atď\.", "\baj\.", "tj\.", "\brec\.", "[nN]apr\.",
""."", "\s[^aikosuvzáó]\s", "zkr\.", "[Tt]zv\.", "[dD]r\.", "\b[aAeE]d\.", "\b[sS]?[tT]r\.", "[aA]rch\.", "Inc\.", "Ltd\.", "[pP]opr\.",
"\b[fF]r\.", "\b[A-Z]+DR\b", "[pP]ozn\.", "[sS]rov\.", "\b[eE][a-z]\.", "[zZ]ejm\.", "[JS]r\.", "\b[lL][lL]",
"Mgr\.", "[mM]j\.", "\b[sS]tol\.", "\b[pP]ol\.", "Ing\.", "[cCkK]pt\.", "\b[lL]t\.", "Mr?s?\.", "\s[^\\s]{1,2}\.", "\bviz\.", "\b[sS]at\."]
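
For readers unfamiliar with rule files like this one, here is a minimal sketch of how such rules could be applied to a candidate sentence. It is not the extractor's actual implementation: the function name `is_valid` is hypothetical, only a small subset of the Czech patterns quoted above is used (with the TOML escaping already decoded), and it assumes that a sentence matching any abbreviation pattern is rejected.

```python
import re

# Subset of the Czech rules quoted above, TOML escaping already decoded.
ALLOWED_SYMBOLS = re.compile(r'[A-Za-zěščřžýáíéóďťňúůĚŠČŘŽÝÁÍÉÓĎŤŇäöüÚ‚–\. "„“]')
NEEDS_UPPERCASE_START = True
EVEN_SYMBOLS = ['"']
# The first broken_whitespace entry is left out here on purpose.
BROKEN_WHITESPACE = [" ,", " .", " ?", " !", " ;"]
# Assumption: a match on any of these patterns disqualifies the sentence.
ABBREVIATION_PATTERNS = [
    re.compile(p) for p in (r"atd\.", r"\baj\.", r"tj\.", r"[nN]apř\.")
]


def is_valid(sentence: str) -> bool:
    """Return True if the sentence passes every rule in this sketch."""
    # Rule: the sentence must start with an uppercase letter.
    if NEEDS_UPPERCASE_START and not sentence[:1].isupper():
        return False
    # Rule: every character must be in the allowed symbol set.
    if any(ALLOWED_SYMBOLS.fullmatch(ch) is None for ch in sentence):
        return False
    # Rule: symbols such as quotes must occur an even number of times.
    if any(sentence.count(sym) % 2 != 0 for sym in EVEN_SYMBOLS):
        return False
    # Rule: no whitespace directly before punctuation.
    if any(bad in sentence for bad in BROKEN_WHITESPACE):
        return False
    # Rule: no known abbreviations.
    if any(p.search(sentence) for p in ABBREVIATION_PATTERNS):
        return False
    return True


if __name__ == "__main__":
    for s in ["Dnes je hezky.", "Koupil chleba a mléko atd.", "Je to 5 korun."]:
        print(s, is_valid(s))
```

A Slovak variant would swap in the character classes and abbreviations from the suggestions above; the checks themselves would stay the same.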

@Adrijaned
Author

Blocklist generated from words with a frequency of 60 or lower.
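
A frequency-based blocklist of this kind can be sketched roughly as follows. This is an illustration under assumptions, not the script used for this PR: the function name `build_blocklist`, the lowercasing, and the `\w+` tokenization are all hypothetical choices.

```python
import re
from collections import Counter


def build_blocklist(sentences, max_frequency=60):
    """Return the sorted list of words that occur at most max_frequency times."""
    # Count every word across all sentences, case-insensitively.
    counts = Counter(
        word.lower()
        for sentence in sentences
        for word in re.findall(r"\w+", sentence)
    )
    # Rare words are the ones most likely to be typos, names, or foreign terms.
    return sorted(word for word, n in counts.items() if n <= max_frequency)


if __name__ == "__main__":
    sample = ["Pes a kočka", "Pes a myš", "Pes spí"]
    # With a threshold of 1, only words seen exactly once are blocked.
    print(build_blocklist(sample, max_frequency=1))
```

The threshold trades coverage for quality: a higher value blocks more unusual words but also discards more otherwise valid sentences.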

@Hrano

Hrano commented Dec 14, 2020

Downloaded and sent to five native speakers for review.
They enter the corrected version of each erroneous sentence in the second column next to it, in XLS format.
Will it be OK like this?

@MichaelKohler MichaelKohler marked this pull request as draft March 14, 2021 13:49
@MichaelKohler
Member

Sorry, I missed that comment.

> They enter the corrected version of each erroneous sentence in the second column next to it, in XLS format.
> Will it be OK like this?

No. We can't accept corrected sentences, because we need to run a new, fresh export once the rules are added. This is needed to make sure that we fulfil all legal requirements. As sentences are picked at random, any changes to them would be lost.
