[pt] Use new POS tagging schema #10375

Merged
merged 13 commits into from
Apr 8, 2024


susanaboatto
Collaborator

@susanaboatto susanaboatto commented Mar 7, 2024

⚠️ This change requires the portuguese-pos-dict dependency to be v0.13 or later!

tl;dr

This PR adds POS tags to the pt tagset for enclitic pronouns.

What changes

Let's use diz-me ('tell me') to illustrate the changes.

| was                          | will be                           |
|------------------------------|-----------------------------------|
| three tokens (diz, -, me)    | a single token (diz-me)           |
| each token tagged separately | tagged as dizer[VMM02S0:PP1CSO00] |
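
As a rough illustration of the new reading format, the sketch below splits a string such as dizer[VMM02S0:PP1CSO00] into its parts. It assumes, based only on the single example above, that the lemma comes first, that the bracketed part starts with the verb tag, and that each colon introduces one pronoun tag. The names `CliticReading` and `parse_reading` are hypothetical and not part of this PR.

```python
import re
from typing import NamedTuple, Tuple


class CliticReading(NamedTuple):
    """Hypothetical container for one verb+clitic reading."""
    lemma: str
    verb_tag: str
    pronoun_tags: Tuple[str, ...]


# Assumed shape: lemma[VERBTAG:PRONOUNTAG(:PRONOUNTAG...)]
_READING = re.compile(r"^(?P<lemma>[^\[\]]+)\[(?P<tags>[^\[\]]+)\]$")


def parse_reading(reading: str) -> CliticReading:
    """Split a reading into lemma, verb tag, and pronoun tag(s)."""
    match = _READING.match(reading)
    if match is None:
        raise ValueError(f"unexpected reading format: {reading!r}")
    # First colon-separated item is taken to be the verb tag (assumption).
    verb_tag, *pronoun_tags = match.group("tags").split(":")
    return CliticReading(match.group("lemma"), verb_tag, tuple(pronoun_tags))


print(parse_reading("dizer[VMM02S0:PP1CSO00]"))
```

A rule or script that only cares about the verb could then look at `verb_tag` alone and ignore the pronoun part.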

Consequences

  • dictionary suggestions consider whole word forms (e.g. ama-se <=> amasse);
  • forms that only appear in derived environments are no longer accepted by the dictionary in isolation (e.g. amá and lo from amá-lo should now both be flagged as spelling errors);
  • working with mesoclitics is made significantly easier (e.g. fá-lo-á doesn't require five tokens, three of which shouldn't exist in isolation!);
  • XML rules no longer require two extra (often optional) tokens for the hyphen and pronouns.

Also in this PR

  • tokenisation logic changes from defining breaking characters to defining word characters;
  • extensive changes to XML rules in all dialects to account for new tokenisation and tagging schema;
  • EnclisisFilter and ProclisisFilter to make synthesising verb forms with pronouns easier.
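
The tokenisation change in the first bullet can be sketched as two small functions. This is not the tokeniser from this PR; the character sets and the names `tokenize_breaking` and `tokenize_word_chars` are assumptions made for illustration only. The first function splits on a set of breaking characters that includes the hyphen; the second defines what a word may contain (letters, with hyphens allowed between letter runs) and treats everything else as a separate token.

```python
import re

# Old style (illustrative): list the characters that break a token.
# The hyphen is a breaking character, so it becomes its own token.
_BREAKING = re.compile(r"([\s\-.,;:!?])")


def tokenize_breaking(text):
    """Split on breaking characters, dropping whitespace-only pieces."""
    return [t for t in _BREAKING.split(text) if t and not t.isspace()]


# New style (illustrative): define what a word is made of.
# A word is a run of letters, optionally joined by internal hyphens;
# any other non-space character is a token on its own.
_WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*|\S")


def tokenize_word_chars(text):
    """Return word tokens (hyphenated forms kept whole) and punctuation."""
    return _WORD.findall(text)


print(tokenize_breaking("diz-me"))    # ['diz', '-', 'me']
print(tokenize_word_chars("diz-me"))  # ['diz-me']
```

Under the same assumptions, a mesoclitic form such as fá-lo-á yields five tokens with the first function and one with the second, which is the difference described under Consequences.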

@p-goulart p-goulart force-pushed the pt/dict/new_tokenisation branch from 4a9b030 to e26fb73 Compare April 2, 2024 12:45
@p-goulart p-goulart changed the title [pt] XML fixes [pt] Use new POS tagging schema Apr 2, 2024
@p-goulart
Collaborator

@marcoagpinto I'm pinging you here so you're aware of it.

After this PR goes through, you should be able to go back to editing rules and the dictionary freely. Thank you for your patience!

@marcoagpinto
Member

@p-goulart

Heya!

Thank you for letting me know.

I missed coding rules so much 😢 😢 😢 😢 😢

@ricardojosehlima
We are soon to be back in business 😛 😛 😛 😛

@p-goulart
Collaborator

@susanaboatto I hope you've had a chance to have a look. Since it is Friday, I want to wait until Monday to merge this to avoid surprises over the weekend.

@p-goulart p-goulart merged commit 4fe16df into master Apr 8, 2024
3 checks passed
@p-goulart p-goulart deleted the pt/dict/new_tokenisation branch April 8, 2024 09:51