Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22547 fix addLikelySubtags for 4 chars script code #2687

Merged
merged 1 commit into from
Oct 28, 2023

Conversation

FrankYFTang
Copy link
Contributor

Also fix ICU-22546 to correct the comments in the API doc and add additional unit tests

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22547
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

Also fix ICU-22546 to correct the comments in the API doc
and add additional unit tests
@FrankYFTang
Copy link
Contributor Author

@anba

Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow what's going on here or why this is a bug in the first place, and using "aaaa" as a test case doesn't clarify anything for me. I get that by the rules, "aaaa" is a syntactically well-formed language tag, but what does it mean? Is the idea that if you only have a script code, you can maximize it? That is, that "Latn" should maximize to 'en_Latn_US" like "und_Latn" does? Or is something else going on here? If I guessed right about how this is supposed to work, can you use a clearer test case?

@markusicu
Copy link
Member

I get that by the rules, "aaaa" is a syntactically well-formed language tag, but what does it mean?

For the locale class, it does not have to mean anything. Think of a well-formed but invalid subtag like an unassigned code point. It shouldn't be an “illegal argument”, we just don't have specific data for it.

In BCP 47, language subtags can be 2..8 letters.
CLDR actually forbids 4-letter language subtags: https://www.unicode.org/reports/tr35/#unicode_language_subtag

This means that ICU could, and probably should, treat 4-letter language subtags as ill-formed, but 5..8 letters as well-formed.

Note that there are no valid language subtags currently that are longer than 3 letters. For testing, we need to make something up.

@aphillips
Copy link
Member

@richgillam noted:

I get that by the rules, "aaaa" is a syntactically well-formed language tag, but what does it mean?

It explicitly doesn't mean anything. BCP47 reserves ALPHA4 subtags for future use (that is, there are definitely no valid language tags with alpha4 primary language subtags). It is possible to register a subtag with 5 to 8 characters as a primary language subtag, but BCP47 is super-frowny about trying to do that and there aren't any like this (or likely to be any).

@markusicu said:

This means that ICU could, and probably should, treat 4-letter language subtags as ill-formed, but 5..8 letters as well-formed.

Maybe? BCP47 "well-formed" would disagree with you. However, we would have ample warning if someone were to try to revise BCP47 to use the reserved ALPHA4 space and it's probably better to avoid the "footgun" of trying to use a script subtag as the primary language subtag.

@richgillam
Copy link
Contributor

It explicitly doesn't mean anything. BCP47 reserves ALPHA4 subtags for future use (that is, there are definitely no valid language tags with alpha4 primary language subtags).

Okay, so I misunderstood what was going on here. I didn't know that longer language subtags were legal. In that case, Frank's unit test is fine as it stands, and I think I'm happy with the whole PR as it stands.

@FrankYFTang
Copy link
Contributor Author

As I mentioned in the comments

// ICU-22547
// unicode_language_id = "root" |
// (unicode_language_subtag (sep unicode_script_subtag)? | unicode_script_subtag)
// (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
// so "aaaa" is a well-formed unicode_language_id

"aaaa" match unicode_script_subtag therefore match unicode_language_id because of the " | unicode_script_subtag"

@FrankYFTang FrankYFTang merged commit 92eeb45 into unicode-org:main Oct 28, 2023
@FrankYFTang FrankYFTang deleted the ICU-22547-addlinkelyaaa branch October 28, 2023 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants