ICU-22547 fix addLikelySubtags for 4 chars script code #2687

FrankYFTang · 2023-10-27T01:00:22Z

Also fix ICU-22546 to correct the comments in the API doc and add additional unit tests

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22547
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable


FrankYFTang · 2023-10-27T01:00:34Z

richgillam

I don't follow what's going on here or why this is a bug in the first place, and using "aaaa" as a test case doesn't clarify anything for me. I get that by the rules, "aaaa" is a syntactically well-formed language tag, but what does it mean? Is the idea that if you only have a script code, you can maximize it? That is, that "Latn" should maximize to "en_Latn_US" like "und_Latn" does? Or is something else going on here? If I guessed right about how this is supposed to work, can you use a clearer test case?

markusicu · 2023-10-27T15:40:29Z

> I get that by the rules, "aaaa" is a syntactically well-formed language tag, but what does it mean?

For the locale class, it does not have to mean anything. Think of a well-formed but invalid subtag like an unassigned code point. It shouldn't be an “illegal argument”, we just don't have specific data for it.

In BCP 47, language subtags can be 2..8 letters.
CLDR actually forbids 4-letter language subtags: https://www.unicode.org/reports/tr35/#unicode_language_subtag

This means that ICU could, and probably should, treat 4-letter language subtags as ill-formed, but 5..8 letters as well-formed.

Note that there are no valid language subtags currently that are longer than 3 letters. For testing, we need to make something up.
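To make the length rules above concrete, here is a minimal sketch that classifies a candidate primary language subtag by length. It is not ICU code: the enum `LanguageSubtagStatus` and the function `classifyLanguageSubtag` are hypothetical names made up for illustration, and the sketch assumes ASCII letters only, with 2..8 letters allowed by BCP 47 and 4-letter subtags excluded by the CLDR `unicode_language_subtag` rule.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification, for illustration only (not an ICU type).
enum class LanguageSubtagStatus {
    IllFormed,   // not 2..8 ASCII letters
    Bcp47Only,   // 4 letters: allowed by BCP 47 syntax, excluded by CLDR
    Cldr         // 2..3 or 5..8 letters: allowed by both BCP 47 and CLDR
};

// Classifies a primary language subtag purely by its shape.
// No validity (registry) lookup is attempted.
LanguageSubtagStatus classifyLanguageSubtag(const std::string& subtag) {
    for (char c : subtag) {
        const bool isAsciiLetter =
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        if (!isAsciiLetter) {
            return LanguageSubtagStatus::IllFormed;
        }
    }
    const std::size_t n = subtag.size();
    if (n < 2 || n > 8) {
        return LanguageSubtagStatus::IllFormed;
    }
    return n == 4 ? LanguageSubtagStatus::Bcp47Only
                  : LanguageSubtagStatus::Cldr;
}
```

Under these assumptions, "en" and a made-up 5-letter subtag such as "aaaaa" are classified as `Cldr`, while "aaaa" is `Bcp47Only`, which is why a made-up test value is needed either way.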

aphillips · 2023-10-27T16:29:25Z

@richgillam noted:

> I get that by the rules, "aaaa" is a syntactically well-formed language tag, but what does it mean?

It explicitly doesn't mean anything. BCP47 reserves ALPHA4 subtags for future use (that is, there are definitely no valid language tags with alpha4 primary language subtags). It is possible to register a subtag with 5 to 8 characters as a primary language subtag, but BCP47 is super-frowny about trying to do that and there aren't any like this (or likely to be any).

@markusicu said:

> This means that ICU could, and probably should, treat 4-letter language subtags as ill-formed, but 5..8 letters as well-formed.

Maybe? BCP47 "well-formed" would disagree with you. However, we would have ample warning if someone were to try to revise BCP47 to use the reserved ALPHA4 space and it's probably better to avoid the "footgun" of trying to use a script subtag as the primary language subtag.

richgillam · 2023-10-27T17:36:48Z

> It explicitly doesn't mean anything. BCP47 reserves ALPHA4 subtags for future use (that is, there are definitely no valid language tags with alpha4 primary language subtags).

Okay, so I misunderstood what was going on here. I didn't know that longer language subtags were legal. In that case, Frank's unit test is fine as it stands, and I think I'm happy with the whole PR as it stands.

FrankYFTang · 2023-10-28T00:28:09Z

As I mentioned in the code comments:

// ICU-22547
// unicode_language_id = "root" |
// (unicode_language_subtag (sep unicode_script_subtag)? | unicode_script_subtag)
// (sep unicode_region_subtag)? (sep unicode_variant_subtag)* ;
// so "aaaa" is a well-formed unicode_language_id

"aaaa" matches unicode_script_subtag, and therefore matches unicode_language_id through the "| unicode_script_subtag" alternative.
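The grammar quoted above can be checked mechanically. The following is a rough sketch, not the code in this PR: it transcribes the `unicode_language_id` production into a `std::regex`, assuming the subtag shapes from UTS #35 (language = 2..3 or 5..8 letters, script = 4 letters, region = 2 letters or 3 digits, variant = 5..8 alphanumerics or a digit plus 3 alphanumerics, separator = "-" or "_"). The function name `isWellFormedUnicodeLanguageId` is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if `id` matches the unicode_language_id grammar.
// Only the syntax is checked; subtags are not validated against any data.
bool isWellFormedUnicodeLanguageId(const std::string& id) {
    static const std::string sep = "[-_]";
    static const std::string language = "(?:[A-Za-z]{2,3}|[A-Za-z]{5,8})";
    static const std::string script = "[A-Za-z]{4}";
    static const std::string region = "(?:[A-Za-z]{2}|[0-9]{3})";
    static const std::string variant =
        "(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})";
    // "root" | (language (sep script)? | script) (sep region)? (sep variant)*
    static const std::regex pattern(
        "root|(?:" + language + "(?:" + sep + script + ")?|" + script + ")"
        "(?:" + sep + region + ")?(?:" + sep + variant + ")*");
    // regex_match requires the whole string to match, so no anchors are needed.
    return std::regex_match(id, pattern);
}
```

With this sketch, "aaaa" is accepted only through the script-only alternative, while "aaaa_bbbb" is rejected because "bbbb" fits neither the region nor the variant shape.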

ICU-22547 fix addLikelySubtags for 4 chars script code

51da31e

Also fix ICU-22546 to correct the comments in the API doc and add additional unit tests

FrankYFTang requested review from richgillam and markusicu October 27, 2023 01:00

richgillam reviewed Oct 27, 2023


richgillam approved these changes Oct 27, 2023


FrankYFTang merged commit 92eeb45 into unicode-org:main Oct 28, 2023

FrankYFTang deleted the ICU-22547-addlinkelyaaa branch October 28, 2023 00:29
