-
-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ICU-22547 fix addLikelySubtags for 4 chars script code #2687
ICU-22547 fix addLikelySubtags for 4 chars script code #2687
Conversation
Also fix ICU-22546 to correct the comments in the API doc and add additional unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow what's going on here or why this is a bug in the first place, and using "aaaa"
as a test case doesn't clarify anything for me. I get that by the rules, "aaaa"
is a syntactically well-formed language tag, but what does it mean? Is the idea that if you only have a script code, you can maximize it? That is, that "Latn"
should maximize to 'en_Latn_US"
like "und_Latn"
does? Or is something else going on here? If I guessed right about how this is supposed to work, can you use a clearer test case?
For the locale class, it does not have to mean anything. Think of a well-formed but invalid subtag like an unassigned code point. It shouldn't be an “illegal argument”, we just don't have specific data for it. In BCP 47, language subtags can be 2..8 letters. This means that ICU could, and probably should, treat 4-letter language subtags as ill-formed, but 5..8 letters as well-formed. Note that there are no valid language subtags currently that are longer than 3 letters. For testing, we need to make something up. |
@richgillam noted:
It explicitly doesn't mean anything. BCP47 reserves @markusicu said:
Maybe? BCP47 "well-formed" would disagree with you. However, we would have ample warning if someone were to try to revise BCP47 to use the reserved ALPHA4 space and it's probably better to avoid the "footgun" of trying to use a script subtag as the primary language subtag. |
Okay, so I misunderstood what was going on here. I didn't know that longer language subtags were legal. In that case, Frank's unit test is fine as it stands, and I think I'm happy with the whole PR as it stands. |
As I mentioned in the comments // ICU-22547 "aaaa" match unicode_script_subtag therefore match unicode_language_id because of the " | unicode_script_subtag" |
Also fix ICU-22546 to correct the comments in the API doc and add additional unit tests
Checklist