-
-
Notifications
You must be signed in to change notification settings - Fork 769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ICU-22404 Unicode 15.1 beta data files & API constants #2492
Conversation
echeran
commented
Jun 5, 2023
•
edited by markusicu
Loading
edited by markusicu
- Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
- Required: Each commit message must be prefixed with a JIRA Issue number.
- Issue accepted (done by Technical Committee after discussion)
- Tests included, if applicable
- API docs and/or User Guide docs changed or added, if applicable
Hi @eggrobin this branch/PR has pretty much all of the Unicode 15.1 beta changes for ICU: New property values (though not yet new properties), updated data & test data, collation... Could you please try to implement the BreakIterator rule changes and add commits to this branch? The biggest change is the one for orthographic syllables:
There is also: Line breaking around « quotation marks » Was there something else in this area? FYI @FrankYFTang |
I cannot, probably because I do not have write access either to this repository nor to @echeran’s. echeran#48 has the changes. |
@echeran could you please rebase, and resolve the conflicts (jar files)? We should try to finish this soon. |
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
… on my machine it worked. I’ll look at this once ICU-TC lets me in on Thursday. |
You should not need to wait with debugging before we add you to the GitHub team. |
@eggrobin it turned out that we just needed to refresh the Unicode 15.1 data in the ICU4J data jar file. Now, at least locally, we no longer see RBBI test failures. There are a few other failures that I think (hope) are not related to BreakIterator. |
The ants seem happier, but Windows still seems sad. Interestingly, I am now unable to build this branch on Windows:
I do not see that error on CI, but I see this, which I would naïvely think suggests that something went wrong when building:
(I have no idea where to look beyond that.) |
@eggrobin @mihnita On a hunch, Elango and I fixed some charset encoding issues in C++ code, but most of the Windows builds still have rbbi test failures. For example, MSVC x64 Debug (VS2019) / Run x64 Debug Tests. Linux & macOS pass. Java passes. Our best guess is still that it somehow has to do with Windows defaulting to the windows-1252 charset rather than UTF-8, and that maybe Windows-on-ARM and Robin's Windows machine are set to use UTF-8 as the default charset. But we apparently failed to stare down the problem in the code diffs. (If the charset is the problem, then somewhere a non-ASCII @rp9-next |
If this were indeed something about how the data files are built, then someone who has access to both a macOS or Linux machine, as well as a Windows machine with the windows-1252 default codepage, could check if the binary files (mostly .brk files) are the same. |
That one doesn’t have the Run Tests step, which is the one that fails, so I think that is a red herring. I still have not managed to build this branch on my Windows machines (obviously it used to build before the rebase, with the RBBI tests passing), but I also can’t seem to build main, so I suspect I may have some local configuration that still thinks it is 73; my local issues are probably unrelated to the CI issues. |
Figured that out, the tools in Having managed to build this branch… on my machine(s) it works:
They are.
Hm. I can try messing with that setting. |
Alright, setting my system to not-UTF-8, I can reproduce the failures:
eggrobin@9c6cc1c (which you will have to cherry-pick since I cannot push commits to this branch) fixes the tests. We should file a follow-up ticket to read the rule files as UTF-8; they are already non-ASCII UTF-8 interpreted as whichever system codepage (the comments are not ASCII). --- a/icu4c/source/tools/genbrk/genbrk.cpp
+++ b/icu4c/source/tools/genbrk/genbrk.cpp
@@ -234,8 +234,11 @@ int main(int argc, char **argv) {
if (U_FAILURE(status)) {
exit(status);
}
- if(encoding!=nullptr ){
- ruleSourceC += signatureLength;
+ if (encoding == nullptr) {
+ // In the absence of a BOM, assume the rule file is in UTF-8.
+ encoding = "UTF-8";
+ } else {
+ ruleSourceC += signatureLength;
ruleFileSize -= signatureLength;
}
Having hopefully figured that one out, I will put my system back to UTF-8 before something breaks. |
I've checked the |
See unicode-org#2492 Co-authored-by: Andy Heninger <[email protected]> Co-authored-by: Robin Leroy <[email protected]>
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
I applied the patch that @eggrobin included in his previous comment to modify genbrk (the alternative to editing the break rule files by escaping non-ASCII characters). After the change, |
I think I have a fix (tested by once again setting my machine to codepage 1252 😩), but I still seem to be unable to push commits to this branch, so have another diff: --- a/icu4c/source/test/intltest/rbbiapts.cpp
+++ b/icu4c/source/test/intltest/rbbiapts.cpp
@@ -1030,7 +1030,7 @@ void RBBIAPITest::RoundtripRule(const char *dataFile) {
parseError.offset = 0;
LocalUDataMemoryPointer data(udata_open(U_ICUDATA_BRKITR, "brk", dataFile, &status));
uint32_t length;
- const char *builtSource;
+ UnicodeString builtSource;
const uint8_t *rbbiRules;
const uint8_t *builtRules;
@@ -1040,12 +1040,13 @@ void RBBIAPITest::RoundtripRule(const char *dataFile) {
}
builtRules = (const uint8_t *)udata_getMemory(data.getAlias());
- builtSource = (const char *)(builtRules + ((RBBIDataHeader*)builtRules)->fRuleSource);
+ builtSource = UnicodeString::fromUTF8(
+ (const char *)(builtRules + ((RBBIDataHeader *)builtRules)->fRuleSource));
LocalPointer<RuleBasedBreakIterator> brkItr (new RuleBasedBreakIterator(builtSource, parseError, status));
if (U_FAILURE(status)) {
errln("%s:%d createRuleBasedBreakIterator: ICU Error \"%s\" at line %d, column %d\n",
__FILE__, __LINE__, u_errorName(status), parseError.line, parseError.offset);
- errln(UnicodeString(builtSource));
+ errln(builtSource);
return;
}
rbbiRules = brkItr->getBinaryRules(length); |
See unicode-org#2492 Co-authored-by: Andy Heninger <[email protected]> Co-authored-by: Robin Leroy <[email protected]>
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Note that this PR fixes ICU-22039; I don’t know whether we have a scheme for marking a commit as related to multiple issues. |
See unicode-org#2492 Co-authored-by: Andy Heninger <[email protected]> Co-authored-by: Robin Leroy <[email protected]>
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
"All checks have passed" -- super!! thanks everyone!! |
I just closed that with resolution "Fixed by Other Ticket" and a comment referring back here. |
if (encoding == nullptr) { | ||
// In the absence of a BOM, assume the rule file is in UTF-8. | ||
encoding = "UTF-8"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is a change in behavior of this tool.
In a follow-up change, we should document (genbrk usage output, maybe User Guide) that genbrk looks for a Unicode signature byte sequence and otherwise assumes UTF-8.
https://unicode.org/faq/utf_bom.html
https://www.unicode.org/glossary/#unicode_signature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm great!!
See unicode-org#2492 Co-authored-by: Andy Heninger <[email protected]> Co-authored-by: Robin Leroy <[email protected]>