-
Notifications
You must be signed in to change notification settings - Fork 10.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Mangling] Fix StringRef assertion failure when mangling identifiers with invalid UTF-8. #77233
base: main
Are you sure you want to change the base?
Conversation
…with invalid UTF-8. mangleIdentifier gets a zero-length string if encodePunycodeUTF8 fails, and then tries to access index 0 of it. Add an option to encodePunycodeUTF8 to repair invalid UTF-8 rather than rejecting it, and have mangleIdentifier use this option. rdar://134362682
@swift-ci please test |
I wonder if rather than using replacement characters, we should convert invalid sequences to characters in the Private Use Area; this is quite a common way to handle things where you want to preserve the original bytes of an invalid sequence somehow (e.g. by mapping byte |
} else { | ||
uint8_t second = *ptr++; | ||
if (!isContinuationByte(second)) | ||
isInvalid = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is wrong. The Unicode Standard says, in 3.9.5 Constraints on Conversion Processes:
If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence.
This means we should not be consuming the second byte in that case, but we clearly are. It goes on to give an example, namely that C2 41 42
should be understood as U+FFFD, U+0041, U+0042
rather than U+FFFD
or U+FFFD, U+0042
.
The other cases (three-byte and four-byte sequences) have the same problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I would not have thought there would be requirements on how to handle invalid sequences. The rationale makes sense now that I've seen it though.
uint8_t second = *ptr++; | ||
uint8_t third = *ptr++; | ||
if (!isContinuationByte(second) || !isContinuationByte(third)) | ||
isInvalid = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous.
uint8_t fourth = *ptr++; | ||
if (!isContinuationByte(second) || !isContinuationByte(third) | ||
|| !isContinuationByte(fourth)) | ||
isInvalid = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous.
I'm not too concerned about preserving the invalid bytes or having different input sequences remain distinct. Non-UTF-8 identifiers aren't allowed, and raise an error when you try to compile them. If someone gets one into the runtime then it will fail to match anything. The trouble shows up in Remote Mirror where we can potentially interpret garbage as Swift metadata. We just need to make sure we fail gracefully there. |
mangleIdentifier gets a zero-length string if encodePunycodeUTF8 fails, and then tries to access index 0 of it.
Add an option to encodePunycodeUTF8 to repair invalid UTF-8 rather than rejecting it, and have mangleIdentifier use this option.
rdar://134362682