Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Mangling] Fix StringRef assertion failure when mangling identifiers with invalid UTF-8. #77233

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mikeash
Copy link
Contributor

@mikeash mikeash commented Oct 25, 2024

mangleIdentifier gets a zero-length string if encodePunycodeUTF8 fails, and then tries to access index 0 of it.

Add an option to encodePunycodeUTF8 to repair invalid UTF-8 rather than rejecting it, and have mangleIdentifier use this option.

rdar://134362682

…with invalid UTF-8.

mangleIdentifier gets a zero-length string if encodePunycodeUTF8 fails, and then tries to access index 0 of it.

Add an option to encodePunycodeUTF8 to repair invalid UTF-8 rather than rejecting it, and have mangleIdentifier use this option.

rdar://134362682
@mikeash
Copy link
Contributor Author

mikeash commented Oct 25, 2024

@swift-ci please test

@al45tair
Copy link
Contributor

I wonder if rather than using replacement characters, we should convert invalid sequences to characters in the Private Use Area; this is quite a common way to handle things where you want to preserve the original bytes of an invalid sequence somehow (e.g. by mapping byte xx to U+f0xx). That would have the advantage that we wouldn't have matches between different invalid sequences.

} else {
uint8_t second = *ptr++;
if (!isContinuationByte(second))
isInvalid = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong. The Unicode Standard says, in 3.9.5 Constraints on Conversion Processes:

If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence.

This means we should not be consuming the second byte in that case, but we clearly are. It goes on to give an example, namely that C2 41 42 should be understood as U+FFFD, U+0041, U+0042 rather than U+FFFD or U+FFFD, U+0042.

The other cases (three-byte and four-byte sequences) have the same problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I would not have thought there would be requirements on how to handle invalid sequences. The rationale makes sense now that I've seen it though.

uint8_t second = *ptr++;
uint8_t third = *ptr++;
if (!isContinuationByte(second) || !isContinuationByte(third))
isInvalid = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous.

uint8_t fourth = *ptr++;
if (!isContinuationByte(second) || !isContinuationByte(third)
|| !isContinuationByte(fourth))
isInvalid = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous.

@mikeash
Copy link
Contributor Author

mikeash commented Oct 28, 2024

I'm not too concerned about preserving the invalid bytes or having different input sequences remain distinct. Non-UTF-8 identifiers aren't allowed, and raise an error when you try to compile them. If someone gets one into the runtime then it will fail to match anything. The trouble shows up in Remote Mirror where we can potentially interpret garbage as Swift metadata. We just need to make sure we fail gracefully there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants