[Mangling] Fix StringRef assertion failure when mangling identifiers with invalid UTF-8. #77233

mikeash · 2024-10-25T20:44:23Z

mangleIdentifier gets a zero-length string if encodePunycodeUTF8 fails, and then tries to access index 0 of it.

Add an option to encodePunycodeUTF8 to repair invalid UTF-8 rather than rejecting it, and have mangleIdentifier use this option.

rdar://134362682
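
To illustrate the failure mode and the shape of the fix, here is a minimal sketch. It is not the code in this PR: `toyEncode`, `isNeverValidByte`, and `firstEncodedByte` are hypothetical names, there is no Punycode step, and the validity check is deliberately simplified to the bytes that can never occur in UTF-8. It only shows why a rejecting encoder hands the caller an empty string, and why a repairing one does not.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

// Bytes 0xC0, 0xC1 and 0xF5..0xFF never occur in well-formed UTF-8.
// This toy check looks only at those bytes; a real validator must also
// check continuation bytes, overlong forms and surrogates.
static bool isNeverValidByte(uint8_t b) {
  return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

// Hypothetical stand-in for an encoder with a repair option.
// Returns false and leaves `out` empty when it rejects the input.
bool toyEncode(std::string_view in, bool repairInvalid, std::string &out) {
  out.clear();
  for (unsigned char b : in) {
    if (isNeverValidByte(b)) {
      if (!repairInvalid) {
        out.clear();
        return false;
      }
      out += "\xEF\xBF\xBD"; // UTF-8 encoding of U+FFFD
      continue;
    }
    out += static_cast<char>(b);
  }
  return true;
}

// Hypothetical caller that needs the first encoded byte. With repair
// enabled and a non-empty identifier, `encoded` is never empty here,
// so indexing position 0 is safe.
char firstEncodedByte(std::string_view identifier) {
  assert(!identifier.empty());
  std::string encoded;
  toyEncode(identifier, /*repairInvalid=*/true, encoded);
  return encoded[0];
}
```

With `repairInvalid` set to false, the same caller would index an empty string for an input such as the single byte `0xFF`, which is the situation described above.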

mikeash · 2024-10-25T20:44:37Z

@swift-ci please test

al45tair · 2024-10-28T11:37:55Z

I wonder if rather than using replacement characters, we should convert invalid sequences to characters in the Private Use Area; this is quite a common way to handle things where you want to preserve the original bytes of an invalid sequence somehow (e.g. by mapping byte xx to U+f0xx). That would have the advantage that we wouldn't have matches between different invalid sequences.
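
As a rough sketch of that mapping (hypothetical helper names, not code from this PR), each invalid byte xx becomes the scalar U+F0xx, which sits inside the Private Use Area U+E000..U+F8FF and always encodes as three UTF-8 bytes:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Maps an invalid byte xx to the Private Use Area scalar U+F0xx.
// Distinct bytes give distinct scalars, so different invalid sequences
// stay distinguishable after repair.
uint32_t privateUseScalarForByte(uint8_t byte) {
  return 0xF000u | byte;
}

// Encodes a scalar in U+0800..U+FFFF as three UTF-8 bytes.
// (Hypothetical helper; surrogates are not checked because the only
// intended inputs are U+F000..U+F0FF.)
std::string encodeThreeByteUTF8(uint32_t scalar) {
  assert(scalar >= 0x800 && scalar <= 0xFFFF);
  std::string out;
  out += static_cast<char>(0xE0 | (scalar >> 12));
  out += static_cast<char>(0x80 | ((scalar >> 6) & 0x3F));
  out += static_cast<char>(0x80 | (scalar & 0x3F));
  return out;
}
```

For example, the stray byte `0xC2` would map to U+F0C2 and `0xE0` to U+F0E0, whereas with replacement characters both would collapse to U+FFFD.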

al45tair · 2024-10-28T11:47:51Z

lib/Demangling/Punycode.cpp

+      } else {
+        uint8_t second = *ptr++;
+        if (!isContinuationByte(second))
+          isInvalid = true;


I think this is wrong. The Unicode Standard says, in 3.9.5 Constraints on Conversion Processes:

If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence.

This means we should not be consuming the second byte in that case, but we clearly are. It goes on to give an example, namely that C2 41 42 should be decoded as U+FFFD, U+0041, U+0042, rather than as a single U+FFFD or as U+FFFD, U+0042.

The other cases (three-byte and four-byte sequences) have the same problem.
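
A minimal sketch of a decoder that follows this constraint, assuming a standalone function rather than the actual loop in Punycode.cpp (`decodeRepairingUTF8` is a hypothetical name). The key point is that each successor byte is checked before `ptr` advances, so a byte that could start a new sequence is left for the next iteration. Overlong forms, surrogates and out-of-range values are rejected after decoding; the number of U+FFFD emitted for those cases may differ from the "maximal subpart" practice the standard recommends.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

static bool isContinuationByte(uint8_t b) { return (b & 0xC0) == 0x80; }

// Decodes UTF-8 into scalars, emitting U+FFFD for ill-formed input.
std::vector<uint32_t> decodeRepairingUTF8(const uint8_t *ptr, size_t len) {
  assert(ptr != nullptr || len == 0);
  const uint32_t replacement = 0xFFFD;
  std::vector<uint32_t> out;
  const uint8_t *end = ptr + len;
  while (ptr < end) {
    uint8_t first = *ptr++;
    if (first < 0x80) {
      out.push_back(first);
      continue;
    }

    int extra;        // number of continuation bytes expected
    uint32_t value;   // payload bits taken from the lead byte
    uint32_t minimum; // smallest scalar that needs this length
    if ((first & 0xE0) == 0xC0) {
      extra = 1; value = first & 0x1F; minimum = 0x80;
    } else if ((first & 0xF0) == 0xE0) {
      extra = 2; value = first & 0x0F; minimum = 0x800;
    } else if ((first & 0xF8) == 0xF0) {
      extra = 3; value = first & 0x07; minimum = 0x10000;
    } else {
      // Stray continuation byte or a byte that cannot lead a sequence.
      out.push_back(replacement);
      continue;
    }

    bool isInvalid = false;
    for (int i = 0; i < extra; ++i) {
      // Peek first: stop without consuming at end of input or at a
      // byte that is not a continuation byte.
      if (ptr == end || !isContinuationByte(*ptr)) {
        isInvalid = true;
        break;
      }
      value = (value << 6) | (*ptr++ & 0x3F);
    }
    // Reject overlong forms, values past U+10FFFF, and surrogates.
    if (value < minimum || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
      isInvalid = true;
    out.push_back(isInvalid ? replacement : value);
  }
  return out;
}
```

Fed the bytes C2 41 42, this produces U+FFFD, U+0041, U+0042, matching the standard's example, because 41 fails the peek and is decoded on the next pass instead of being swallowed.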

mikeash · 2024-10-28

Interesting. I would not have thought there would be requirements on how to handle invalid sequences. The rationale makes sense now that I've seen it though.

al45tair · 2024-10-28T11:48:08Z

lib/Demangling/Punycode.cpp

+        uint8_t second = *ptr++;
+        uint8_t third = *ptr++;
+        if (!isContinuationByte(second) || !isContinuationByte(third))
+          isInvalid = true;


See previous.

al45tair · 2024-10-28T11:48:15Z

lib/Demangling/Punycode.cpp

+        uint8_t fourth = *ptr++;
+        if (!isContinuationByte(second) || !isContinuationByte(third)
+            || !isContinuationByte(fourth))
+          isInvalid = true;


See previous.

mikeash · 2024-10-28T12:58:48Z

I'm not too concerned about preserving the invalid bytes or having different input sequences remain distinct. Non-UTF-8 identifiers aren't allowed, and raise an error when you try to compile them. If someone gets one into the runtime then it will fail to match anything. The trouble shows up in Remote Mirror where we can potentially interpret garbage as Swift metadata. We just need to make sure we fail gracefully there.

mikeash requested review from al45tair and btroller October 25, 2024 20:44

mikeash requested a review from rjmccall as a code owner October 25, 2024 20:44

al45tair reviewed Oct 28, 2024
