UTF-16 string literals #14670
Comments
Simply porting #14671 works fine, explicit math operations on 16-bit integers are not needed.

    class String
      macro utf16_literal(data)
        {%
          arr = [] of NumberLiteral
          data.chars.each do |c|
            c = c.ord
            if c < 0x1_0000
              arr << c
            else
              c -= 0x1_0000
              arr << 0xd800 + ((c >> 10) & 0x3ff)
              arr << 0xdc00 + (c & 0x3ff)
            end
          end
          arr << 0
        %}
        Slice(UInt16).literal({{arr.splat}})[0, {{arr.size - 1}}]
      end
    end

    s = String.utf16_literal("TEST 😐🐙 ±∀ の")
    # => Slice[84, 69, 83, 84, 32, 55357, 56848, 55357, 56345, 32, 177, 8704, 32, 12398]
    String.from_utf16(s)
    # => "TEST 😐🐙 ±∀ の"

Encoding 10000 characters takes around 300ms.

EDIT: Added a final 0 byte
Looks like a winner, then 🚀
Yeah, this is mainly for relatively short strings, so performance should not be an issue. Btw.
In order to make it actually static data, we'd also need a slice literal (#2886).
The version from my comment uses the literals from #13716, so it is static data in this case.
Worth noting that Windows supports UTF-8 now and encourages use of those APIs: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page#-a-vs--w-apis So the conversions could be avoided entirely.
Would be nice. But I believe we're quite a bit away from that. The Windows ecosystem is huge and it has 30 years of wide chars in it.
@straight-shoota this is reusing the "old" ANSI API to use the UTF-8 code page, so it might just work 🤷 It took me a while to find this: the above link explains how to set the Active Code Page (ACP) to UTF-8, which requires a manifest and running an EXE to add the manifest to an executable. In that executable, the ANSI variant of the Windows API will then use UTF-8. That being said, it requires Windows 10 v1903 (2019), and GDI applications won't support it unless the user activates a beta setting.
The macro is nice, but if we want to eventually have the compiler optimize it, maybe we could just expose the
Hm, that's an interesting idea. Exposing
FTR: Eventual compiler optimization would also be possible with
Let's focus on UTF-16 string literals here and continue the discussion about UTF-8 support on Win32 in a different issue. I'm pretty sure we won't lose all use cases for UTF-16 string literals overnight, so this will still be useful.
The difficulty to implement
It could return
Btw I just tested the performance of my macro code a bit more. The macro language actually isn't that slow; the parser is. Implementing
Maybe there should be a way to create AST nodes directly inside the macro language, so we don't have to parse everything again.
You can activate the code pages in code; this is how applications like the MS Edge browser run.
Do we want to proceed with
I like
When working with Windows APIs, it's common that we need UTF-16 strings (instead of Crystal's `String`, which is UTF-8). `String#to_utf16` is available for conversion.

But most use cases of this method in stdlib are actually for string literals (e.g. `"Content Type".to_utf16`). This is a bit unnecessary because it means the string transformation happens at runtime, while it could happen entirely at compile time, avoiding extra computation and allocation.

A particularly intricate use case is in #14659, where we must not allocate at all. So it ends up with this mechanism to achieve compile-time conversion: `UInt16.static_array({% for chr in "CRYSTAL_TRACE".chars %}{{chr.ord}}, {% end %} 0)`.

This certainly works, at least for this limited use case. But it fails for code points outside the Basic Multilingual Plane. So it's not a generic solution.
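To illustrate why one code unit per character is not enough, here is a minimal runtime sketch of the surrogate-pair split. The helper name `utf16_units` is hypothetical and only for illustration; it is not part of stdlib and runs at runtime, so it does not solve the compile-time problem by itself.

```crystal
# Hypothetical helper (not stdlib): encodes a String into UTF-16 code units,
# splitting code points above U+FFFF into a high and a low surrogate.
def utf16_units(str : String) : Array(UInt16)
  units = [] of UInt16
  str.each_char do |chr|
    cp = chr.ord
    if cp < 0x1_0000
      # Code points below U+10000 fit into a single 16-bit unit.
      units << cp.to_u16
    else
      # Outside the BMP: subtract 0x10000, then split the remaining 20 bits.
      cp -= 0x1_0000
      units << (0xd800 + ((cp >> 10) & 0x3ff)).to_u16 # high surrogate
      units << (0xdc00 + (cp & 0x3ff)).to_u16         # low surrogate
    end
  end
  units
end

# One code point per character: U+1F610 does not fit into a UInt16.
p "A😐".chars.map(&.ord) # => [65, 128528]

# Two code units for the emoji, matching String#to_utf16.
p utf16_units("A😐")                       # => [65, 55357, 56848]
p utf16_units("A😐") == "A😐".to_utf16.to_a # => true
```

The `chr.ord` approach above emits 128528 for 😐, which is larger than `UInt16::MAX`, so it cannot be used as a single UTF-16 code unit.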
It would be nice if we had an easy tool for creating UTF-16 encoded strings.

Maybe the conversion algorithm from `String#to_utf16` could be implemented as a macro method? It's a bit complex, but not too much. I don't think we can explicitly do math operations on 16-bit integers in the macro language, though.

An alternative would be to expose a compiler primitive for UTF-16 conversion.
Related: #2886