Native text encoding conversions #15001

Open
HertzDevil opened this issue Sep 14, 2024 · 4 comments

HertzDevil commented Sep 14, 2024

Crystal currently relies on iconv or GNU libiconv for conversions between text encodings. This has a few problems:

  • iconv does not guarantee support for any particular encoding, and it provides no standard way to query or enumerate the encodings it does support. (The nonstandard iconvlist and libiconvlist are available in BSD libc and GNU libiconv respectively.) For all we know, an iconv implementation that supports neither UTF-8 nor UTF-16 is still POSIX-compliant. The same goes for the behavior behind the invalid: :skip option, which is not guaranteed either.
  • The standard library already has separate APIs to deal with UTF-16, and technically UTF-32 too if we consider Char to be equivalent to Int32, yet they are not integrated into the usual transcoding APIs like String#encode and IO#set_encoding. In particular, it makes sense that these encodings should remain supported in those places, even when -Dwithout_iconv is defined.
  • Some system iconv implementations are known to be buggy, such as the macOS one and the Android one (Bionic libc, API level 28+).
  • GNU libiconv being licensed under LGPLv2.1 complicates certain deployment scenarios.
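
To illustrate the second point, here is a minimal sketch of the two separate paths that exist today. It only uses String#to_utf16, String.from_utf16, String#encode and String.new(bytes, encoding); the variable names are made up for the example, and the second half assumes an iconv that knows "UTF-16LE".

```crystal
str = "ab😂c"

# Dedicated UTF-16 API: operates on UInt16 code units and never touches iconv.
units = str.to_utf16                 # Slice(UInt16)
roundtrip = String.from_utf16(units) # back to a String

# General transcoding API: operates on bytes and is backed by iconv,
# so it is unavailable when -Dwithout_iconv is defined.
bytes = str.encode("UTF-16LE")          # Bytes
decoded = String.new(bytes, "UTF-16LE") # back to a String
```

Both paths produce the same UTF-16 data, but only the first one works without iconv, which is the gap this issue is about.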

The essence of, for example, UTF-16 to UTF-8 conversion can be implemented on top of iconv's function signature as:

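# Mirrors iconv's calling convention: both buffer pointers are advanced and
# both remaining-size counters are decremented as characters are converted.
def iconv_utf16_to_utf8(in_buffer : UInt8**, in_buffer_left : Int32*, out_buffer : UInt8**, out_buffer_left : Int32*)
  # Reinterpret the remaining input bytes as UTF-16 code units.
  utf16_slice = in_buffer.value.to_slice(in_buffer_left.value).unsafe_slice_of(UInt16)
  String.each_utf16_char(utf16_slice) do |ch|
    # Characters outside the BMP came from a surrogate pair (4 input bytes);
    # everything else came from a single code unit (2 input bytes).
    in_bytesize = ch.ord >= 0x10000 ? 4 : 2
    ch_bytesize = ch.bytesize
    # Stop as soon as the output buffer cannot hold the next whole character.
    break unless out_buffer_left.value >= ch_bytesize

    # Write the character's UTF-8 bytes to the output buffer.
    ch.each_byte do |b|
      out_buffer.value.value = b
      out_buffer.value += 1
    end

    in_buffer.value += in_bytesize
    in_buffer_left.value -= in_bytesize
    out_buffer_left.value -= ch_bytesize
  end
end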

str = Bytes[0x61, 0x00, 0x62, 0x00, 0x3D, 0xD8, 0x02, 0xDE, 0x63, 0x00]
bytes = uninitialized UInt8[32]

in_buffer = str.to_unsafe
in_buffer_left = str.bytesize
out_buffer = bytes.to_unsafe
out_buffer_left = bytes.size
iconv_utf16_to_utf8(pointerof(in_buffer), pointerof(in_buffer_left), pointerof(out_buffer), pointerof(out_buffer_left))

String.new(bytes.to_slice[0, bytes.size - out_buffer_left]) # => "ab😂c"

Going in the opposite direction would need something like #13639 to be equally concise, but the point is that we could indeed achieve this without using iconv at all. If both the source and destination encodings are among UTF-8, UTF-16, UTF-32, and maybe ASCII, we could use our own native transcoders instead of iconv. If we are ambitious enough, we could even port the entire set of ICU character set mapping tables in an automated manner and remove our dependency on iconv altogether.
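
For comparison, the opposite direction can be sketched without iconv by encoding surrogate pairs by hand. This is only an illustration: utf8_to_utf16le is a hypothetical helper, not an existing or proposed standard library method, and it works on a whole String rather than on iconv-style buffers.

```crystal
# Hypothetical helper (not part of the standard library): encodes a String
# as little-endian UTF-16 bytes without going through iconv.
def utf8_to_utf16le(str : String) : Bytes
  io = IO::Memory.new
  str.each_char do |ch|
    code = ch.ord
    if code < 0x10000
      # BMP character: a single 16-bit code unit.
      io.write_bytes(code.to_u16, IO::ByteFormat::LittleEndian)
    else
      # Supplementary character: split into a high and a low surrogate.
      code -= 0x10000
      io.write_bytes((0xD800 + (code >> 10)).to_u16, IO::ByteFormat::LittleEndian)
      io.write_bytes((0xDC00 + (code & 0x3FF)).to_u16, IO::ByteFormat::LittleEndian)
    end
  end
  io.to_slice
end

utf16 = utf8_to_utf16le("ab😂c")
# => Bytes[0x61, 0x00, 0x62, 0x00, 0x3D, 0xD8, 0x02, 0xDE, 0x63, 0x00]
```

The result matches the input bytes of the UTF-16 to UTF-8 example above. A buffer-based version following iconv's signature would additionally have to track partial input and output space.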

@ysbaddaden
Contributor

A pure Crystal implementation would be lovely. For the sake of argument, are there alternatives to libiconv?

@HertzDevil
Contributor Author

@ysbaddaden
Contributor

Thank you 🙇

ysbaddaden commented Sep 16, 2024

The W3C Encoding Standard already sets the bar quite high, but seems to support a good list of general encodings 👍

There's a part 2 to the comparison article that focuses on C and presents ztd.cuneicode. I'm not saying we should use it, but it sounds like a solid reference, and both articles are a treasure trove of information.
