Do not ignore encoding just because a read length is given #74
base: master
Conversation
I don't know the original motivation, but keep in mind that if you read an exact number of bytes from a multi-byte encoding, you may wind up with invalid data in your read - the first byte only of a multi-byte UTF-8 sequence, for instance.
Yes. If I'm reading in chunks and I ask for a multibyte encoding, the chunk might end in the middle of a character, but not in the middle of a byte, and the exact byte count will be properly read.

Ignoring the encoding is catastrophic when stream processing CSV files with UTF-8 BOM markers. The BOM gets read in as binary and the CSV parser then falls over. See also shrinerb/shrine#585. This is critical to our workflow.

There is no reason to ignore encoding if specified; the developer is aware of the issues. Simply ignoring the encoding in an undocumented way is IMHO a really bad thing to do! It leaves the developer powerless to knowledgeably implement their required stream processing workflow, and leaves them diving deep into the source having to spot a very hard-to-see detail.

I would implore the devs to please merge this - ignoring encoding is really bad.

(EDIT: Or of course, if there's another way that Shrine and Down can work together to respect encoding when loading files, that'd work too. The use case is to get the Shrine IO object and pass it to CSV for reading; the CSV input has a BOM, which arrives as binary and trips up the parser, as sketched below.)
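To make the failure concrete, here is a minimal sketch (my own illustration, not code from Shrine or Down) of what happens when a BOM-prefixed UTF-8 CSV arrives as raw binary bytes:

```ruby
require "csv"

# A UTF-8 CSV with a BOM, read as ASCII-8BIT, as happens when the stream
# ignores the requested encoding.
data = "\xEF\xBB\xBFname,age\nAlice,30\n".b

# Nothing strips the BOM, so it ends up glued to the first header.
CSV.parse(data, headers: true).headers.first
#=> "\xEF\xBB\xBFname"
```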
This is actually how Ruby's `IO#read` behaves:

```ruby
file = File.open("Gemfile")
file.read.encoding #=> #<Encoding:UTF-8>

file = File.open("Gemfile")
file.read(10).encoding #=> #<Encoding:ASCII-8BIT>
```

With a length argument, `IO#read` returns binary data regardless of the external encoding, and `Down::ChunkedIO` follows that.
A very good question, and after a few hours I finally figured it out.

The TL;DR (ish)

CSV uses `gets`. In particular, CSV uses `gets(nil, chunk_size)` when looking ahead to auto-detect the row separator (see the code quoted further down).
So, contrary to a strict reading of the Ruby core library documentation, Down::ChunkedIO implements the behaviour of Ruby's actual `IO`, not its documented behaviour. This suggests that my PR needs changes to match the IO behaviour. Do you agree?

Side note

That 1026 is more interesting than it looks. At offsets 0 to 2 inclusive are the BOM bytes. At offset 1023 - the last byte of the first 1024 bytes of the file - is the first byte of a 3-byte UTF-8 character sequence. When we examine the string that `gets(nil, 1024)` returns, it is 1026 bytes long: Ruby read two extra bytes past the limit to avoid splitting that final character.
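The side note can be reproduced in miniature. This is my own sketch rather than the original test, and it assumes StringIO mirrors IO's documented `gets` behaviour:

```ruby
require "stringio"

# "é" is two bytes in UTF-8 (C3 A9). With a limit of 4, a naive cut would
# split it; gets instead reads on to keep the character whole.
io = StringIO.new("abcé")  # bytes: 61 62 63 C3 A9

chunk = io.gets(nil, 4)
chunk.bytesize #=> 5 - the limit was exceeded to complete the character
```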
The investigation

Once I'd given up on letting CSV anywhere near encodings, and accepted that we're only going to be able to process encodings based on 8-bit bytes in which the ASCII line endings survive as single bytes, a workable approach emerged; it is described below.
If interested: CSV processing

Our CSV system regularly encounters files in random encodings, including Windows CP1252, where top-bit-set characters need to be handled specially in order to convert successfully to UTF-8. I found it unavoidable to have a preprocessing step which reads the CSV file data up to some defined byte limit (at which point it gives up with whatever it's determined is the correct encoding, and just hopes that's OK for the whole file). The pseudocode process is roughly as sketched below:
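The pseudocode list itself didn't survive here, so the following is a hedged sketch of the step described above; the names `SCAN_LIMIT`, `CANDIDATE_ENCODINGS` and `guess_encoding` are illustrative, not from our codebase:

```ruby
# Illustrative only - read up to a byte limit, then guess the encoding.
SCAN_LIMIT          = 1024 * 1024
CANDIDATE_ENCODINGS = [Encoding::UTF_8, Encoding::Windows_1252]

def guess_encoding(io)
  bytes = io.read(SCAN_LIMIT) || "".b

  # A UTF-8 BOM is decisive (strip it before handing data to the parser).
  return Encoding::UTF_8 if bytes.start_with?("\xEF\xBB\xBF".b)

  # Otherwise take the first candidate under which the bytes decode cleanly.
  CANDIDATE_ENCODINGS.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? } ||
    Encoding::Windows_1252 # single-byte, so any byte sequence is well-formed
end
```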
Once the internal encoding is determined as best we can, the process can be repeated, but now we take any header or value converters specified in the CSV options, flatten those into an array, and unshift into first place an encoding coercion step which looks something like this:

```ruby
lambda do |value|
  if value.is_a?(String)
    value.force_encoding(internal_encoding).encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  else
    value
  end
end
```

If interested: On Ruby CSV and UTF-16

The CSV code in the Ruby core library "looks ahead" for line endings via this code:

```ruby
def resolve_row_separator(separator)
  if separator == :auto
    cr = "\r".encode(@encoding)
    lf = "\n".encode(@encoding)
```

...so it is encoding the line terminators into whatever the prevailing encoding might be, and it was seeing this that reminded me that I was incorrectly expecting certain control characters to be single-byte in every encoding. The code then goes on to do this:

```ruby
    if @input.is_a?(StringIO)
      # ...not our use case...
    elsif @input.respond_to?(:gets)
      if @input.is_a?(File)
        chunk_size = 32 * 1024 # <--- My test file is only around 4K, so this would read the whole file
      else
        chunk_size = 1024 # <--- This is what happens with S3 / Down::ChunkedIO in play
      end
      begin
        while separator == :auto
          #
          # if we run out of data, it's probably a single line
          # (ensure will set default value)
          #
          break unless sample = @input.gets(nil, chunk_size)
          # ...etc...
        end
      rescue IOError
        # do nothing: ensure will set default
      end
    end
```

This at first glance looks correct, if the stream encoding is set (ignoring multibyte boundaries for a moment). The CSV parser does encode CR and LF, and if the inbound data stream is encoded correctly you'd think the comparison would work. Otherwise, it would fail, because Ruby has no way to compare an 8-bit binary stream with UTF-16:
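The example session that followed didn't survive extraction; this is my reconstruction of the kind of thing it would have shown, not the original output:

```ruby
sample = "a,b\r\nc,d\r\n".b       # binary bytes, as handed over by a raw read
cr     = "\r".encode("UTF-16LE")  # what resolve_row_separator would build

sample.include?(cr)
# raises Encoding::CompatibilityError:
#   incompatible character encodings: ASCII-8BIT and UTF-16LE
```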
...we see here that if the developer was expecting UTF-16 multibyte input, then processing a CSV file would be impossible via a stream object unless the stream data was being read correctly. You don't see it with UTF-8 as Ruby is able to figure out the conversions automatically:
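Again a reconstruction rather than the original session; UTF-8 is ASCII-compatible, so the same lookup quietly succeeds:

```ruby
sample = "a,b\r\nc,d\r\n".b   # binary bytes again
cr     = "\r".encode("UTF-8") # ASCII-compatible needle

sample.include?(cr) #=> true - the 7-bit bytes compare cleanly
```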
...but it gets even worse. When Ruby encodes to UTF-16, it adds in the byte order mark (BOM):
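For example (my illustration; Ruby's endian-unspecified UTF-16 converter emits a big-endian BOM):

```ruby
"\r".encode("UTF-16").bytes
#=> [254, 255, 0, 13] - FE FF is the BOM, 00 0D is the carriage return
```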
...and in that case, even if the data we read in were correctly tagged as UTF-16 too, it doesn't work, because Ruby's encoded CR and LF strings now carry that leading BOM and can never match a line ending in the middle of the data:
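A sketch of the failure (mine, not the original): the needle itself carries the BOM, so it can only ever match at the very start of the sample:

```ruby
cr     = "\r".encode("UTF-16")    # FE FF 00 0D - BOM included
sample = "a\rb".encode("UTF-16")  # FE FF 00 61 00 0D 00 62

sample.include?(cr) #=> false - the only FE FF in the sample is at offset 0
```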
...so we come full circle. At least for some encodings, Ruby CSV will simply be incapable of processing the file, because Ruby string operations - and the way CSV drives them with its encoded line-ending types - just break. We fortunately haven't encountered anyone in the real world rash enough to give us a UTF-16 encoded CSV file yet! 😂
As far as the error in the Ruby documentation goes: https://bugs.ruby-lang.org/issues/18833
I really appreciate your research into this. Yes, I agree that matching Ruby's `IO` behaviour is the right approach here.
Wait, is this it? I understood the report as different - that it would return complete characters rather than exactly the requested number of bytes. That could easily exceed the requested length with a multi-byte encoding.
Exactly right.
The documentation is definitely wrong, relatively seriously so for such a low-level method, which is why I submitted https://bugs.ruby-lang.org/issues/18833 and will try to figure out a PR soon if there's no traction on the issue. When I look at the number of bugs in the Ruby bug tracker, it worries me greatly - they don't seem to be keeping on top of things (1791 open issues).
The Ruby issue got rejected, because they say the behaviour is already covered by the general notes at the top of the IO documentation. IMHO the information is very hard to notice, because it's hidden at the start of the file without subsequent reference, but I don't have the energy to try and wade through the Ruby PR process for that.
In the current implementation, `Down::ChunkedIO` has a very curious piece of behaviour where it deliberately ignores the requested character encoding in `#read` if and only if a read length is given. This breaks our application.

I'm not sure why this is done. Perhaps there was a suspicion that Ruby's `#force_encoding` method might change the byte length of the data, leading to the `length` parameter not being honoured - but this is 100% not the case:
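(The demonstration that originally followed is missing here; a minimal equivalent check, my own, shows that relabelling never changes the byte count:)

```ruby
s = "caf\xC3\xA9".b        # 5 bytes, tagged ASCII-8BIT
s.bytesize                 #=> 5
s.force_encoding(Encoding::UTF_8)
s.bytesize                 #=> 5 - same bytes; only the encoding label changed
s                          #=> "café"
```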
It would make no sense for `#force_encoding` to change the data itself, since the whole point is merely to indicate, via metadata, how that identical byte representation is to be interpreted for character presentation.

The documentation for `Down::ChunkedIO#read` specifically says bytes, not characters. So encoding can and should be respected; `#force_encoding` is the right way to do it; and the presence of a `length` parameter should not influence that behaviour.

This PR changes behaviour accordingly, with appropriately updated tests.