Unclear opinion about encodings #114
To be honest I don't really grasp what would need to be added to handle UTF-16 encoding for instance 🤔 But anyhow, here's what I think:
@asticode So UTF-16 encodes code points as 16-bit code units vs UTF-8's 8-bit code units, so an ASCII character will be two bytes in UTF-16 vs one byte in UTF-8. I think that with UTF-8 being so predominant on the web, and UTF-16 making up such a small subset of cases, it might not be worth supporting other encodings. But yeah, validating that something is UTF-8 before trying to read the bytes using parser x would probably be good. I can make a PR
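To make the byte-level difference concrete, here is a minimal Go sketch (the literal strings are my own example values) comparing the same ASCII text in both encodings:

```go
package main

import "fmt"

func main() {
	// "WEBVTT" as UTF-8: one byte per ASCII character.
	utf8Bytes := []byte("WEBVTT")

	// The same text as UTF-16LE: each ASCII character is followed
	// by a 0x00 padding byte, so a naive byte comparison fails.
	utf16leBytes := []byte{'W', 0, 'E', 0, 'B', 0, 'V', 0, 'T', 0, 'T', 0}

	fmt.Printf("UTF-8:    % x (%d bytes)\n", utf8Bytes, len(utf8Bytes))
	fmt.Printf("UTF-16LE: % x (%d bytes)\n", utf16leBytes, len(utf16leBytes))
	fmt.Println("equal:", string(utf8Bytes) == string(utf16leBytes)) // false
}
```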
A PR would be great! 👍
OK, I've got a working idea of how I want to do this in this draft, let me know what you think. I am unsure whether the other file formats supported here also expect input to be UTF-8, but it's probably safe to apply the same check to them as well if you agree.
Another good call-out: we could probably perform the check on just the first line of the file, since the bytes of the first line will either be valid or invalid, signalling that the rest of the file is either totally invalid or most likely valid. Thoughts?
I'd rather create one global unexported method. We would then use this method for srt and webvtt. I'll be honest, I'm not sure about the other sub formats either, but I'd rather not add this method where it isn't useful, so I'd rather only add it to those other formats if we get concrete issues with non-UTF-8 input there.
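A minimal sketch of what such a helper could look like — the names `isValidUTF8Line` and `errInvalidUTF8` are my own placeholders, not the actual unexported method being discussed:

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a placeholder error, not astisub's actual API.
var errInvalidUTF8 = errors.New("astisub: line is not valid UTF-8")

// isValidUTF8Line sketches the shared unexported helper: the srt and
// webvtt readers would call it on each scanned line before parsing it.
func isValidUTF8Line(line []byte) error {
	if !utf8.Valid(line) {
		return errInvalidUTF8
	}
	return nil
}

func main() {
	fmt.Println(isValidUTF8Line([]byte("00:00:01,000 --> 00:00:02,000"))) // <nil>
	fmt.Println(isValidUTF8Line([]byte{0xff, 0xfe, 'W', 0}))              // astisub: line is not valid UTF-8
}
```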
@asticode Cool, that sounds good, let me update my PR
FWIW I also tested a TTML file that isn't UTF-8 encoded, and it errors at the xml parser level:
@asticode Yeah after some testing that actually won't work I think. Since the |
Good point 👍 I've checked out the |
I've updated my branch to go back to testing each line for valid UTF-8 in these readers, and I've added benchmark tests and ran them against my branch vs. master; they seem to be within the margin of error.

master:

utf-8-detection:
Can you provide benchmarks with the open file operation outside of the benchmark loop? Something like:

```
// Read file
// Loop
//   Read srt/webvtt with content of the file as []byte
```
master:
utf-8-detection:
@asticode OK, I've added some UTF-16LE test files. The GitHub UI doesn't seem to display them differently in any way, but the test asserting an error from them passes, and you can check them yourself:
Currently, there are places where astisub behaves in bizarre ways when passed a file that is not UTF-8 encoded.

For example, of the SRT and WebVTT readers, neither seems to explicitly error in this case, but the parsing logic doesn't work correctly, since the bytes are not represented the same way in the source code and in the file being parsed (i.e. UTF-16 adds extra `\x00` padding bytes, so we can't just check stringA == stringB).

So I guess my question is: does this library aim to support these other encodings, or only the by far more popular UTF-8? There is a notion in the codebase of the byte order marks used by non-UTF-8 encodings, but they seem to always be stripped out and never actually used to check the payload encoding. Personally, I think it is a perfectly fine thing not to support non-UTF-8 encodings, but in that case I do think this codebase might want to make a few changes to check the encoding of what it is being passed, i.e. a `utf8.Valid(...)` call on the bytes passed in, or something like that, to distinctly clarify what is and isn't supported.