Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear opinion about encodings #114

Closed
nakkamarra opened this issue Oct 2, 2024 · 14 comments
Closed

Unclear opinion about encodings #114

nakkamarra opened this issue Oct 2, 2024 · 14 comments

Comments

@nakkamarra
Copy link
Contributor

Currently, there are places where astisub behaves in bizarres ways when passed a file that is not UTF-8 encoded.

For example, of the SRT and WebVTT readers, neither seem to explicitly error in this case, but all of the parsing logic doesn't work correctly since the bytes are not represented the same in the source code and the file that is being parsed (i.e UTF16 has the extra \x00 bits per bytes so we can't just check stringA == stringB since there's padding bytes).

So I guess my question is then does this library aim to support these other encodings, or only aims to support the by far more popular UTF-8? There is a notion in the codebase of the bytes order marks that are used by non UTF-8 encodings, but they seem to just be always stripped out and not actually used to check payload encoding. Personally,I think that is perfectly fine thing to not support non-UTF-8 formats, but in that case I do think this codebase might want to make a few changes to check the encoding of what it is being passed i.e a utf.Valid(...) call on the bytes passed in or something like that to distinctly clarify what is and what isn't supported.

@asticode
Copy link
Owner

asticode commented Oct 3, 2024

To be honest I don't really grasp what would need to be added to handle UTF-16 encoding for instance 🤔

But anyhow, here's what I think:

  • since this lib can only handle UTF-8, I do agree that it should return an error on reading if the content is not UTF-8
  • I'm not against adding compatibility with non UTF-8, it will just have to go through PRs

@nakkamarra
Copy link
Contributor Author

@asticode So UTF-16 encoding encodes code points as 16 bit vs UTF-8's code points being 8-bit, so one character will be two bytes vs one byte, so you can imagine a character represented in UTF-8 by the hex value 0x01, it's value in UTF-16 would be 0x01 0x00 big endian or 0x00 0x01 little endian, the BOM in this case dictates the endianness of the encoding (whereas in UTF-8 it's kind of just a placeholder, it is sometimes considered optional).

I think that UTF-8 being so predominant on the web and since UTF-16 makes up such a small subset of cases, that it might not be worth supporting other encodings but yeah, validating that something is UTF-8 before trying to read the bytes using parser x would probably be good. I can make a PR

@asticode
Copy link
Owner

asticode commented Oct 3, 2024

A PR would be great ! 👍

@nakkamarra
Copy link
Contributor Author

OK, I've got a working idea of how I want to do in this draft, let me know what you think. I am unsure about the other file formats supported here and if they also expect input to be UTF-8, but probably safe to do to those as well if you agree.

@nakkamarra
Copy link
Contributor Author

Another good call out would be that we can probably perform the check on the first line of the file, as the bytes of the first line will either be valid or invalid, signalling that the rest of the file is either totally invalid, or most likely valid. Thoughts?

@asticode
Copy link
Owner

asticode commented Oct 4, 2024

I'd rather create one global unexported method isValidUTF8Reader(i io.Reader) bool which would wrap the input reader into a bufio.Reader and use its .Peek() method to get the first bytes and asses whether it's valid UTF-8 🤔

We would then use this method for srt and webvtt. I'll be honest, I'm not sure about the other sub formats as well, but I'd rather not add this method if it's not useful, therefore I'd rather add it if we get concrete issues with non-utf8 with those other formats.

@nakkamarra
Copy link
Contributor Author

@asticode Cool, that sounds good let me update my PR

@nakkamarra
Copy link
Contributor Author

FWIW I also tested a TTML file that isn't UTF-8 encoded and it errors at the xml parser level:
astisub: xml decoding failed: XML syntax error on line 1: invalid UTF-8

@nakkamarra
Copy link
Contributor Author

@asticode Yeah after some testing that actually won't work I think. Since the ReadFrom* functions' reader arguments are just an io.Reader type, the implication is that they cannot be read multiple times (i.e imagine the reader is a TCP socket or some other network stream that just sends bytes) and so we can't created a bufio Reader from it and then also create the bufio Scanner that VTT / SRT ReadFrom* functions are using. So I'm not too sure what to do, all the solutions I'm thinking of would basically require copying the contents of the reader into memory like with something likes bytes.Buffer (which, with something like a ~500MB-1GB file or read stream, would make astisub way too memory hungry)

@asticode
Copy link
Owner

asticode commented Oct 6, 2024

Good point 👍 I've checked out the xml package and it actually performs the check on each provided text inputs. I think the best alternative would be to go back to your previous idea, and use utf8.Valid() whenever needed. The only remaining question in my mind is whether to run this check on the whole lines provided by the scanners, or afterwards? 🤔 The first option is simpler but I want to make sure it captures all the cases

@nakkamarra
Copy link
Contributor Author

I've updated my branch to go back to testing each line to be valid UTF-8 in these readers, and I've added benchmark tests and ran them against my branch vs. master and they seem to be within error.

master:

go test -bench=.
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: unknown section: [Unknown]
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: unknown section: [Unknown]
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: found another voice name "Bob" in "<v Joe>Joe says something</v> <v Bob>Bob says something</v>". Ignore
goos: darwin
goarch: arm64
pkg: github.com/asticode/go-astisub
BenchmarkOpenSRT-12                81859             12394 ns/op
BenchmarkOpenWebVTT-12             58040             19924 ns/op
PASS
ok      github.com/asticode/go-astisub  2.774s

utf-8-detection:

go test -bench=.   
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: unknown section: [Unknown]
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: unknown section: [Unknown]
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: found another voice name "Bob" in "<v Joe>Joe says something</v> <v Bob>Bob says something</v>". Ignore
goos: darwin
goarch: arm64
pkg: github.com/asticode/go-astisub
BenchmarkOpenSRT-12                81410             12321 ns/op
BenchmarkOpenWebVTT-12             58173             20664 ns/op
PASS
ok      github.com/asticode/go-astisub  2.822s

@asticode
Copy link
Owner

asticode commented Oct 7, 2024

Can you provide benchmarks with the open file operation outside of the benchmark loop? Something like

// Read file
// Loop
  // Read srt/webvtt with content of the file as []byte

@nakkamarra
Copy link
Contributor Author

master:

BenchmarkOpenSRT-12              1412808               814.1 ns/op
BenchmarkOpenWebVTT-12           1438155               844.0 ns/op

utf-8-detection:

BenchmarkOpenSRT-12              1404628               842.6 ns/op
BenchmarkOpenWebVTT-12           1410850               844.6 ns/op

@nakkamarra
Copy link
Contributor Author

@asticode OK I've added some UTF-16LE test files. The github UI doesn't seem to display them differently in any way but test asserting an error from them passes and you can check them yourself:

> file -I go-astisub/testdata/example-in-non-utf8.vtt
go-astisub/testdata/example-in-non-utf8.vtt: text/plain; charset=utf-16le

> file -I go-astisub/testdata/example-in-non-utf8.srt
go-astisub/testdata/example-in-non-utf8.srt: text/plain; charset=utf-16le

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants