Unclear opinion about encodings #114

nakkamarra · 2024-10-02T19:53:20Z

Currently, there are places where astisub behaves in bizarres ways when passed a file that is not UTF-8 encoded.

For example, of the SRT and WebVTT readers, neither seem to explicitly error in this case, but all of the parsing logic doesn't work correctly since the bytes are not represented the same in the source code and the file that is being parsed (i.e UTF16 has the extra \x00 bits per bytes so we can't just check stringA == stringB since there's padding bytes).

So I guess my question is then does this library aim to support these other encodings, or only aims to support the by far more popular UTF-8? There is a notion in the codebase of the bytes order marks that are used by non UTF-8 encodings, but they seem to just be always stripped out and not actually used to check payload encoding. Personally,I think that is perfectly fine thing to not support non-UTF-8 formats, but in that case I do think this codebase might want to make a few changes to check the encoding of what it is being passed i.e a utf.Valid(...) call on the bytes passed in or something like that to distinctly clarify what is and what isn't supported.

The text was updated successfully, but these errors were encountered:

asticode · 2024-10-03T07:30:27Z

To be honest I don't really grasp what would need to be added to handle UTF-16 encoding for instance 🤔

But anyhow, here's what I think:

since this lib can only handle UTF-8, I do agree that it should return an error on reading if the content is not UTF-8
I'm not against adding compatibility with non UTF-8, it will just have to go through PRs

nakkamarra · 2024-10-03T13:23:56Z

@asticode So UTF-16 encoding encodes code points as 16 bit vs UTF-8's code points being 8-bit, so one character will be two bytes vs one byte, so you can imagine a character represented in UTF-8 by the hex value 0x01, it's value in UTF-16 would be 0x01 0x00 big endian or 0x00 0x01 little endian, the BOM in this case dictates the endianness of the encoding (whereas in UTF-8 it's kind of just a placeholder, it is sometimes considered optional).

I think that UTF-8 being so predominant on the web and since UTF-16 makes up such a small subset of cases, that it might not be worth supporting other encodings but yeah, validating that something is UTF-8 before trying to read the bytes using parser x would probably be good. I can make a PR

asticode · 2024-10-03T13:46:29Z

A PR would be great ! 👍

nakkamarra · 2024-10-03T16:24:22Z

OK, I've got a working idea of how I want to do in this draft, let me know what you think. I am unsure about the other file formats supported here and if they also expect input to be UTF-8, but probably safe to do to those as well if you agree.

nakkamarra · 2024-10-03T16:28:42Z

Another good call out would be that we can probably perform the check on the first line of the file, as the bytes of the first line will either be valid or invalid, signalling that the rest of the file is either totally invalid, or most likely valid. Thoughts?

asticode · 2024-10-04T07:10:09Z

I'd rather create one global unexported method isValidUTF8Reader(i io.Reader) bool which would wrap the input reader into a bufio.Reader and use its .Peek() method to get the first bytes and asses whether it's valid UTF-8 🤔

We would then use this method for srt and webvtt. I'll be honest, I'm not sure about the other sub formats as well, but I'd rather not add this method if it's not useful, therefore I'd rather add it if we get concrete issues with non-utf8 with those other formats.

nakkamarra · 2024-10-04T13:33:07Z

@asticode Cool, that sounds good let me update my PR

nakkamarra · 2024-10-04T14:15:22Z

FWIW I also tested a TTML file that isn't UTF-8 encoded and it errors at the xml parser level:
astisub: xml decoding failed: XML syntax error on line 1: invalid UTF-8

nakkamarra · 2024-10-05T00:37:38Z

@asticode Yeah after some testing that actually won't work I think. Since the ReadFrom* functions' reader arguments are just an io.Reader type, the implication is that they cannot be read multiple times (i.e imagine the reader is a TCP socket or some other network stream that just sends bytes) and so we can't created a bufio Reader from it and then also create the bufio Scanner that VTT / SRT ReadFrom* functions are using. So I'm not too sure what to do, all the solutions I'm thinking of would basically require copying the contents of the reader into memory like with something likes bytes.Buffer (which, with something like a ~500MB-1GB file or read stream, would make astisub way too memory hungry)

asticode · 2024-10-06T09:33:47Z

Good point 👍 I've checked out the xml package and it actually performs the check on each provided text inputs. I think the best alternative would be to go back to your previous idea, and use utf8.Valid() whenever needed. The only remaining question in my mind is whether to run this check on the whole lines provided by the scanners, or afterwards? 🤔 The first option is simpler but I want to make sure it captures all the cases

nakkamarra · 2024-10-07T02:11:35Z

I've updated my branch to go back to testing each line to be valid UTF-8 in these readers, and I've added benchmark tests and ran them against my branch vs. master and they seem to be within error.

master:

go test -bench=.
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: unknown section: [Unknown]
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: unknown section: [Unknown]
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:05:02 astisub: found another voice name "Bob" in "<v Joe>Joe says something</v> <v Bob>Bob says something</v>". Ignore
goos: darwin
goarch: arm64
pkg: github.com/asticode/go-astisub
BenchmarkOpenSRT-12                81859             12394 ns/op
BenchmarkOpenWebVTT-12             58040             19924 ns/op
PASS
ok      github.com/asticode/go-astisub  2.774s

utf-8-detection:

go test -bench=.   
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: unknown section: [Unknown]
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: unknown section: [Unknown]
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: not understood: 'Not understood line', ignoring
2024/10/06 22:06:29 astisub: found another voice name "Bob" in "<v Joe>Joe says something</v> <v Bob>Bob says something</v>". Ignore
goos: darwin
goarch: arm64
pkg: github.com/asticode/go-astisub
BenchmarkOpenSRT-12                81410             12321 ns/op
BenchmarkOpenWebVTT-12             58173             20664 ns/op
PASS
ok      github.com/asticode/go-astisub  2.822s

asticode · 2024-10-07T10:46:32Z

Can you provide benchmarks with the open file operation outside of the benchmark loop? Something like

// Read file
// Loop
  // Read srt/webvtt with content of the file as []byte

nakkamarra · 2024-10-07T14:17:39Z

master:

BenchmarkOpenSRT-12              1412808               814.1 ns/op
BenchmarkOpenWebVTT-12           1438155               844.0 ns/op

utf-8-detection:

BenchmarkOpenSRT-12              1404628               842.6 ns/op
BenchmarkOpenWebVTT-12           1410850               844.6 ns/op

nakkamarra · 2024-10-07T14:50:45Z

@asticode OK I've added some UTF-16LE test files. The github UI doesn't seem to display them differently in any way but test asserting an error from them passes and you can check them yourself:

> file -I go-astisub/testdata/example-in-non-utf8.vtt
go-astisub/testdata/example-in-non-utf8.vtt: text/plain; charset=utf-16le

> file -I go-astisub/testdata/example-in-non-utf8.srt
go-astisub/testdata/example-in-non-utf8.srt: text/plain; charset=utf-16le

nakkamarra mentioned this issue Oct 3, 2024

Enforce UTF-8 encoding for input bytes #115

Merged

nakkamarra closed this as completed Oct 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unclear opinion about encodings #114

Unclear opinion about encodings #114

nakkamarra commented Oct 2, 2024

asticode commented Oct 3, 2024

nakkamarra commented Oct 3, 2024

asticode commented Oct 3, 2024

nakkamarra commented Oct 3, 2024

nakkamarra commented Oct 3, 2024

asticode commented Oct 4, 2024

nakkamarra commented Oct 4, 2024

nakkamarra commented Oct 4, 2024

nakkamarra commented Oct 5, 2024

asticode commented Oct 6, 2024

nakkamarra commented Oct 7, 2024

asticode commented Oct 7, 2024

nakkamarra commented Oct 7, 2024

nakkamarra commented Oct 7, 2024

Unclear opinion about encodings #114

Unclear opinion about encodings #114

Comments

nakkamarra commented Oct 2, 2024

asticode commented Oct 3, 2024

nakkamarra commented Oct 3, 2024

asticode commented Oct 3, 2024

nakkamarra commented Oct 3, 2024

nakkamarra commented Oct 3, 2024

asticode commented Oct 4, 2024

nakkamarra commented Oct 4, 2024

nakkamarra commented Oct 4, 2024

nakkamarra commented Oct 5, 2024

asticode commented Oct 6, 2024

nakkamarra commented Oct 7, 2024

asticode commented Oct 7, 2024

nakkamarra commented Oct 7, 2024

nakkamarra commented Oct 7, 2024