jaq Unexpectedly Fails to Parse Large JSON Files While jq Does Not #255
Comments
Hi @ETCaton, thanks for your message! Given your taste for
diff --git a/jaq/src/main.rs b/jaq/src/main.rs
index bd9db5e..63110b9 100644
--- a/jaq/src/main.rs
+++ b/jaq/src/main.rs
@@ -250,7 +250,10 @@ fn json_slice(slice: &[u8]) -> impl Iterator<Item = io::Result<Val>> + '_ {
     let mut lexer = hifijson::SliceLexer::new(slice);
     core::iter::from_fn(move || {
         use hifijson::token::Lex;
-        Some(Val::parse(lexer.ws_token()?, &mut lexer).map_err(invalid_data))
+        Some(Val::parse(lexer.ws_token()?, &mut lexer).map_err(|e| {
+            dbg!(lexer.as_slice().as_ptr() as usize - slice.as_ptr() as usize);
+            invalid_data(e)
+        }))
     })
 }
This should help you pinpoint the source of the error. Posting the JSON file is not necessary (but I would be interested in the cause of the error nonetheless). And yes, I think that jaq should show the error position automatically, perhaps with
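
The patch above prints a raw byte offset into the input. As a rough sketch of how such an offset could be turned into something easier to read, the helper below converts it to a 1-based line and column. The name `line_col` is hypothetical and not part of jaq or hifijson; columns are counted in bytes, not characters, and offsets past the end are clamped.

```rust
/// Convert a byte offset into a 1-based (line, column) pair.
/// Hypothetical helper for illustration; not part of jaq or hifijson.
/// Columns count bytes, and offsets past the end are clamped to the input length.
fn line_col(input: &[u8], offset: usize) -> (usize, usize) {
    let before = &input[..offset.min(input.len())];
    // Every newline before the offset starts a new line.
    let line = before.iter().filter(|&&b| b == b'\n').count() + 1;
    // The column is the distance from the last newline (or from the start).
    let col = match before.iter().rposition(|&b| b == b'\n') {
        Some(newline) => before.len() - newline,
        None => before.len() + 1,
    };
    (line, col)
}
```

Feeding it the value printed by `dbg!` together with the original slice would give a position that can be looked up directly in an editor.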
Which kind of makes sense since that
I took a stab at a guess: either it isn't fully reading the file, or something else isn't fully copying the file. I started a binary search on the number of elements needed to cause an error, which did yield something interesting. There are 15,419 elements in the original input. If I halve that to 7,710 we get
which, after binary search, only starts above 1,828 elements or 67,865 lines of text. Neither is particularly close to the maximum value of an integer type, except maybe u16 (65,535) for line length, given that using
I'll see if I can accomplish the same with the above
This is actually very weird and I do not understand it. You can also add
Have you gotten anywhere bisecting your input file?
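
The bisection over the number of elements discussed above can be sketched roughly as follows. This is only an illustration under a stated assumption: failure is monotonic, i.e. once the first `n` elements fail to parse, every larger prefix fails too. The function `first_failing_count` and its `fails` callback are hypothetical names; in practice `fails(n)` would write the first `n` elements to a file and check whether jaq reports an error.

```rust
/// Find the smallest element count `n` in `1..=total` for which `fails(n)` is true.
/// Assumes monotonic failure: if `n` elements fail, any larger count fails as well.
/// Returns `None` if even the full input does not fail.
/// Hypothetical helper for illustration only.
fn first_failing_count(total: usize, fails: impl Fn(usize) -> bool) -> Option<usize> {
    if total == 0 || !fails(total) {
        return None;
    }
    // Invariant: `lo` elements do not fail (0 elements trivially), `hi` elements fail.
    let (mut lo, mut hi) = (0, total);
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if fails(mid) {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    Some(hi)
}
```

With the numbers reported in this thread (15,419 elements, errors only above 1,828), a predicate of the form `|n| n > 1828` would make the search settle on 1,829 after about 14 probes.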
Could it be a byte order mark (BOM)? It's unnecessary for UTF-8 and also a "MUST NOT" according to the JSON spec, but parsers are allowed to ignore it, which jq does: https://github.com/jqlang/jq/blob/588ff1874c8c394253c231733047a550efe78260/src/jv_parse.c#L733-L748
$ echo -e '\xef\xbb\xbf123' | jq
123
$ echo -e '\xef\xbb\xbf123' | jaq
Error: failed to parse: value expected
$ echo -e '\xef\xbb\xbf123' | gojq
gojq: invalid json: <stdin>
123
^ invalid character 'ï' looking for beginning of value
# also you can try this to cut away the BOM
$ echo -e '\xef\xbb\xbf213' | cut -b 4- | jaq
213
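
The `cut -b 4-` workaround above drops the first three bytes unconditionally. A minimal Rust sketch of the same idea that only removes the bytes when they really are a UTF-8 BOM might look like this. `strip_bom` and `UTF8_BOM` are hypothetical names, not something jaq or hifijson is known to provide; the sketch assumes the input is available as a byte slice, as in the `json_slice` function patched earlier in this thread.

```rust
/// UTF-8 encoding of U+FEFF, the byte order mark.
const UTF8_BOM: &[u8] = b"\xef\xbb\xbf";

/// Return `input` without a leading UTF-8 BOM, or unchanged if there is none.
/// Hypothetical helper for illustration; not part of jaq or hifijson.
fn strip_bom(input: &[u8]) -> &[u8] {
    input.strip_prefix(UTF8_BOM).unwrap_or(input)
}
```

Applying such a helper to the slice before handing it to the lexer would mirror jq's lenient behaviour; whether a parser should accept a BOM at all is a design choice the JSON spec leaves open.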
Yep, that does appear to be it! A cursory look into C#-land seems to imply that it sort of automatically creates it for UTF-8(?), but I'll file an issue upstream: https://stackoverflow.com/questions/5266069/streamwriter-and-utf-8-byte-order-marks
@01mf02 @wader thanks for the help debugging! I think this issue can be closed.
I ran into this while using RechatTool to download and process Twitch VOD logs. An example to reproduce using a VOD from https://www.twitch.tv/alveussanctuary
whereas a similarly formatted test file works fine
While not perfect, this might help?
The above project requires soon-to-be-deprecated .NET 6 so let me know if I should provide the JSON in question; I just am hesitant about posting technically public but hard to find things like "all messages in a VOD" on a random GitHub issue 🙃.