
jaq Unexpectedly Fails to Parse Large JSON Files While jq Does Not #255

Closed
ETCaton opened this issue Jan 15, 2025 · 5 comments
Labels
good first issue Good for newcomers

Comments

@ETCaton

ETCaton commented Jan 15, 2025

I ran into this while using RechatTool to download and process Twitch VOD logs. Here is an example reproducing it with a VOD from https://www.twitch.tv/alveussanctuary

❯ jaq --version
jaq 2.1.0
❯ dotnet RechatTool.dll -D 2353101492
❯ eza -lah 2353101492.json
Permissions Size User    Date Modified Name
.rw-r--r--@  15M jaqtester 14 Jan 20:22  2353101492.json
❯ wc -l 2353101492.json
  590649 2353101492.json
❯ jaq ".[].commenter.displayName" 2353101492.json
Error: failed to parse: value expected
❯ jq ".[].commenter.displayName" 2353101492.json | head -5
"jojoyy88"
"Fossabot"
"ZEEroable"
"foszki"
"jojoyy88"

whereas a similarly formatted test file works fine

❯ bat foo.json
[
  {
    "commenter": {
      "displayName": "test"
    }
  }
]
❯ jaq ".[].commenter.displayName" foo.json
"test"
❯ jq ".[].commenter.displayName" foo.json
"test"

While not perfect, this might help?

❯ git diff
diff --git a/jaq/src/main.rs b/jaq/src/main.rs
index bd9db5e..6bad138 100644
--- a/jaq/src/main.rs
+++ b/jaq/src/main.rs
@@ -279,7 +279,7 @@ where
 }
 
 fn read_slice<'a>(cli: &Cli, slice: &'a [u8]) -> Box<dyn Iterator<Item = io::Result<Val>> + 'a> {
-    if cli.raw_input {
+    if dbg!(cli.raw_input) {
         let read = io::BufReader::new(slice);
         Box::new(raw_input(cli.slurp, read).map(|r| r.map(Val::from)))
     } else {
@@ -395,7 +395,7 @@ fn run(
     let ctx = Ctx::new(vars, &iter);
 
     for item in if cli.null_input { &null } else { &iter } {
-        let input = item.map_err(Error::Parse)?;
+        let input = dbg!(item).unwrap();
         //println!("Got {:?}", input);
         for output in filter.run((ctx.clone(), input)) {
             let output = output.map_err(Error::Jaq)?;

❯ RUST_BACKTRACE=1 cargo run ".[].commenter.displayName" 2353101492.json
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.05s
     Running `target/debug/jaq '.[].commenter.displayName' 2353101492.json`
[jaq/src/main.rs:282:8] cli.raw_input = false
[jaq/src/main.rs:398:21] item = Err(
    "value expected",
)

thread 'main' panicked at jaq/src/main.rs:398:32:
called `Result::unwrap()` on an `Err` value: "value expected"
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fe9b9751fa54a5871b87cd36a582f9b7b06123fd/library/std/src/panicking.rs:692:5
   1: core::panicking::panic_fmt
             at /rustc/fe9b9751fa54a5871b87cd36a582f9b7b06123fd/library/core/src/panicking.rs:75:14
   2: core::result::unwrap_failed
             at /rustc/fe9b9751fa54a5871b87cd36a582f9b7b06123fd/library/core/src/result.rs:1704:5
   3: core::result::Result<T,E>::unwrap
             at /Users/jaqtester/.rustup/toolchains/beta-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1109:23
   4: jaq::run
             at ./jaq/src/main.rs:398:21
   5: jaq::real_main::{{closure}}
             at ./jaq/src/main.rs:115:21
   6: jaq::with_stdout
             at ./jaq/src/main.rs:522:5
   7: jaq::real_main
             at ./jaq/src/main.rs:114:24
   8: jaq::main
             at ./jaq/src/main.rs:56:11
   9: core::ops::function::FnOnce::call_once
             at /Users/jaqtester/.rustup/toolchains/beta-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

The above project requires the soon-to-be-deprecated .NET 6, so let me know if I should provide the JSON in question; I'm just hesitant about posting technically public but hard-to-find things like "all messages in a VOD" on a random GitHub issue 🙃.

@01mf02
Owner

01mf02 commented Jan 15, 2025

Hi @ETCaton, thanks for your message!

Given your taste for dbg!, I have prepared a small sample that should print the byte offset of the parse error:

diff --git a/jaq/src/main.rs b/jaq/src/main.rs
index bd9db5e..63110b9 100644
--- a/jaq/src/main.rs
+++ b/jaq/src/main.rs
@@ -250,7 +250,10 @@ fn json_slice(slice: &[u8]) -> impl Iterator<Item = io::Result<Val>> + '_ {
     let mut lexer = hifijson::SliceLexer::new(slice);
     core::iter::from_fn(move || {
         use hifijson::token::Lex;
-        Some(Val::parse(lexer.ws_token()?, &mut lexer).map_err(invalid_data))
+        Some(Val::parse(lexer.ws_token()?, &mut lexer).map_err(|e| {
+            dbg!(lexer.as_slice().as_ptr() as usize - slice.as_ptr() as usize);
+            invalid_data(e)
+        }))
     })
 }

This should help you pinpoint the source of the error. Posting the JSON file is not necessary (but I would be interested in the cause of the error nonetheless).

And yes, I think that jaq should show the error position automatically, perhaps with codesnake. I think this would make for a nice first issue.
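The byte-offset trick in the diff above (subtracting the original slice's base pointer from the lexer's remaining-input pointer) can be sketched standalone. This is a minimal illustration, not jaq's actual code; the `offset_in` helper name is made up for the example:

```rust
// Sketch: computing how far into the original input a parser got, by
// comparing the base pointer of the original slice with the pointer of
// the remaining (unconsumed) subslice. Both slices must come from the
// same allocation for the subtraction to be meaningful.
fn offset_in(original: &[u8], remaining: &[u8]) -> usize {
    remaining.as_ptr() as usize - original.as_ptr() as usize
}

fn main() {
    let input = b"[1, 2, oops]";
    // Suppose the lexer stopped right before "oops" (byte offset 7):
    let rest = &input[7..];
    assert_eq!(offset_in(input, rest), 7);
    println!("parse error at byte offset {}", offset_in(input, rest));
}
```

An offset of 1, as reported below, would mean the lexer gave up after consuming only the very first byte of the file.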

@01mf02 01mf02 added the good first issue Good for newcomers label Jan 15, 2025
@ETCaton
Author

ETCaton commented Jan 15, 2025

[jaq/src/main.rs:254:13] lexer.as_slice().as_ptr() as usize - slice.as_ptr() as usize = 1

Which kind of makes sense since that dbg! only showed it going through the iterator once(?)

My guess was that jaq either isn't fully reading the file or something else isn't fully copying it, so I started a binary search on the number of elements it takes to trigger the error, which did yield something interesting. There are 15,419 elements in the original input. Halving that to 7,710 gives

Error: cannot use null as iterable (array or object)

which, after binary search, only starts above 1,828 elements or 67,865 lines of text. Neither of those is particularly close to a max value for an integer type, except maybe u16 for line length, given that using jq -c ".[0:1827]" works while omitting -c causes the error.

I'll see if I can accomplish the same with the above foo.json and post the number of (non -c formatted) elements that causes an error. Or if I can filter down some elements of the original.

@01mf02
Owner

01mf02 commented Jan 17, 2025

[jaq/src/main.rs:254:13] lexer.as_slice().as_ptr() as usize - slice.as_ptr() as usize = 1

Which kind of makes sense since that dbg! only showed it going through the iterator once(?)

This is actually very weird and I do not understand it.
That indicates that already the very first byte in your input caused a lex or parse error. Can you tell us how you invoked jaq to yield this error message?

You can also add dbg!(String::from_utf8(lexer.as_slice().to_vec())); to the debug block I wrote above to print the whole input after the first lex/parse error. But if my hypothesis is correct, this will just print the whole input, except for the first byte.

Have you gotten anywhere bisecting your input file?

@wader
Contributor

wader commented Jan 17, 2025

Could it be a byte order mark (BOM)? It's unnecessary for UTF-8 and a "MUST NOT" according to the JSON spec, but parsers are allowed to ignore it, which jq does: https://github.com/jqlang/jq/blob/588ff1874c8c394253c231733047a550efe78260/src/jv_parse.c#L733-L748

$ echo -e '\xef\xbb\xbf123' | jq
123

$ echo -e '\xef\xbb\xbf123' | jaq
Error: failed to parse: value expected

$ echo -e '\xef\xbb\xbf123' | gojq
gojq: invalid json: <stdin>
    123
    ^  invalid character 'ï' looking for beginning of value

# also you can try this to cut away the BOM
$ echo -e '\xef\xbb\xbf213' | cut -b 4- | jaq
213
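A parser-side fix could simply skip a leading UTF-8 BOM (the three bytes EF BB BF) before lexing. A minimal sketch in Rust, assuming the fix would be applied to the input slice before it reaches the lexer (not jaq's actual code):

```rust
// Sketch: tolerate a leading UTF-8 byte order mark, similar to what jq
// does internally, by stripping it before parsing.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

fn strip_bom(input: &[u8]) -> &[u8] {
    // strip_prefix returns None if the prefix is absent, so this is a
    // no-op for BOM-free input.
    input.strip_prefix(UTF8_BOM).unwrap_or(input)
}

fn main() {
    assert_eq!(strip_bom(b"\xEF\xBB\xBF123"), b"123");
    assert_eq!(strip_bom(b"123"), b"123"); // unchanged without a BOM
}
```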

@ETCaton
Author

ETCaton commented Jan 17, 2025

Yep, that does appear to be it! A cursory look into C#-land seems to imply that it sort of automatically emits a BOM for UTF-8(?), but I'll file an issue upstream: https://stackoverflow.com/questions/5266069/streamwriter-and-utf-8-byte-order-marks

@01mf02 @wader thanks for the help debugging! I think this issue can be closed.
