
jaq Unexpectedly Fails to Parse Large JSON Files While jq Does Not #255

Closed
ETCaton opened this issue Jan 15, 2025 · 5 comments
Labels
good first issue Good for newcomers

Comments

@ETCaton

ETCaton commented Jan 15, 2025

I ran into this while using RechatTool to download and process Twitch VOD logs. Here is an example reproducing it with a VOD from https://www.twitch.tv/alveussanctuary

❯ jaq --version
jaq 2.1.0
❯ dotnet RechatTool.dll -D 2353101492
❯ eza -lah 2353101492.json
Permissions Size User    Date Modified Name
.rw-r--r--@  15M jaqtester 14 Jan 20:22  2353101492.json
❯ wc -l 2353101492.json
  590649 2353101492.json
❯ jaq ".[].commenter.displayName" 2353101492.json
Error: failed to parse: value expected
❯ jq ".[].commenter.displayName" 2353101492.json | head -5
"jojoyy88"
"Fossabot"
"ZEEroable"
"foszki"
"jojoyy88"

whereas a similarly formatted test file works fine

❯ bat foo.json
[
  {
    "commenter": {
      "displayName": "test"
    }
  }
]
❯ jaq ".[].commenter.displayName" foo.json
"test"
❯ jq ".[].commenter.displayName" foo.json
"test"

While not perfect, this might help?

❯ git diff
diff --git a/jaq/src/main.rs b/jaq/src/main.rs
index bd9db5e..6bad138 100644
--- a/jaq/src/main.rs
+++ b/jaq/src/main.rs
@@ -279,7 +279,7 @@ where
 }
 
 fn read_slice<'a>(cli: &Cli, slice: &'a [u8]) -> Box<dyn Iterator<Item = io::Result<Val>> + 'a> {
-    if cli.raw_input {
+    if dbg!(cli.raw_input) {
         let read = io::BufReader::new(slice);
         Box::new(raw_input(cli.slurp, read).map(|r| r.map(Val::from)))
     } else {
@@ -395,7 +395,7 @@ fn run(
     let ctx = Ctx::new(vars, &iter);
 
     for item in if cli.null_input { &null } else { &iter } {
-        let input = item.map_err(Error::Parse)?;
+        let input = dbg!(item).unwrap();
         //println!("Got {:?}", input);
         for output in filter.run((ctx.clone(), input)) {
             let output = output.map_err(Error::Jaq)?;

❯ RUST_BACKTRACE=1 cargo run ".[].commenter.displayName" 2353101492.json
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.05s
     Running `target/debug/jaq '.[].commenter.displayName' 2353101492.json`
[jaq/src/main.rs:282:8] cli.raw_input = false
[jaq/src/main.rs:398:21] item = Err(
    "value expected",
)

thread 'main' panicked at jaq/src/main.rs:398:32:
called `Result::unwrap()` on an `Err` value: "value expected"
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fe9b9751fa54a5871b87cd36a582f9b7b06123fd/library/std/src/panicking.rs:692:5
   1: core::panicking::panic_fmt
             at /rustc/fe9b9751fa54a5871b87cd36a582f9b7b06123fd/library/core/src/panicking.rs:75:14
   2: core::result::unwrap_failed
             at /rustc/fe9b9751fa54a5871b87cd36a582f9b7b06123fd/library/core/src/result.rs:1704:5
   3: core::result::Result<T,E>::unwrap
             at /Users/jaqtester/.rustup/toolchains/beta-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1109:23
   4: jaq::run
             at ./jaq/src/main.rs:398:21
   5: jaq::real_main::{{closure}}
             at ./jaq/src/main.rs:115:21
   6: jaq::with_stdout
             at ./jaq/src/main.rs:522:5
   7: jaq::real_main
             at ./jaq/src/main.rs:114:24
   8: jaq::main
             at ./jaq/src/main.rs:56:11
   9: core::ops::function::FnOnce::call_once
             at /Users/jaqtester/.rustup/toolchains/beta-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

The above project requires the soon-to-be-deprecated .NET 6, so let me know if I should provide the JSON in question; I'm just hesitant about posting technically public but hard-to-find things like "all messages in a VOD" on a random GitHub issue 🙃.

@01mf02
Owner

01mf02 commented Jan 15, 2025

Hi @ETCaton, thanks for your message!

Given your taste for dbg!, I have prepared a small sample that should print the byte offset of the parse error:

diff --git a/jaq/src/main.rs b/jaq/src/main.rs
index bd9db5e..63110b9 100644
--- a/jaq/src/main.rs
+++ b/jaq/src/main.rs
@@ -250,7 +250,10 @@ fn json_slice(slice: &[u8]) -> impl Iterator<Item = io::Result<Val>> + '_ {
     let mut lexer = hifijson::SliceLexer::new(slice);
     core::iter::from_fn(move || {
         use hifijson::token::Lex;
-        Some(Val::parse(lexer.ws_token()?, &mut lexer).map_err(invalid_data))
+        Some(Val::parse(lexer.ws_token()?, &mut lexer).map_err(|e| {
+            dbg!(lexer.as_slice().as_ptr() as usize - slice.as_ptr() as usize);
+            invalid_data(e)
+        }))
     })
 }

This should help you pinpoint the source of the error. Posting the JSON file is not necessary (but I would be interested in the cause of the error nonetheless).

And yes, I think that jaq should show the error position automatically, perhaps with codesnake. I think this would make for a nice first issue.
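The byte-offset trick in the diff above (subtracting the original slice's base pointer from the lexer's remaining-input pointer) can be sketched standalone. This is a minimal illustration, not jaq's actual code; the `offset_in` helper name is made up for the example:

```rust
// Sketch: computing how far into the original input a parser got, by
// comparing the base pointer of the original slice with the pointer of
// the remaining (unconsumed) subslice. Both slices must come from the
// same allocation for the subtraction to be meaningful.
fn offset_in(original: &[u8], remaining: &[u8]) -> usize {
    remaining.as_ptr() as usize - original.as_ptr() as usize
}

fn main() {
    let input = b"[1, 2, oops]";
    // Suppose the lexer stopped right before "oops" (byte offset 7):
    let rest = &input[7..];
    assert_eq!(offset_in(input, rest), 7);
    println!("parse error at byte offset {}", offset_in(input, rest));
}
```

An offset of 1, as reported below, would mean the lexer gave up after consuming only the very first byte of the file.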

@01mf02 01mf02 added the good first issue Good for newcomers label Jan 15, 2025
@ETCaton
Author

ETCaton commented Jan 15, 2025

[jaq/src/main.rs:254:13] lexer.as_slice().as_ptr() as usize - slice.as_ptr() as usize = 1

Which kind of makes sense since that dbg! only showed it going through the iterator once(?)

My guess was that jaq either isn't fully reading the file or something else isn't fully copying it, so I started a binary search on the number of elements it takes to trigger the error, which did yield something interesting. There are 15,419 elements in the original input. Halving that to 7,710 gives

Error: cannot use null as iterable (array or object)

which, after binary search, only starts above 1,828 elements or 67,865 lines of text. Neither of those is particularly close to a max value for an integer type, except maybe u16 for line length, given that using jq -c ".[0:1827]" works while omitting -c causes the error.

I'll see if I can accomplish the same with the above foo.json and post the number of (non -c formatted) elements that causes an error. Or if I can filter down some elements of the original.

@01mf02
Owner

01mf02 commented Jan 17, 2025

[jaq/src/main.rs:254:13] lexer.as_slice().as_ptr() as usize - slice.as_ptr() as usize = 1

Which kind of makes sense since that dbg! only showed it going through the iterator once(?)

This is actually very weird and I do not understand it.
That indicates that already the very first byte in your input caused a lex or parse error. Can you tell us how you invoked jaq to yield this error message?

You can also add dbg!(String::from_utf8(lexer.as_slice().to_vec())); to the debug block I wrote above to print the whole input after the first lex/parse error. But if my hypothesis is correct, this will just print the whole input, except for the first byte.

Have you gotten anywhere bisecting your input file?

@wader
Contributor

wader commented Jan 17, 2025

Could it be a byte order mark (BOM)? It's unnecessary for UTF-8 and a "MUST NOT" according to the JSON spec, but parsers are allowed to ignore it, which jq does: https://github.com/jqlang/jq/blob/588ff1874c8c394253c231733047a550efe78260/src/jv_parse.c#L733-L748

$ echo -e '\xef\xbb\xbf123' | jq
123

$ echo -e '\xef\xbb\xbf123' | jaq
Error: failed to parse: value expected

$ echo -e '\xef\xbb\xbf123' | gojq
gojq: invalid json: <stdin>
    123
    ^  invalid character 'ï' looking for beginning of value

# also you can try this to cut away the BOM
$ echo -e '\xef\xbb\xbf213' | cut -b 4- | jaq
213
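A parser-side fix could simply skip a leading UTF-8 BOM (the three bytes EF BB BF) before lexing. A minimal sketch in Rust, assuming the fix would be applied to the input slice before it reaches the lexer (not jaq's actual code):

```rust
// Sketch: tolerate a leading UTF-8 byte order mark, similar to what jq
// does internally, by stripping it before parsing.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

fn strip_bom(input: &[u8]) -> &[u8] {
    // strip_prefix returns None if the prefix is absent, so this is a
    // no-op for BOM-free input.
    input.strip_prefix(UTF8_BOM).unwrap_or(input)
}

fn main() {
    assert_eq!(strip_bom(b"\xEF\xBB\xBF123"), b"123");
    assert_eq!(strip_bom(b"123"), b"123"); // unchanged without a BOM
}
```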

@ETCaton
Author

ETCaton commented Jan 17, 2025

Yep, that does appear to be it! A cursory look into C#-land seems to imply that it sort of automatically emits a BOM for UTF-8(?), but I'll file an issue upstream: https://stackoverflow.com/questions/5266069/streamwriter-and-utf-8-byte-order-marks

@01mf02 @wader thanks for the help debugging! I think this issue can be closed.
