
perf: replace default engine JSON reader's FileStream with concurrent futures #711

Open
wants to merge 36 commits into main

Conversation

@zachschuermann (Collaborator) commented Feb 21, 2025

What changes are proposed in this pull request?

The original FileStream API, though intended to make concurrent GET requests to the object store, actually made serial requests and relied on a hand-written poll function to implement Stream. This PR aims to make a minimal change that (1) increases JSON reader performance by issuing concurrent GET requests and (2) simplifies the code by removing the need for a custom Stream, instead leveraging existing functions/adapters to convert the files to read into a Stream and issue concurrent requests through the futures::stream::buffered adapter.

This is effectively a similar improvement as in #595 but for the JSON reader.

Specifically, the changes are:

  1. replace the FileStream::new_async_read_iterator() call (the manually implemented Stream) with an inline conversion of the files slice into a Stream (via stream::iter), using the futures::stream::buffered adapter to execute the file-opening futures concurrently. Results are then sent across an mpsc channel to bridge the async/sync gap (see the sketch after this list).
  2. JsonOpener no longer implements FileOpener (which requires a synchronous fn open()) and instead directly exposes an async fn open() for easier/simpler use above. This removes all reliance on FileStream/FileOpener in the JSON reader.
  3. adds a custom ObjectStore implementation, OrderedGetStore, to deterministically control the order in which GET request futures are resolved
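
At a high level, the new read path looks roughly like the following sketch. This is an illustration of the combinator chain described above, not the PR's actual code - the generic signature and names (forward_in_order, open_futures) are placeholders:

```rust
use std::future::Future;
use std::sync::mpsc;

use futures::stream::{self, StreamExt, TryStream, TryStreamExt};

/// Sketch: turn one `open` future per file into a stream, run up to `readahead`
/// of them concurrently (order-preserving), flatten each file's batch stream
/// into one stream, and forward every result over an mpsc channel so a
/// synchronous caller can iterate the receiver.
async fn forward_in_order<F, S, T, E>(
    open_futures: Vec<F>,
    readahead: usize,
    tx: mpsc::Sender<Result<T, E>>,
) where
    F: Future<Output = Result<S, E>>,
    S: TryStream<Ok = T, Error = E> + Unpin,
{
    let mut stream = stream::iter(open_futures)
        .buffered(readahead) // concurrent opens, results yielded in input order
        .try_flatten(); // concatenate the per-file streams, preserving Errs
    while let Some(item) = stream.next().await {
        if tx.send(item).is_err() {
            break; // receiver hung up; nothing left to do
        }
    }
}
```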

How was this change tested?

added a test with a new OrderedGetStore which resolves the GET requests in a jumbled order, while we expect the test to return results in the natural order of the requests. Additionally, manually validated that we went from serial JSON file reads to concurrent reads.


codecov bot commented Feb 21, 2025

Codecov Report

Attention: Patch coverage is 79.94186% with 69 lines in your changes missing coverage. Please review.

Project coverage is 83.99%. Comparing base (4c00de4) to head (1e6f746).

Files with missing lines | Patch % | Lines
kernel/src/engine/default/json.rs | 79.94% | 62 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #711      +/-   ##
==========================================
- Coverage   84.02%   83.99%   -0.04%     
==========================================
  Files          77       77              
  Lines       18063    18349     +286     
  Branches    18063    18349     +286     
==========================================
+ Hits        15178    15412     +234     
- Misses       2167     2217      +50     
- Partials      718      720       +2     

☔ View full report in Codecov by Sentry.

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 21, 2025
@zachschuermann zachschuermann removed the breaking-change Change that will require a version bump label Feb 24, 2025
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 24, 2025
@zachschuermann (Collaborator Author):

> wish the open function could still be a bit simpler, but let's not spin our wheels on that too much. This mostly looks good, just a couple of small things.

yea i do too but mostly it just looks like a copypasta of the arrow_json way of async stream decoding

@scovich (Collaborator) left a comment:

Sorry, somehow forgot to post this yesterday :(

- readahead: 10,
- batch_size: 1024,
+ readahead: 1000,
+ batch_size: 1024 * 128,
Collaborator:

How does this batch size influence behavior?

AFAIK, the vast majority of Delta commits are tiny -- just a few file actions -- and so a large batch size may not be especially helpful in the common case. It may also not hurt, depending on how it's used, hence the question.

Note that we do expect to see absurdly massive Delta commits on occasion -- tens of GB or more -- if e.g. a big CREATE [OR REPLACE] TABLE AS operation commits.

Collaborator Author:

After a bit of digging, this is the number of rows that we decode at once (keep resident in memory) until we yield a batch. I've updated doc comments throughout but TLDR it's just a limit on the number of rows in each output batch

Comment on lines 104 to 105
// check err?
let _ = tx.send(item);
Collaborator:

Right now, we're just sending all results to the receiver -- errors and all?

Collaborator Author:

I think I've updated this since your review - yeah, we send errors over the channel; this now just warn!s if send returns an error (which would mean no one is listening on the other end)
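
For context, a minimal sketch of that behavior (hypothetical helper, not the PR's exact code):

```rust
use std::sync::mpsc;

use tracing::warn;

/// Sketch: every item, Ok or Err, is forwarded to the receiver; a failed send
/// only means the receiving side was dropped, so we log it and stop early.
fn forward_results<T, E>(
    results: impl IntoIterator<Item = Result<T, E>>,
    tx: mpsc::Sender<Result<T, E>>,
) {
    for item in results {
        if tx.send(item).is_err() {
            warn!("receiver dropped before all results were sent");
            break;
        }
    }
}
```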


let mut stream = stream::iter(file_futures)
.buffered(readahead)
.try_flatten()
Collaborator:

What does try_flatten do? And where is it defined/documented?
(my google-fu is apparently weak today)

Collaborator:

Ah, found it -- TryStreamExt::try_flatten (not to be confused with TryFutureExt::try_flatten).

So open returns a future (whose Ok result is a stream) and try_flatten effectively concatenates all those streams into a single stream, but preserving any Err results?

And this is the key to preserving order, because each stream is ordered within its file, and the flattened stream guarantees that

> each individual stream will get exhausted before moving on to the next

Collaborator Author:

yes exactly! (and I'll document this more in line)

the key to ordering is that both buffered and try_flatten are combinators on the stream which each retain ordering
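
A small, self-contained illustration of that ordering guarantee (toy data, not the kernel code): buffered(n) polls up to n futures concurrently but yields their outputs in input order, and try_flatten exhausts each inner stream before moving to the next.

```rust
use futures::executor::block_on;
use futures::stream::{self, StreamExt, TryStreamExt};

fn main() {
    // three "files", each opening to a stream of two (file, row) items
    let file_futures = (0..3).map(|file| async move {
        Ok::<_, ()>(stream::iter((0..2).map(move |row| Ok::<_, ()>((file, row)))))
    });
    let results = block_on(
        stream::iter(file_futures)
            .buffered(2) // up to 2 concurrent "opens", output order == input order
            .try_flatten() // exhaust each file's stream before the next
            .collect::<Vec<_>>(),
    );
    assert_eq!(
        results,
        vec![Ok((0, 0)), Ok((0, 1)), Ok((1, 0)), Ok((1, 1)), Ok((2, 0)), Ok((2, 1))]
    );
}
```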

GetResultPayload::File(file, _) => {
let reader = ReaderBuilder::new(schema)
.with_batch_size(batch_size)
.build(BufReader::new(file))?;
Collaborator:

If you can see this, consider hiding whitespace:
[image]

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 27, 2025
})
.collect();

let _ = future::join_all(handles).await;
Collaborator:

Do we actually need this call if we use mpsc::IntoIter?

> This iterator will block whenever next is called, waiting for a new message, and None will be returned if the corresponding channel has hung up.

(see above)

Collaborator Author:

hm I ran into some odd behavior without it - it looks like we just don't wait on any of the spawned tasks to finish and then we 'finish' the test without actually doing anything. can look into this more deeply later :)

Collaborator:

Yeah, we probably need to do what the actual json read code is doing, and produce a flattened stream of futures.

}
}

/// Set the maximum number of batches to read ahead during [Self::read_json_files()].
/// Deprecated: use [Self::with_buffer_size()].
Collaborator:

Trying to avoid a breaking change or something?

Collaborator Author:

yep exactly - I may just collect some of these "we need a breaking change sometime" items into an issue, and then whenever we decide to pursue 0.8 (and have an actual need for breaking changes) we can remove some of these deprecated functions

Comment on lines 626 to 627
// note: join_all is ordered
let files = future::join_all(file_futures).await;
Collaborator:

it may be ordered, but it also materializes the entire list up front (and could cause silent data loss if the mpsc overflows).

Is there not a way to try-flatten the streams into a single stream that we then convert to a blocking iterator?

Collaborator:

That said -- I don't think this test actually adds any value over the new test that leverages the ordered object store. Two items is too few to reliably catch races, and if there were a race, we don't want a test that only notices some of the time.

I think as long as we have tested that our stream machinery preserves order, and verified that the json reads return correct data at all, probably don't need much or any testing for the combination of the two?

Put another way -- what code path(s) does this test exercise, that other tests did not cover?

Collaborator:

Also -- what does it mean for join_all to be "ordered" in the first place? I thought spawn kicked off the tasks independently, and so they could complete in any order even if nobody ever joins on them?

Collaborator Author:

(working on making this test a better one than just the two items it had before)

for context on both tests:

  1. test_ordered_get_store is just a test to validate that our special OrderedGetStore does the right thing
  2. test_read_json_files_ordering is actually using the OrderedGetStore to set up a specific out-of-order test so that we ensure read_json_files hands things back in the correct order

Collaborator Author:

> Also -- what does it mean for join_all to be "ordered" in the first place? I thought spawn kicked off the tasks independently, and so they could complete in any order even if nobody ever joins on them?

regardless of using spawn or not, it means that the list of futures (JoinHandles if spawned, or some other futures if not) resolves to results in its original order - the returned files are in the original order of the list of file_futures, NOT in the order that the futures complete.
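
A small example of that guarantee (assuming tokio with the rt/macros/time features): the tasks finish out of order, but join_all reports results in the order the handles were passed in.

```rust
use futures::future;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let delays_ms = [30u64, 10, 20];
    let handles: Vec<_> = delays_ms
        .into_iter()
        .enumerate()
        .map(|(i, ms)| {
            tokio::spawn(async move {
                sleep(Duration::from_millis(ms)).await; // completes out of order
                i
            })
        })
        .collect();
    // join_all resolves to outputs in the order of `handles`, not completion order
    let results: Vec<usize> = future::join_all(handles)
        .await
        .into_iter()
        .map(|r| r.unwrap())
        .collect();
    assert_eq!(results, vec![0, 1, 2]);
}
```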

@scovich (Collaborator) left a comment:

I still don't understand why the unit test behaves the way it does, but the logic in the actual json reader looks correct.


@@ -2,19 +2,22 @@

use std::io::BufReader;
use std::ops::Range;
- use std::sync::Arc;
- use std::task::{ready, Poll};
+ use std::sync::{mpsc, Arc};
Collaborator:

Consider tokio::sync::mpsc instead? much faster, designed to be used in async context

Collaborator:

we don't depend on tokio in the default engine except for implementing executors in terms of it. it might be fine, but for now we can stay with the stdlib

let result = self.inner.get(location).await;

// we implement a future which only resolves once the requested path is next in order
future::poll_fn(move |cx| {
Collaborator:

What happens when one slow task is at the front of the line? Everything just waits for that right? I think in an ideal network situation this works fine, but if one slow future is at the front it seems like this just log jams the entire process.

Collaborator:

Yeah, but what we're simulating here is a specific ordering of data returned. We're not trying to check if things are performant or anything. So if there's a "slow" request in this case, it implies that all the other requests must be slower, since we've specified the order they should return up front.

Really this is a test for "can kernel handle it when async stuff returns out of order".

@scovich (Collaborator) commented Feb 28, 2025:

In particular: kernel's log replay requires that results come back in the order they were requested in, not the order they completed in. That's a correctness constraint. And yes, if there's a straggler at the head of the queue (in real life) that does mean everybody else is waiting. I would hope the async machinery still allows the tasks deeper in the queue to make progress meanwhile.

This test is forcing out of order completion to ensure the results are still returned in order.

}

#[tokio::test(flavor = "multi_thread", worker_threads = 3)]
async fn test_read_json_files_ordering() {
Collaborator:

I think it might be helpful to have a test that exceeds the buffering limit.

Collaborator Author:

added!

@nicklan (Collaborator) left a comment:

just a couple of small things but basically lgtm

@@ -159,3 +159,4 @@ tracing-subscriber = { version = "0.3", default-features = false, features = [
"env-filter",
"fmt",
] }
async-trait = "0.1" # only used for our custom SlowGetStore ObjectStore implementation
Collaborator:

nit: keep alphabetical.

Collaborator Author:

moved it up, to after our path-based deps but above the others - though it doesn't look like those are in order either..

Collaborator:

hah right, we should actually alphabetize those at some point :)


state.ordered_keys.pop_front().unwrap();

// there are three possible cases, either:
// 1. the next key has a waker already registered, in which case we wake it up
Collaborator:

nit: maybe note that this is the case where something has already requested the next key in line, so that's why there is a waker waiting, and we need to wake it up

Collaborator Author:

added more!
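
For readers following along, a stripped-down sketch of that waker bookkeeping (hypothetical types and names - the real OrderedGetStore in the test wraps an ObjectStore and returns GET results):

```rust
use std::collections::{HashMap, VecDeque};
use std::future;
use std::sync::Mutex;
use std::task::{Poll, Waker};

struct OrderingState {
    ordered_keys: VecDeque<String>, // keys in the order they are allowed to resolve
    wakers: HashMap<String, Waker>, // callers that arrived before their turn
}

/// Sketch: park until `key` is at the front of `ordered_keys`; when it is, pop
/// it and wake the next key's waiter if that waiter has already registered.
async fn resolve_in_order(state: &Mutex<OrderingState>, key: String) {
    future::poll_fn(|cx| {
        let mut guard = state.lock().unwrap();
        if guard.ordered_keys.front() == Some(&key) {
            guard.ordered_keys.pop_front();
            // case 1: the next key already has a waker registered -> wake it;
            // otherwise either no one has requested it yet, or no keys remain
            let next = guard.ordered_keys.front().cloned();
            if let Some(waker) = next.and_then(|n| guard.wakers.remove(&n)) {
                waker.wake();
            }
            Poll::Ready(())
        } else {
            // not our turn yet: register (or refresh) our waker and wait
            guard.wakers.insert(key.clone(), cx.waker().clone());
            Poll::Pending
        }
    })
    .await
}
```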


let result = self.inner.get(location).await;

// we implement a future which only resolves once the requested path is next in order
future::poll_fn(move |cx| {
Collaborator:

Why not just have this return Poll::Ready(result)?

Collaborator Author:

poll_fn takes an FnMut, so returning the result directly would require either (1) capturing the result in the closure and moving it out on each call (impossible - it would be moved multiple times) or (2) doing the self.inner.get directly inside the poll_fn, which I think is also difficult since the poll_fn closure is synchronous and we want to be able to .await.

let me know if I'm missing something but i played with it for a second and came up with those items!
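
A minimal illustration of that constraint (not the PR's code): poll_fn requires an FnMut, so a closure that moves a captured result out is rejected; one hypothetical workaround is stashing the value in an Option and take()-ing it on the final poll.

```rust
use std::future;
use std::task::Poll;

async fn resolve_once(result: String) -> String {
    // `future::poll_fn(move |_cx| Poll::Ready(result))` would NOT compile:
    // moving `result` out makes the closure FnOnce, but poll_fn needs FnMut.
    let mut slot = Some(result);
    future::poll_fn(move |_cx| Poll::Ready(slot.take().expect("polled after completion"))).await
}
```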

@zachschuermann zachschuermann removed the breaking-change Change that will require a version bump label Feb 28, 2025
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 28, 2025
- fn new(inner: T, ordered_keys: impl Into<VecDeque<Path>>) -> Self {
-     let ordered_keys = ordered_keys.into();
+ fn new(inner: T, ordered_keys: &[Path]) -> Self {
+     let ordered_keys: Vec<Path> = ordered_keys.to_vec();
Collaborator:

Probably don't need the type annotation?

Collaborator Author:

ah yep

Labels
breaking-change Change that will require a version bump

4 participants