Data Explorer: Compute column profiles in smaller batches for large data frames, avoid timeouts #5148

wesm · 2024-10-23T22:12:12Z

Addresses #4629 and #2851. This fixes timeouts that made the data explorer appear broken on large datasets. It also improves responsiveness by computing profile updates incrementally rather than in one monolithic batch, which could take tens of seconds to return.

I found that if we spam the backend with all of the profile requests at once, the `get_data_values` request doesn't get served because of thread contention, so this change keeps only one profile request active at a time.

You can see what a 33M row dataset looks like on the initial open and then in various filtering workflows:

Screencast.from.2024-10-23.17-09-33.mp4

This does make the lack of a loading indicator somewhat more pronounced, but that can be addressed separately.

If you watch the end of the video, you can see a bug where, after the profiles are all computed, the cache is cleared and they are computed again. This bug appears to be present even without this change, so I will open another issue (see #5150).

wesm · 2024-10-24T16:02:58Z

Latest change with batches of 4 (which seems to strike a good balance of keeping the backend busy without doing too much work at once):

Screencast.from.2024-10-24.11-01-33.mp4

The double-compute bug is visible here (and is not caused by these changes), so I'll take a closer look at that.
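
For readers following along, here is a minimal sketch of the batching idea described above: split the profile requests into small groups and wait for each group to finish before sending the next. All names here (`ProfileRequest`, `ProfileResult`, `toBatches`, `computeProfilesInBatches`, `sendBatch`, `onPartial`) are hypothetical and are not taken from the actual change; only the batch size of 4 comes from this PR.

```typescript
// Hypothetical types for illustration; not the real data explorer types.
type ProfileRequest = { columnIndex: number };
type ProfileResult = { columnIndex: number; summary: string };

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Send one batch at a time and wait for it to complete before sending the
// next, so that other requests (such as fetching data values) can still be
// served in between. `sendBatch` stands in for the backend call.
async function computeProfilesInBatches(
  requests: ProfileRequest[],
  sendBatch: (batch: ProfileRequest[]) => Promise<ProfileResult[]>,
  onPartial: (results: ProfileResult[]) => void,
  batchSize: number = 4
): Promise<void> {
  for (const batch of toBatches(requests, batchSize)) {
    const results = await sendBatch(batch);
    onPartial(results); // Let the UI update incrementally.
  }
}
```

With a batch size of 1 this reduces to the one-request-at-a-time behavior from the first version of this PR; larger batches trade some responsiveness for keeping the backend busier.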

#5156: Attempts to address #5150. In the backend-state-updated event handler, the profiles were being refreshed before `fetchData` was called with an argument that invalidates all caches, including the profiles, so the profiles had to be recomputed immediately afterwards. This resolves the double computation that I observed in #5148.
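
To make the ordering problem concrete, here is a toy model of it. This is only a sketch under stated assumptions: `ExplorerCacheModel`, `computeProfiles`, `buggyHandler`, and `reorderedHandler` are hypothetical names, the `fetchData` here is a stand-in and not the real method, and dropping the early refresh is just one plausible way to avoid the duplicate work, not necessarily what #5156 does.

```typescript
// Toy model of the client-side caches; not the real data explorer code.
class ExplorerCacheModel {
  profileComputations = 0;
  private profilesValid = false;

  // Stands in for an expensive round trip to the backend.
  computeProfiles(): void {
    this.profileComputations += 1;
    this.profilesValid = true;
  }

  // Assumed behavior: optionally invalidate every cache (profiles included),
  // then recompute whatever is missing.
  fetchData(invalidateAll: boolean): void {
    if (invalidateAll) {
      this.profilesValid = false;
    }
    if (!this.profilesValid) {
      this.computeProfiles();
    }
  }
}

// Problematic order: the refreshed profiles are thrown away by the
// invalidation that follows, so they are computed twice.
function buggyHandler(cache: ExplorerCacheModel): void {
  cache.computeProfiles();
  cache.fetchData(true);
}

// One possible fix (an assumption): let the invalidating fetch be the only
// thing that triggers the profile computation.
function reorderedHandler(cache: ExplorerCacheModel): void {
  cache.fetchData(true);
}
```

In this model, `buggyHandler` leaves `profileComputations` at 2 while `reorderedHandler` leaves it at 1, which matches the double computation seen in the screencast.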

wesm · 2024-10-24T20:36:01Z

Merging this; we can keep improving the small details in follow-up work.

wesm added 2 commits October 23, 2024 16:56

For very large data frames, compute column profiles one by one rather…

fa1a591

… than in a monolithic batch

Send requests one at a time

0aff1ca

wesm requested a review from softwarenerd October 23, 2024 22:12

wesm mentioned this pull request Oct 23, 2024

Filtering in data explorer causes column profiles to be computed twice instead of only once #5150

Closed

wesm added 2 commits October 23, 2024 17:28

Simpler code

5bfdc36

Compute profiles in batches of 4 as a compromise

1fdf953

wesm changed the title ~~Data Explorer: Compute column profiles one by one for > 10M row data frames, avoid timeouts~~ Data Explorer: Compute column profiles in smaller batches for large data frames, avoid timeouts Oct 24, 2024

wesm mentioned this pull request Oct 24, 2024

Data Explorer: Fix profiles computing twice after backend state change #5156

Merged

wesm merged commit 431a76e into main Oct 24, 2024
3 checks passed

wesm deleted the bug/de-get-column-profiles-perf-timeouts branch October 24, 2024 20:36

github-actions bot locked and limited conversation to collaborators Oct 24, 2024
