Data Explorer: Compute column profiles in smaller batches for large data frames, avoid timeouts #5148

Merged
merged 4 commits into from
Oct 24, 2024


wesm
Contributor

@wesm wesm commented Oct 23, 2024

Addresses #4629 and #2851. This fixes timeouts that made the data explorer appear broken on large datasets, and improves performance and responsiveness by computing updates incrementally rather than in a monolithic batch that may take tens of seconds to return.

I found that if we spam the backend with all of the profile requests at once, the `get_data_values` request doesn't get served because of all the thread contention, so this keeps just one profile request active at a time.

You can see what a 33M-row dataset looks like on the initial open and then in various filtering workflows:

Screencast.from.2024-10-23.17-09-33.mp4

This does make the lack of a loading indicator somewhat more pronounced, but that can be addressed separately.

If you watch the end of the video, you can see that there is a bug where after the profiles are all computed, the cache is cleared and they are computed again. This bug appears to be present even without this change so I will open another issue (see #5150).

@wesm wesm changed the title from "Data Explorer: Compute column profiles one by one for > 10M row data frames, avoid timeouts" to "Data Explorer: Compute column profiles in smaller batches for large data frames, avoid timeouts" Oct 24, 2024
@wesm
Contributor Author

wesm commented Oct 24, 2024

Latest change with batches of 4 (which seems to strike a good balance of keeping the backend busy without doing too much work at once):

Screencast.from.2024-10-24.11-01-33.mp4

The double-compute bug is visible here (and is not caused by these changes), so I'll take a closer look at that.
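For readers who want a concrete picture of the batching idea, here is a minimal sketch, not the code in this PR. It assumes a hypothetical `sendBatch` function that sends one group of column profile requests to the backend and resolves with the results; `ProfileRequest`, `ProfileResult`, `chunk`, and `computeInBatches` are all made-up names. The default batch size of 4 matches the number mentioned above.

```typescript
// Hypothetical names throughout: ProfileRequest, ProfileResult, chunk, and
// computeInBatches are illustrative and are not Positron's actual API.
type ProfileRequest = { columnIndex: number };
type ProfileResult = { columnIndex: number; summary: string };

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) {
    throw new Error("size must be at least 1");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Send one batch at a time and hand back each batch's results as they arrive.
// Because the next batch is only sent after the previous one resolves, other
// requests get a chance to be served between batches.
async function computeInBatches(
  requests: ProfileRequest[],
  sendBatch: (batch: ProfileRequest[]) => Promise<ProfileResult[]>,
  onResults: (results: ProfileResult[]) => void,
  batchSize = 4,
): Promise<void> {
  for (const batch of chunk(requests, batchSize)) {
    const results = await sendBatch(batch); // only one batch in flight
    onResults(results);
  }
}
```

With 10 columns and the default batch size, this sketch would issue three backend requests of 4, 4, and 2 columns, updating the UI after each one instead of waiting for all 10.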

wesm added a commit that referenced this pull request Oct 24, 2024
#5156)

Attempts to address #5150. In the backend-state-updated event handler,
the profiles were being refreshed just before `fetchData` was called
with an argument that invalidates all caches, including the profiles, so
the profiles had to be recomputed immediately. This resolves the double
computation that I observed in #5148.
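
The ordering problem described in this commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not the handler from the commit: `ProfileCache`, `refreshProfiles`, and both handler functions are hypothetical, and the `fetchData` here is a stand-in that only borrows the name.

```typescript
// Hypothetical model of the event handler; ProfileCache and both handler
// functions are illustrative names, not Positron's actual code.
class ProfileCache {
  computeCount = 0;
  private profiles: string[] | undefined;

  // Stand-in for the expensive backend round trip.
  refreshProfiles(): void {
    this.computeCount += 1;
    this.profiles = ["profile"];
  }

  // Stand-in for `fetchData`: optionally drops every cache first, then
  // recomputes whatever is missing.
  fetchData(invalidateCaches: boolean): void {
    if (invalidateCaches) {
      this.profiles = undefined;
    }
    if (this.profiles === undefined) {
      this.refreshProfiles();
    }
  }
}

// Problematic order: refresh, then invalidate, which forces a second compute.
function handleStateUpdatedBuggy(cache: ProfileCache): void {
  cache.refreshProfiles();
  cache.fetchData(true);
}

// One possible fix: invalidate first so the profiles are computed only once.
function handleStateUpdatedFixed(cache: ProfileCache): void {
  cache.fetchData(true);
}
```

In this model the first handler computes the profiles twice per event and the second computes them once, which mirrors the double computation described above.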
@wesm
Contributor Author

wesm commented Oct 24, 2024

Merging this. We can keep improving the small details in follow-up work.

@wesm wesm merged commit 431a76e into main Oct 24, 2024
3 checks passed
@wesm wesm deleted the bug/de-get-column-profiles-perf-timeouts branch October 24, 2024 20:36
@github-actions github-actions bot locked and limited conversation to collaborators Oct 24, 2024