Data Explorer: Compute column profiles in smaller batches for large data frames, avoid timeouts #5148
Addresses #4629 and #2851. This fixes the timeouts that made the data explorer appear broken on large datasets, and it improves performance and responsiveness by computing profile updates incrementally rather than in one monolithic batch that may take tens of seconds to return.
I found that if we spam the backend with all of the profile requests at once, the `get_data_values` request doesn't get served because of all the thread contention, so this change keeps only one profile request active at a time. You can see what a 33M row dataset looks like on the initial open and then during various filtering workflows:
Screencast.from.2024-10-23.17-09-33.mp4
This does make the lack of a loading indicator somewhat more pronounced, but that can be addressed separately.
If you watch the end of the video, you can see a bug where, after the profiles are all computed, the cache is cleared and they are computed again. This bug appears to be present even without this change, so I will open another issue (see #5150).
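
For readers who want the general shape of the batching approach described above, here is a minimal Python sketch, not the code in this PR. It assumes a hypothetical async `request_batch` callable standing in for the backend profile request, and a hypothetical `on_partial` callback standing in for the UI update; the names `chunked`, `compute_profiles_in_batches`, and `FakeBackend` are illustrative only. The point it shows is that awaiting each small batch before sending the next keeps exactly one profile request in flight, so other requests are not starved, while results arrive incrementally.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterator, List, Sequence

# Hypothetical types: a "profile" here is just a dict of summary stats.
Profile = Dict[str, float]
BatchFn = Callable[[List[int]], Awaitable[Dict[int, Profile]]]


def chunked(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


async def compute_profiles_in_batches(
    column_indices: Sequence[int],
    request_batch: BatchFn,
    on_partial: Callable[[Dict[int, Profile]], None],
    batch_size: int = 4,
) -> Dict[int, Profile]:
    """Request profiles a few columns at a time, one request in flight."""
    results: Dict[int, Profile] = {}
    for batch in chunked(column_indices, batch_size):
        # Awaiting here means the next batch is not sent until this one
        # returns, so the backend never sees more than one profile request.
        partial = await request_batch(batch)
        results.update(partial)
        on_partial(partial)  # let the caller render what is ready so far
    return results


class FakeBackend:
    """Stand-in backend (an assumption) that records peak concurrency."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.max_in_flight = 0

    async def request_batch(self, batch: List[int]) -> Dict[int, Profile]:
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0)  # yield control, simulating backend work
        self.in_flight -= 1
        return {index: {"null_count": 0.0} for index in batch}
```

The batch size of 4 is an arbitrary placeholder; a real value would be tuned against how long a single batch takes on large data frames.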