Fix long run time and reintroduce timing benchmark test #258
Conversation
memray has a time overhead that is difficult to gauge on CI runners
Codecov Report

```diff
@@            Coverage Diff             @@
##             main     #258      +/-   ##
==========================================
+ Coverage   86.84%   87.53%   +0.69%
==========================================
  Files          49       48       -1
  Lines        5496     5746     +250
  Branches     1372     1436      +64
==========================================
+ Hits         4773     5030     +257
+ Misses        462      446      -16
- Partials      261      270       +9
```
Locally it takes ~75 seconds. On a CI runner it seems to take just over 200 seconds. To be safe, we go for 250 seconds.
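The headroom reasoning above can be sketched as a plain time-based guard (a minimal, hypothetical stand-in; the project's real benchmark uses memray and its own test harness):

```python
import time


def run_workload():
    # Placeholder for the real benchmarked workload (hypothetical).
    time.sleep(0.01)


def test_benchmark_within_timeout():
    # 250 s leaves headroom over the ~200 s observed on CI runners
    # (~75 s locally), per the discussion above.
    timeout_seconds = 250
    start = time.perf_counter()
    run_workload()
    elapsed = time.perf_counter() - start
    assert elapsed < timeout_seconds
    return elapsed
```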
I think Theo added this df -> dict step as a speed-up N years ago. Curious to know what has happened to groupby to spoil this.
yes, let me check with a large population... Our finding back then was the opposite: that …
RE why groupby is now so slow, it could be a regression introduced by the latest version of pandas (e.g., pandas-dev/pandas#52070), but I kinda doubt it since we're not actually operating on the data at all, just adding grouped arrays to a dictionary...
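For context, the pattern being discussed looks roughly like this (a minimal sketch with hypothetical column names, not the project's actual code):

```python
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame({
    "pid": ["a", "a", "b", "b"],
    "value": [1, 2, 3, 4],
})

# The suspect step: no computation on the groups at all,
# just collecting each grouped sub-frame into a dictionary.
grouped = {pid: group for pid, group in df.groupby("pid")}
```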
Looking at where it gets stuck on …
tested against a large pop: loading was slightly slower than before, but not by a large margin. Suggest merging and revising the indexing approach in the future.
Fixes long memory profiling CI runtimes.
It seems that the timeout that I "fixed" by removing in #254 was actually trying to tell us something. A change in how pandas groupby works means that the current implementation of reading data became unbelievably slow. To be honest, I'm surprised that pivoting a dataframe into a nested dict was introduced for "performance" reasons to begin with...
Anyway, I've gotten rid of the df -> dict conversion. Locally, this not only makes it possible to run the timing/memory benchmark (CI was taking hours for that test!) but actually halves the runtime! It will be interesting to see whether this leads to real speed-ups in PAM in more practical applications.
EDIT: after @Theodore-Chatziioannou tested with a real dataset, this update turned out to be 10-20% slower than the previous method. We should fix this, and update the profiling dataset so that it catches runtimes that better reflect real PAM use cases.
UPDATE: memory requirements are also halved by this change. The benchmark is now ~1.4GB instead of ~2.8GB.
Checklist
Any checks which are not relevant to the PR can be pre-checked by the PR creator.
All others should be checked by the reviewer(s).
You can add extra checklist items here if required by the PR.