Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix long run time and reintroduce timing benchmark test #258

Merged
merged 8 commits into from
Sep 27, 2023

Conversation

brynpickering
Copy link
Contributor

@brynpickering brynpickering commented Sep 25, 2023

Fixes long memory profiling CI runtimes.

It seems that the timeout that I "fixed" by removing in #254 was actually trying to tell us something. A change in how pandas groupby works means that the current implementation on reading data became unbelievably slow. To be honest, I'm surprised that pivoting a dataframe into a nested dict was applied for "performance" reasons to begin with...

Anyway, I've got rid of the df->dict conversion. Locally, this not only makes it possible to run the timing/memory benchmark (CI was taking hours for that test!) but actually halves the time it takes to run! Will be interesting to see if this leads to real speed-ups in PAM in more practical applications.

EDIT: this update leads to slower runtimes than the previous method after @Theodore-Chatziioannou tested it with a real dataset (10-20% slower). We should fix this and the profiling dataset so that it is catching runtimes that align better with real PAM usecases.

UPDATE: memory requirements also half when making this change. The benchmark is now ~1.4GB instead of ~2.8GB.

Checklist

Any checks which are not relevant to the PR can be pre-checked by the PR creator.
All others should be checked by the reviewer(s).
You can add extra checklist items here if required by the PR.

  • CHANGELOG updated
  • Tests added to cover contribution
  • Documentation updated

@brynpickering brynpickering mentioned this pull request Sep 25, 2023
11 tasks
memray has a time overhead that is difficult to gauge on CI runners
@codecov-commenter
Copy link

codecov-commenter commented Sep 26, 2023

Codecov Report

Merging #258 (231274e) into main (07813c8) will increase coverage by 0.69%.
Report is 135 commits behind head on main.
The diff coverage is 86.59%.

@@            Coverage Diff             @@
##             main     #258      +/-   ##
==========================================
+ Coverage   86.84%   87.53%   +0.69%     
==========================================
  Files          49       48       -1     
  Lines        5496     5746     +250     
  Branches     1372     1436      +64     
==========================================
+ Hits         4773     5030     +257     
+ Misses        462      446      -16     
- Partials      261      270       +9     
Files Coverage Δ
pam/__init__.py 100.00% <100.00%> (ø)
pam/planner/od.py 97.33% <ø> (-0.04%) ⬇️
pam/read/__init__.py 100.00% <ø> (ø)
pam/report/summary.py 93.12% <100.00%> (ø)
pam/samplers/tour.py 98.00% <100.00%> (+2.25%) ⬆️
pam/scoring.py 89.94% <100.00%> (+0.30%) ⬆️
pam/variables.py 100.00% <100.00%> (ø)
pam/write/__init__.py 100.00% <ø> (ø)
pam/activity.py 90.90% <50.00%> (+0.14%) ⬆️
pam/planner/ipf.py 98.78% <98.78%> (ø)
... and 10 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Locally it takes ~75 seconds.
On a CI runner it seems to take just over 200 seconds.
To be safe, we go for 250 seconds.
@fredshone
Copy link
Collaborator

I think Theo added this df -> dict step as a speed up N years ago. Curious to know what has happened to groupby to spoil this.

@Theodore-Chatziioannou
Copy link
Contributor

yes, let me check with a large population... Our finding back then was the opposite: that .loc was extremely slow, and using a dictionary instead gave us a performance improvement.. But this may not be the case anymore..

@brynpickering
Copy link
Contributor Author

.loc being slow seems to not be the case as of pandas v1.5.0 (see pandas-dev/pandas#23735). So it may well have been true for an old version of PAM (although the memory overhead introduced by df -> dict probably existed either way). Anyway, with new pandas the pendulum has swung the other way, so seems worth taking advantage of the benefits.

@brynpickering
Copy link
Contributor Author

RE why groupby is now so slow, it could be a regression introduced by the latest version of pandas (e.g., pandas-dev/pandas#52070), but I kinda doubt it since we're not actually operating on the data at all, just adding grouped arrays to a dictionary...

@brynpickering
Copy link
Contributor Author

Looking at where it gets stuck on df -> dict with pandas v2.1.1, this seems to be the associated issue: pandas-dev/pandas#55256

Copy link
Contributor

@Theodore-Chatziioannou Theodore-Chatziioannou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested against a large pop: loading was slightly slower to before, but not a large difference. Suggest merging and revising the indexing approach in the future.

@brynpickering brynpickering merged commit 73b9d92 into main Sep 27, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants