Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up emicorr, consider use of bottleneck package #9008

Closed
stscijgbot-jp opened this issue Dec 13, 2024 · 16 comments
Closed

Speed up emicorr, consider use of bottleneck package #9008

stscijgbot-jp opened this issue Dec 13, 2024 · 16 comments

Comments

@stscijgbot-jp
Copy link
Collaborator

Issue JP-3819 was created on JIRA by Timothy Brandt:

numpy's nanmedian and masked array operations are very slow.  The bottleneck package fixes this.  The required changes would be an added dependency (or a try/except block) and changing np.nanmedian to bn.nanmedian.  Speedups by factors of several are likely in various places.  Is there a reason not to use bottleneck?

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Timothy Brandt on JIRA:

One example: replacing masked arrays with bottleneck calculations (and replacing an instance of np.where() with boolean indexing) can speed up emicorr by a factor of 2-3 on my laptop with just 3 or 4 lines of code changes (120s -> 50s in apply_emicorr for jw01283001001_03101_00001_mirimage_uncal.fits).

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Timothy Brandt on JIRA:

See here: https://github.com/t-brandt/jwst/tree/emicorr_speedup for (I think) a working branch with a handful of changes.  I am seeing a factor of 3-4 speedup in the emicorr step.

@stscijgbot-jp stscijgbot-jp changed the title Use of bottleneck to speed up nanmedian Use of bottleneck to speed up nanmedian, emicorr Dec 20, 2024
@stscijgbot-jp
Copy link
Collaborator Author

Comment by David Law on JIRA:

Tyler Pauly Thoughts on introducing a pipeline dependency on the 'bottleneck' package?  This implements a host of long-running numpy functions as transparent C versions.  I'm hesitant to add dependencies, but the gains from this for some of our long-running steps seem clear.  Looks like jdaviz has added this recently too.

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Timothy Brandt on JIRA:

Tyler Pauly There is also an intermediate option: add bottleneck as a "soft" dependency with something like the following at the beginning of routines:

import numpy as np
try:
    import bottleneck as bn
except ImportError as err:
    msg = "Please install the bottleneck package for improved performance.  Falling back on numpy.  To install, use pip install bottleneck."
    warnings.warn(msg)
    bn = np   # set bn as an alias for numpy```
I would tend to favor the hard dependency, but defer to the judgment of others.

@stscijgbot-jp
Copy link
Collaborator Author

Comment by David Law on JIRA:

Adding a note that this was discussed at the Jan 8 JP meeting and no objections were raised to the bottleneck package, with a preference for the aliasing soft dependency approach above.

I think we can move ahead with evaluating specific PRs that use this code.

Timothy Brandt can you please submit a PR for the emicorr branch above?

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Timothy Brandt on JIRA:

#9077

Please feel free to make suggestions on exactly how the "soft" dependency is implemented.  Speedup from this PR is a factor of 4 on jw01668002001_04101_00001_mirimage (120s -> 30s) on my Mac with bottleneck installed, and by a factor of 3 (120s -> 40s) without it.

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Timothy Brandt on JIRA:

The "check types" automated check on the PR is failing because bottleneck is not a requirement.  It looks to me like a "soft dependency" is not compatible with the pipeline checks.  

@stscijgbot-jp
Copy link
Collaborator Author

Comment by David Law on JIRA:

Perhaps it would be simplest then to approach this in the other direction; add bottleneck to the pyproject dependency list but keep the try/except code block.  That way everyone gets the speed gains automatically, but if we encounter any unexpected issues over the next few months there's an easy workaround at the package-management level rather than needing to edit multiple lines of code.

@braingram
Copy link
Collaborator

Thanks for looking into this and working on the improvements. I'm hesitant to adopt bottleneck for a few reasons.

  1. bottleneck is only for native endian data. Since most systems are little endian and FITS is big endian arrays read from a file will not go through bottleneck but arrays created in memory will. I doubt our regression tests cover this type of nuance and likely need to be expanded to provide any reasonable coverage of bottleneck use.
  2. bottleneck shows numerical inaccuracies for large arrays of float32 data. Catastrohpic accuracy loss in large float32 array for nanmean and nanstd pydata/bottleneck#462 This recently led astropy to stop using bottleneck for float32 BUG: (slow, steady and) correct wins the race: prefer numpy over bottleneck for nanfunctions on float32 arrays astropy/astropy#17204

If we do adopt bottleneck I suggest that we do so as a hard dependency. I don't believe the pipeline has other "soft" dependencies and adding one would introduce a new axis against which tests would need to be run doubling the number of regtests we would need to run (one with bottleneck one without).

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Timothy Brandt on JIRA:

That's a good point about byte order–that can be nasty when fits format clashes with the native byte order of a computer.  I hadn't appreciated that.  Regarding accuracy, this seems to only affect means and standard deviations.  Nanmean is so much faster than nanmedian that I would be happy just using np.nanmean and bn.nanmedian.  I think it's a valid question whether the possible byte order issue is a showstopper.  I think this could be handled by a jwst pipeline utility that checks for byte order and the presence of bottleneck and then chooses either bn.nanmedian or np.nanmedian, as appropriate.  It's an extra function, though.

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Timothy Brandt on JIRA:

Thinking about this a little more in light of Brett Graham's github comment on a regtest failing, I think that this is an argument in favor of requiring bottleneck.  I don't think there is any guarantee when running regtests that a user will not have bottleneck installed, right?  I do, after all!  So the user may or may not experience a failure, depending on the configuration of their environment in a way not imposed by the dependencies?

@braingram
Copy link
Collaborator

How much is emicorr sped up with the changes introduced in #9077 without using bottleneck?

My vote is to not include bottleneck as a dependency given the numerical accuracy issue, the lack of non-native endian support, and the related astropy issue.

@melanieclarke
Copy link
Collaborator

Thanks for reviewing this, @braingram. I was not aware that astropy had removed some support for bottleneck recently, or that bottleneck has not had maintainers for a while. Given these issues, I agree with you - I'm not sure it's worth pulling in as a dependency in its current state.

@stscijgbot-jp
Copy link
Collaborator Author

Comment by David Law on JIRA:

Thanks Brett Graham for digging into this; based on your findings I'd agree that we should not incorporate bottleneck at the present time.  Speed gains are good, but not at the risk of accuracy or repeatability.  If the package gets dusted off at some point we can reconsider later.

Timothy Brandt Presumably the factor of 3 gains that we get without bottleneck is mostly from not copying the entire large data array?

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Timothy Brandt on JIRA:

Thank you all; I have removed bottleneck.  There is still a factor of 3 speedup (rather than a factor of 4).  David Law: nesting the copy of data substantially reduces the memory footprint, but the speedup comes from using boolean indexing rather than np.where here

https://github.com/t-brandt/jwst/blob/1b8d7584936ed9b34a2fd16725c7a2ada8d5fac1/jwst/emicorr/emicorr.py#L389

and using array slicing instead of advanced indexing here

https://github.com/t-brandt/jwst/blob/1b8d7584936ed9b34a2fd16725c7a2ada8d5fac1/jwst/emicorr/emicorr.py#L458

 

@stscijgbot-jp stscijgbot-jp changed the title Use of bottleneck to speed up nanmedian, emicorr Speed up emicorr, consider use of bottleneck package Jan 24, 2025
@stscijgbot-jp
Copy link
Collaborator Author

Comment by Melanie Clarke on JIRA:

Fixes not including bottleneck merged from:

#9077

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants