Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PipelineProduct.save(): distinguish between allframes and usedframes #79

Open
oczoske opened this issue Dec 29, 2024 · 12 comments
Open
Assignees

Comments

@oczoske
Copy link
Contributor

oczoske commented Dec 29, 2024

The function cpl.dfs.save_image() takes two framesets in its argument list:

  • allframes : "The list of input frames for the recipe"
  • usedframes: "The list of raw/calibration frames used for this product"

Our method PipelineProduct.save() passes self.recipe.frameset to both arguments, which is not always appropriate. As an example, WCU templates take a number of WCU_OFF frames, that are to be combined and subtracted as background from the WCU_ON frames. It is not unreasonable to want to save the background frame as an intermediate or temporary pipeline product. In this case, usedframes should not include the WCU_ON frames as these were not used in the creation of the background image. I'm sure there are better examples among the more complex pipeline recipes.

@hugobuddel
Copy link
Contributor

Three comments to your proposal:

  1. You are right, we should ensure our framework allows the possibility of creating recipes that have output products that have such a particular data lineage. And such an optional background image might even even be a sufficient as an exception to my objection below.
  2. I would like that the headers of our products are enough to reproduce the pixels exactly. I therefore (strongly) think that recipes should not be 'allowed' to provide the data lineage of their own output, because the best (/only) way to ensure that the data lineage is trustworthy is if it is set by the framework (either cpl or our own. Any recipe that uses only part of it's input should be refactored; e.g. in this case I would advocate for a separate calculate-background recipe (I think that is a good idea for data organization reasons too).
  3. We can also (optionally) add extra layers to the FITS files, e.g. with the background. That way it is not necessary to have an extra output product. I thought we explicitly mentioned that in the DRLD, but I cannot find it anymore. This is described in one of the ESO standards, but I cannot find the right one.

So while I would prefer we design our recipes that such an escape hatch for providing the data lineage is not necessary, we didn't actually discuss this, let alone get an agreement on this. So if the rest of us think it is good to have the option for manual data lineage, then we should make this possible.

@oczoske
Copy link
Contributor Author

oczoske commented Dec 29, 2024

There is a reason why the cpl function takes two framesets rather than one. Lars should be able to guide us.

@hugobuddel
Copy link
Contributor

We can decide to be stricter for ourselves than required by ESO, and not write recipes that require such functionality. I would be an advocate of such a restriction because that would ensure we don't make mistakes, and would (in my opinion) result in better recipes and workflows. But we did not have that discussion, let alone arrive at a conclusion, and there might always be exceptions. So I said that you are right and that we should make it possible to manually select a subset of the input frames as used frames. I just don't think we should use it ;-).

@sesquideus
Copy link
Contributor

I know about this, it was meant as a temporary workaround. I am already working on a robust mechanism for recording which frames were actually used.

@sesquideus
Copy link
Contributor

It requires me to convert inputs from a list to a set (which is probably a good idea regardless of this) so maybe more things will get broken. The tests play nicely on my branch now and I was able to catch some discrepancies with the DRL-D as well.

@hugobuddel
Copy link
Contributor

I'll copy my comment from #87 because it actually applies to this discussion.

I'm not sure that #85 solves the problem as set out in #79, which I think it is aimed to do. [...]

The main reason to distinguish between "all frames" and "used frames" is when a recipe has multiple output products, and the different output products use different input. The two main cases are:

  • The recipe has extra (optional) outputs. The example in PipelineProduct.save(): distinguish between allframes and usedframes #79 is that many recipes calculate an intermediate data product that represents the background. This background is usually discarded, but it is worthwhile to have an option to save the background to file (in addition to saving the primary output of the recipe). The primary output of the recipe would have used all input files, but the background would only have used a subset of those. So the file with the background should have fewer files listed as dependency than the primary output.
  • A recipe can process multiple files in parallel. For example the "basic reduce" recipes could for example process all science raws from a single template in parallel. A benefit would be that calibrations like the master flat and master dark would be shared between all of them, and thus would have to be loaded into memory only once. But all frames would require their own individual persistence correction. So each file should be saved with only one persistence correction as used frame.

A bit more that is not in my #87 comment:

It seems that #85 determines on the recipe level whether each input is valid, and discards the input from the used frames if it is not valid. This is a less common usecase for the used frames I think (but I have not checked). In the two cases above, all input can be valid, it's just that they are not all used for all output.

I also would like to note that an 'invalid' frame could still affect the output, and in that case should still be counted as used frame. For example, if there are 10 input raw flats, and a median filtering is applied to determine that one of the frames is an outlier and should not be used in the stacking. Then that one invalid frame would still have contributed to the median and therefore does affect the output. As in, rerunning the recipe with only the 9 valid frames might cause a different result, because then maybe one of the other frames would have become an outlier. This might be rare though.

Summarizing that last point: I think that rerunning the recipe with only the used frames in the SOF file should give the exact same output product (except for timestamps and such). If the output product would be different when a specific frame is omitted from the input SOF, then that specific frame should be part of the used frames of that output product.

@oczoske
Copy link
Contributor Author

oczoske commented Feb 7, 2025

Not reading through this discussion and just dropping something that I just noticed: PipelineProduct.save() currently takes the usedframes from self.recipe.used_frames, so it treats it as a recipe property. It should be a product property (self.used_frames) because it differs for different product types within the same recipe. (I wanted to implement that in metis_cal_chophome but haven't gotten round to doing so yet).

@hugobuddel
Copy link
Contributor

Yes that's what I said. What's the point of these issues if we don't read what the other people write?

@oczoske
Copy link
Contributor Author

oczoske commented Feb 7, 2025

Where?

@hugobuddel
Copy link
Contributor

Where?

Here:

The main reason to distinguish between "all frames" and "used frames" is when a recipe has multiple output products, and the different output products use different input.

I did not describe technically what is in the code, as I assumed that Martin knows that since Martin wrote it.

Perhaps I used a bit too many words?

Let's not argue though, and instead spend our time making things work.

@oczoske
Copy link
Contributor Author

oczoske commented Feb 7, 2025

My point was just that the current implementation (supposedly already a fix) does not seem to do what you correctly say it should do.

@sesquideus
Copy link
Contributor

@oczoske Sadly, yes.
I would recommend to

  • leave it like this for this release (but document that right now it indeed does not do what it should)
  • later explicitly mark which frames are used (possibly the Mixins can automate some of that, but otherwise it will have to be marked manually)
  • maybe devise something clever, a thin wrapper or what, that will track usage automatically

@sesquideus sesquideus self-assigned this Feb 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants