`PipelineProduct.save()`: distinguish between `allframes` and `usedframes` #79

oczoske · 2024-12-29T14:09:12Z

The function cpl.dfs.save_image() takes two framesets in its argument list:

allframes : "The list of input frames for the recipe"
usedframes: "The list of raw/calibration frames used for this product"

Our method PipelineProduct.save() passes self.recipe.frameset to both arguments, which is not always appropriate. As an example, WCU templates take a number of WCU_OFF frames, that are to be combined and subtracted as background from the WCU_ON frames. It is not unreasonable to want to save the background frame as an intermediate or temporary pipeline product. In this case, usedframes should not include the WCU_ON frames as these were not used in the creation of the background image. I'm sure there are better examples among the more complex pipeline recipes.

The text was updated successfully, but these errors were encountered:

hugobuddel · 2024-12-29T18:08:31Z

Three comments to your proposal:

You are right, we should ensure our framework allows the possibility of creating recipes that have output products that have such a particular data lineage. And such an optional background image might even even be a sufficient as an exception to my objection below.
I would like that the headers of our products are enough to reproduce the pixels exactly. I therefore (strongly) think that recipes should not be 'allowed' to provide the data lineage of their own output, because the best (/only) way to ensure that the data lineage is trustworthy is if it is set by the framework (either cpl or our own. Any recipe that uses only part of it's input should be refactored; e.g. in this case I would advocate for a separate calculate-background recipe (I think that is a good idea for data organization reasons too).
We can also (optionally) add extra layers to the FITS files, e.g. with the background. That way it is not necessary to have an extra output product. I thought we explicitly mentioned that in the DRLD, but I cannot find it anymore. This is described in one of the ESO standards, but I cannot find the right one.

So while I would prefer we design our recipes that such an escape hatch for providing the data lineage is not necessary, we didn't actually discuss this, let alone get an agreement on this. So if the rest of us think it is good to have the option for manual data lineage, then we should make this possible.

oczoske · 2024-12-29T19:28:26Z

There is a reason why the cpl function takes two framesets rather than one. Lars should be able to guide us.

hugobuddel · 2024-12-29T20:34:05Z

We can decide to be stricter for ourselves than required by ESO, and not write recipes that require such functionality. I would be an advocate of such a restriction because that would ensure we don't make mistakes, and would (in my opinion) result in better recipes and workflows. But we did not have that discussion, let alone arrive at a conclusion, and there might always be exceptions. So I said that you are right and that we should make it possible to manually select a subset of the input frames as used frames. I just don't think we should use it ;-).

sesquideus · 2025-01-26T13:27:01Z

I know about this, it was meant as a temporary workaround. I am already working on a robust mechanism for recording which frames were actually used.

sesquideus · 2025-01-28T14:20:00Z

It requires me to convert inputs from a list to a set (which is probably a good idea regardless of this) so maybe more things will get broken. The tests play nicely on my branch now and I was able to catch some discrepancies with the DRL-D as well.

hugobuddel · 2025-01-29T13:16:38Z

I'll copy my comment from #87 because it actually applies to this discussion.

I'm not sure that #85 solves the problem as set out in #79, which I think it is aimed to do. [...]

The main reason to distinguish between "all frames" and "used frames" is when a recipe has multiple output products, and the different output products use different input. The two main cases are:

The recipe has extra (optional) outputs. The example in PipelineProduct.save(): distinguish between allframes and usedframes #79 is that many recipes calculate an intermediate data product that represents the background. This background is usually discarded, but it is worthwhile to have an option to save the background to file (in addition to saving the primary output of the recipe). The primary output of the recipe would have used all input files, but the background would only have used a subset of those. So the file with the background should have fewer files listed as dependency than the primary output.

A recipe can process multiple files in parallel. For example the "basic reduce" recipes could for example process all science raws from a single template in parallel. A benefit would be that calibrations like the master flat and master dark would be shared between all of them, and thus would have to be loaded into memory only once. But all frames would require their own individual persistence correction. So each file should be saved with only one persistence correction as used frame.

A bit more that is not in my #87 comment:

It seems that #85 determines on the recipe level whether each input is valid, and discards the input from the used frames if it is not valid. This is a less common usecase for the used frames I think (but I have not checked). In the two cases above, all input can be valid, it's just that they are not all used for all output.

I also would like to note that an 'invalid' frame could still affect the output, and in that case should still be counted as used frame. For example, if there are 10 input raw flats, and a median filtering is applied to determine that one of the frames is an outlier and should not be used in the stacking. Then that one invalid frame would still have contributed to the median and therefore does affect the output. As in, rerunning the recipe with only the 9 valid frames might cause a different result, because then maybe one of the other frames would have become an outlier. This might be rare though.

Summarizing that last point: I think that rerunning the recipe with only the used frames in the SOF file should give the exact same output product (except for timestamps and such). If the output product would be different when a specific frame is omitted from the input SOF, then that specific frame should be part of the used frames of that output product.

oczoske · 2025-02-07T11:04:16Z

Not reading through this discussion and just dropping something that I just noticed: PipelineProduct.save() currently takes the usedframes from self.recipe.used_frames, so it treats it as a recipe property. It should be a product property (self.used_frames) because it differs for different product types within the same recipe. (I wanted to implement that in metis_cal_chophome but haven't gotten round to doing so yet).

hugobuddel · 2025-02-07T11:32:25Z

Yes that's what I said. What's the point of these issues if we don't read what the other people write?

oczoske · 2025-02-07T11:35:18Z

Where?

hugobuddel · 2025-02-07T12:31:56Z

Where?

Here:

The main reason to distinguish between "all frames" and "used frames" is when a recipe has multiple output products, and the different output products use different input.

I did not describe technically what is in the code, as I assumed that Martin knows that since Martin wrote it.

Perhaps I used a bit too many words?

Let's not argue though, and instead spend our time making things work.

oczoske · 2025-02-07T14:40:46Z

My point was just that the current implementation (supposedly already a fix) does not seem to do what you correctly say it should do.

sesquideus · 2025-02-07T15:06:06Z

@oczoske Sadly, yes.
I would recommend to

leave it like this for this release (but document that right now it indeed does not do what it should)
later explicitly mark which frames are used (possibly the Mixins can automate some of that, but otherwise it will have to be marked manually)
maybe devise something clever, a thin wrapper or what, that will track usage automatically

hugobuddel mentioned this issue Jan 29, 2025

Converted inputs to sets + fixed inputs #87

Merged

sesquideus self-assigned this Feb 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`PipelineProduct.save()`: distinguish between `allframes` and `usedframes` #79

`PipelineProduct.save()`: distinguish between `allframes` and `usedframes` #79

oczoske commented Dec 29, 2024

hugobuddel commented Dec 29, 2024

oczoske commented Dec 29, 2024

hugobuddel commented Dec 29, 2024

sesquideus commented Jan 26, 2025

sesquideus commented Jan 28, 2025

hugobuddel commented Jan 29, 2025

oczoske commented Feb 7, 2025

hugobuddel commented Feb 7, 2025

oczoske commented Feb 7, 2025

hugobuddel commented Feb 7, 2025

oczoske commented Feb 7, 2025

sesquideus commented Feb 7, 2025

PipelineProduct.save(): distinguish between allframes and usedframes #79

PipelineProduct.save(): distinguish between allframes and usedframes #79

Comments

oczoske commented Dec 29, 2024

hugobuddel commented Dec 29, 2024

oczoske commented Dec 29, 2024

hugobuddel commented Dec 29, 2024

sesquideus commented Jan 26, 2025

sesquideus commented Jan 28, 2025

hugobuddel commented Jan 29, 2025

oczoske commented Feb 7, 2025

hugobuddel commented Feb 7, 2025

oczoske commented Feb 7, 2025

hugobuddel commented Feb 7, 2025

oczoske commented Feb 7, 2025

sesquideus commented Feb 7, 2025

`PipelineProduct.save()`: distinguish between `allframes` and `usedframes` #79

`PipelineProduct.save()`: distinguish between `allframes` and `usedframes` #79