U/jrbogart/config reorg #112

JoanneBogart · 2024-07-15T18:40:24Z

Reorganize config file so that almost everything is stored per object type. The top level file only includes schema version and the information necessary to locate the config and data directory and a mapping of object types to the yaml files containing details about them. Since catalog files for different object types are in general made at different times, using different run options and possibly different versions of the code, provenance is stored per object type. Cosmology (for galaxy-like types) and SED information is also now stored within the per-object block.

The old object types used to represent components of galaxies (e.g. bulge_basic) are no longer full-fledged object types. Their information is in the parent file.

The creation code handles writing of the top-level config file and the file for the object type data being generated. If the top-level file already exists because other object type data have already been created, the creation code will update the top file with an entry for the new object type. Entries for data not handled by the creation code (e.g. gaia stars; snana) must be added by hand. Examples of suitable per-object configs may be found under skycatalogs/data/cfg_templates

The SkyCatalog class can handle old-style configs as well as those in the new (schema version 1.3.0) format.

Some notes on details of and motivation for this update can be found here.

Add utilities for each source type which uses its template and user arguments to generate an appropriate config fragment.

to enable new scheme where fragment of config for an object type is written following creation of its main binary file(s) Modify catalog_creator.py and sso_catalog_creator.py to use the new stuff. Rebased to current main

… type = galaxy

…eady exists

…ing needed

jchiang87

I've made a few suggestions, but after getting part way through the changes, it might be better to just make a couple of general comments:

There's a lot of code to make things backwards compatible. For both near and long term maintainability, it might be preferable not to support backwards compatibility and instead provide scripts to convert existing catalogs to the new format.
Instead of having separate utility functions for writing the provenance for each object type, I'm wondering if each BaseObject subclass could implement its own member function to write provenance. I think that approach would avoid the code complexity that comes with the different if-elif branches in config_utils.py and would make it easier to add new object types in the future.

jchiang87 · 2024-07-16T15:20:37Z

skycatalogs/utils/config_utils.py

+        if self._schema_version:
+            # cmps = self._cmps
+            # if (cmps[0] > 1) or (cmps[0] == 1 and cmps[1] > 2):
+            old_style = self._schema_pre_130


old_style is already set on line 281. Also, is there any reason to keep the commented out lines?

jchiang87 · 2024-07-16T15:28:16Z

skycatalogs/utils/config_utils.py

+        if old_style:
+            return self._cfg['Cosmology']


If you move this if-block to the top of this function, it looks like you can omit any further references to old_style and thereby simplify the logic for non-old_style cases.

Originally self._schema_pre_130 wasn't pre-computed so more complicated logic was needed here. I neglected to simply here after that was changed. Will fix.

jchiang87 · 2024-07-16T15:33:10Z

skycatalogs/utils/config_utils.py

+                    if 'galaxy' in self._cfg['object_types']:
+                        object_type = 'galaxy'
+                    elif 'diffsky_galaxy' in self._cfg['object_types']:
+                        object_type = 'diffsky_galaxy'


This code assumes that 'galaxy' and 'diffsky_galaxy' types can never be used at the same time. Is that enforced somewhere? Also, there must be some cosmology assumed for SNe and AGNs. Are we not keeping track of cosmology for those object types? How, if at all, do we ensure the cosmologies are consistent?

There are no AGN currently, of course. For snana-style SNe I don't see any mention of cosmology in the code. If it's needed it should probably be included in the snana section of the config (ideally to be supplied by the time domain people along with the data). Or else there should be something connecting snana to the galaxy type.

I agree something needs to be done about multiple galaxy types - either support them or consistently disallow. The strategy here is

use caller-specified object_type if there is one

else if galaxy is present, use it

else if diffsky_galaxy is present use it

else throw up our hands and return None since no other object_types currently have an associated cosmology

jchiang87 · 2024-07-16T15:41:59Z

skycatalogs/utils/config_utils.py



-def get_file_metadata(fpath, key='provenance'):
+class ConfigWriter():


I don't think the () is needed here.

I've moved the code to create fragments to little per-object-type classes which subclass BaseConfigFragment. The subclasses are within the xxx_object.py files.

jchiang87 · 2024-07-16T16:09:01Z

skycatalogs/utils/config_utils.py

+        if not self._overwrite:
+            try:
+                with open(outpath, mode='x') as f:
+                    yaml.dump(input_dict, f)
+            except FileExistsError:
+                txt = 'write_yaml: Will not overwrite pre-existing config'
+                self._logger.warning(txt + outpath)
+                return
+        else:
+            return self.update_yaml(input_dict, outpath)



I think it would be clearer if this function were written like this:

if self._overwrite: return self.update_yaml(input_dict, outpath) try: with open(outpath, mode='x') as f: yaml.dump(input_dict, f) except FileExistsError: txt = 'write_yaml: Will not overwrite pre-existing config' self._logger.warning(txt + outpath) return return outpath

jchiang87 · 2024-07-16T16:13:46Z

skycatalogs/utils/config_utils.py

+            except FileExistsError:
+                txt = 'write_yaml: Will not overwrite pre-existing config'
+                self._logger.warning(txt + outpath)
+                return


As far as I can tell, the return value of write_yaml isn't used anywhere, so it's confusing to see the exception caught here and the function returning None when the other case are returning non-None values.

It isn't used currently, but returning the path written seemed like it might be useful. In case of the exception nothing is written so None is returned.

Depending on how a return value would be used, I would nominally expect that

try: with open(outpath, mode='x') as f: yaml.dump(input_dict, f) except FileExistsError: txt = 'write_yaml: Will not overwrite pre-existing config' self._logger.warning(txt + outpath) return outpath

has the desired behavior, since in both cases (i.e., with and without the exception), the path to the yaml file with the specified config is returned.

Returning None is an indication (available programmatically, not just in a log message) that nothing got written

Sure, but since nothing is using the return values, it's not clear that is or is not the right thing.

jchiang87 · 2024-07-16T16:31:19Z

skycatalogs/utils/source_config.py

+    -------
+    A dict
+
+The resulting dict looks much like the file galaxy_template.yaml under


It would be good to indent these lines to match the rest of the docstring.

JoanneBogart · 2024-07-17T19:18:22Z

Concerning your general comments -
Providing scripts to convert older configs to the current schema would no doubt be useful. It would also take some work, significantly more than went into the code providing backward compatibility since there are plenty of differences in form which don't affect the operation of the API. Also, when there is a change in structure like the one proposed here, there could be some awkwardness for people running an older version of the code and trying to access a catalog with a new-style config. The old code can't warn them what the problem is. After a while presumably everyone would know to use a newer version of the code, but during the transition it could be frustrating. I propose that the code for this release should have backwards-compatibility more or less as currently implemented (taking your suggestions into account). Since I don't anticipate changing the location of schema_version, now a top-level key, ever again, I can put something in the code to check that version against what's in the code and put out a message if it's newer, so in the future there will be a warning if there is any kind of mismatch.
There should also be a script handling conversion when the schema changes, but I'd like to deal with that in a separate PR.
Moving the provenance-generating code to BaseObject and subclasses seems promising. I'll try to get that in this PR if I don't run into any serious roadblocks.

jchiang87 · 2024-07-17T19:23:37Z

An alternative to conversion scripts would be simply to make available modern versions of the catalogs in use now. I doubt anybody aside from you and me have ever run create_sc.py. My worry about the backwards compatibility code in there now is that it will be technical debt that never gets retired. Also, because I had hoped it would go away, I didn't look as carefully as I would have had otherwise.

JoanneBogart · 2024-07-18T00:30:45Z

I don't see how I can make use of BaseObject and subclasses as hoped. There are no such objects lying around at the time the config fragments are being written. I could make another class hierarchy for object types, but there would still need to be creation of an instance for each (or alternatively make everything static) and a mapping of name to corresponding object-type object. It wouldn't be any less code than is there now but maybe more maintainable.

jchiang87 · 2024-07-22T17:33:12Z

One thing that I wanted to note is that the version info for the throughputs package is recorded in the baseline subfolder, e.g., for the descssim location at NERSC, there is

[login33 w_2024_22] pwd -P
/global/cfs/cdirs/descssim/imSim/lsst/data/throughputs_2023_09_07
[login33 w_2024_22] cat baseline/version_info 
1.9
[login33 w_2024_22]

So that info could read from that location and added to the provenance output, presumably associated with the flux files.

JoanneBogart · 2024-07-22T22:30:01Z

Thanks for the tip about throughputs version. As I mentioned at the Thursday zoom, the plan was to include this information if possible. I had seen those version numbers in the package but didn't know which, if any, was suitable for this purpose.

Add utility script to dump metadata from a parquet file

JoanneBogart · 2024-07-24T00:50:11Z

I've added the throughputs version to the per-file metadata as suggested. I've also added a little script to dump the provenance and updated sso file creation to include metadata (I forgot about sso in the first round).
The PR is ready for re-review.

jchiang87

Thanks, the ConfigFragment implemetation looks much cleaner. I do have some suggestions for simplifying/clarifying further.

Regarding making sure either only diffsky or cosmoDC2 galaxies are used, but not both, or alternatively, adding a mechanism to ensure cosmologies are consistent, can we add an issue to address this so that it doesn't get dropped entirely? An issue to add cosmology info for SNe would also be good.

jchiang87 · 2024-07-31T15:19:51Z

skycatalogs/objects/base_object.py

@@ -70,7 +74,7 @@ def load_lsst_bandpasses():
        bp = bp.withZeropoint('AB')
        lsst_bandpasses[band] = bp

-    return lsst_bandpasses
+    return lsst_bandpasses, version


I'll note that adding the version to the return value like this is an interface change and will break client code that calls load_lsst_bandpasses.

True. load_lsst_bandpasses could instead store the version as well as the bandpasses in a separate global variable and there could be a separate public get_lsst_bandpasses_version routine.
Or I could add a keyword argument "return_version", default False, and when False keep to old interface.

I suggest an internal version of the function in addition to the public interface. So change the name of the current load_lsst_bandpasses to _load_lsst_bandpasses, and make a new public function that just returns the dict, so something along the lines of

def load_lsst_bandpasses(): return _load_lsst_bandpasses()[0]

jchiang87 · 2024-07-31T15:45:38Z

skycatalogs/objects/base_object.py

+        with open(os.path.join(bp_dir, 'version_info')) as f:
+            version = f.read().strip()
+    else:
+        version = 'galsim_builtin'


The logic in this function seems to have gotten a little unwieldy. I'd suggest simplifying it like this:

lsst_bandpasses = dict() rubin_sim_dir = os.getenv('RUBIN_SIM_DATA_DIR', None) bp_dir = os.path.join(rubin_sim_dir, 'throughputs', 'baseline') if os.path.isdir(bp_dir): with open(os.path.join(bp_dir, 'version_info')) as f: version = f.read().strip() else: bp_dir = None version = 'galsim_builtin' <...everything below the same...>

For the special branch looking for throughputs in $HOME, I'd suggest setting RUBIN_SIM_DATA_DIR=${HOME}/rubin_sim_data in the user environment since this doesn't seem like a use case we need to support in general. Also, I don't see BaseObject._bp_path used anywhere so we can just omit that line. More generally, I think there should be a warning message that the builtin galsim throughputs are being used, since silently falling over to them is probably not what users would want to happen by default. Personally, I would prefer not to have that as a option at all since those bandpasses are not kept up-to-date.

Concerning throughputs in $HOME - that code has been there forever (well, at least 2 years). I don't mind getting rid of it once the installation procedure is well-documented, which is a work in progress. See https://github.com/LSSTDESC/skyCatalogs/tree/u/jrbogart/skycat_doc.
I would rather not get rid of the galsim_builtin option (all there is currently for Roman) but there should be a warning; I'll add that.

Does anybody use the $HOME code? I've always wondered why it's there. afaik, that option isn't documented at all so I don't see why it even exists.

I have no idea whether anyone uses it. It's not documented in skyCatalogs documentation, but neither is anything else about finding throughputs or SEDs.

But we definitely know that the de facto standard functionality associated with throughputs and SEDs as implemented is used. So, if it's possible to clean up this special case that no one uses, which has a preferred alternative, and which makes the code more complicated than it ultimately needs to be, I think we should do that.

Although I don't know of anyone who uses the $HOME code, I can't be sure that no one does. I'll get rid of it after the preferred method is documented.

jchiang87 · 2024-07-31T16:26:59Z

skycatalogs/objects/base_config_fragment.py

+        if self._opt_dict:
+            opt = self._opt_dict
+            other = dict()
+            for key in opt:
+                if opt[key] is not None:
+                    other[key] = opt[key]
+            if len(other.keys()) > 0:
+                data.update(other)


I'm wondering why all this is necessary. First off, I think you should set self._opt_dict = {} in the base class __init__, since that if self._opt_dict test will raise if the subclasses don't set that attribute. Also, self._opt_dict is a private attribute, and you have control over what goes into it, so testing for a None value seems unnecessary. If it's empty, i.e., if len(self._opt_dict.keys()) == 0, then the data.update(...) will have no effect, so that test seems unneeded. All in all, this block of code could be replaced by

data.update(self._opt_dict)

assuming you set self._opt_dict = {} in the base class __init__.

I don't think I can just use update, at least not as the rest of the code stands. self._opt_dict will usually include keys like
{'area_partition': None, 'data_file_type': None}
The None should not override what's already there. So each of the subclass __init__ routines would need to exclude those things from self._opt_dict. I chose to do it in generic_create instead.

From what I can tell, all of the subclasses have area_partition and data_file_type as keyword options in their __init__ functions, so it seems those could just be moved to the base class and the logic to exclude None put there.

jchiang87 · 2024-07-31T16:50:59Z

skycatalogs/objects/base_config_fragment.py

+        for the object type.
+        Must be implemented by subclass
+        '''
+        raise NotImplementedError("Must be implemented by subclass")


It seems like you could put a default implementation here instead of requiring subclasses to implement, i.e.,

def make_fragment(self): return self.generic_create(self)

or even put the body of generic_create here and have subclasses that re-implement use the base class version before adding entries.

jchiang87 · 2024-07-31T16:57:32Z

skycatalogs/objects/diffsky_object.py

+
+        self._opt_dict = {'area_partition': area_partition,
+                          'data_file_type': data_file_type}
+        self._cosmology = cosmology


Why not include a Cosmology entry in self._opt_dict? i.e.,

self._opt_dict = {'area_partition': area_partition, 'data_file_type': data_file_type, 'Cosmology': cosmology}

I don't see self._cosmology from this class used anywhere else. Then this class could use the default implementation of make_fragment that I suggested be implemented in the base class.

jchiang87 · 2024-07-31T17:01:03Z

skycatalogs/objects/galaxy_object.py

+        self._opt_dict = {'area_partition': area_partition,
+                          'data_file_type': data_file_type}
+        self._cosmology = cosmology
+        self._tophat_bins = tophat_bins
+


Similar to my comment in DiffskyConfigFragment, why not just add Cosmology and tophat entries to self._opt_dict:

self._opt_dict = {'area_partition': area_partition, 'data_file_type': data_file_type, 'Cosmology': cosmology, 'tophat': {'bins': tophat_bins}}

Along with the suggestion for DiffskyConfigFragment, this would enable all subclasses to use the same make_fragment implementation in the base class.

I can add the 'Cosmology' entry to self._opt_dict but not 'tophat' because its value in the template is dict with a couple keys other than 'bins' which would get lost with the update.

jchiang87 · 2024-07-31T17:02:09Z

skycatalogs/objects/star_object.py

@@ -1,10 +1,11 @@
 import os
-import numpy as np
+# import numpy as np


jchiang87 · 2024-07-31T17:28:31Z

skycatalogs/catalog_creator.py

@@ -25,6 +24,9 @@
 from .utils.creator_utils import make_MW_extinction_av, make_MW_extinction_rv
 from .objects.base_object import LSST_BANDS
 from .objects.base_object import ROMAN_BANDS
+from .objects.star_object import StarConfigFragment
+from .objects.galaxy_object import GalaxyConfigFragment
+from .objects.diffsky_object import DiffskyConfigFragment


SnanaConfigFragment and GaiaConfigFragment aren't used anywhere?

No, that's because skyCatalogs is not responsible for making those catalogs. Those routines are provided in case someone (or some thing) that does produce them wants to also produce a suitable yaml fragment.

jchiang87 · 2024-07-31T17:44:26Z

skycatalogs/utils/config_utils.py

-    object_types: iterable of (pointsource) object type names
+    inpath     string            path to file
+    silent     boolean           if file not found and silent is true,
+                                 return None.  Else raise exception


Please add a docstring entry for resolve_include.

jchiang87

Looks ok, but there is one item from previous comments that I suggested could be cleaned up, aside from the $HOME/bp_dir code.

jchiang87 · 2024-08-02T14:49:59Z

skycatalogs/objects/base_config_fragment.py

+        if len(other.keys()) > 0:
+            data.update(other)


I still think this could just be

data.update(other)

i.e., there is no need to test that other is not empty. fwiw, I think a more standard way to do that would be

if other:

JoanneBogart added 13 commits July 11, 2024 11:09

Add template config files for each supported source type.

f03cfda

Add utilities for each source type which uses its template and user arguments to generate an appropriate config fragment.

various bug fixes. Can now write config fragment, top file for object…

3f82bc0

… type = galaxy

fix rebase missteps

feb5643

add routine update_yaml to unconditionally write file, even if it alr…

9130d53

…eady exists

reorganization of config essentially working; more polishing and test…

3d9ddad

…ing needed

bug fix needed for schema backward-compatibility

88e306b

Add/improve docstrings; eliminate unused code

0ceeb69

minor fixes so old test programs will run

52b2636

centralize handling of (config) schema version in Config class

4572326

Add CI test for new arrangement of config file

20549e2

Rename test module

25c69e9

make test names unique

27c9683

JoanneBogart requested a review from jchiang87 July 15, 2024 19:33

JoanneBogart added 2 commits July 15, 2024 12:54

sso code needed a minor update for new config system

e43dc18

sso updates for new config arrangement

bca6a51

JoanneBogart marked this pull request as draft July 15, 2024 22:36

updates for SSO. Also centralize "old style" handling

c774e56

JoanneBogart marked this pull request as ready for review July 15, 2024 23:17

jchiang87 requested changes Jul 16, 2024

View reviewed changes

address some reviewer comments

06c6f6c

reorganize creation of yaml fragments for greater maintainability

6da4f96

JoanneBogart requested a review from jchiang87 July 19, 2024 23:58

JoanneBogart added 2 commits July 22, 2024 17:22

Add code to retrieve throughputs versions, write to flux file metadata

d553ccf

Add utility script to dump metadata from a parquet file

sso binary output should also have metadata

23a74ae

fix reference to throughputs version

820aa6c

jchiang87 requested changes Jul 31, 2024

View reviewed changes

JoanneBogart mentioned this pull request Jul 31, 2024

Better provenance tracking #64

Closed

address reviewer comments

fc25144

JoanneBogart requested a review from jchiang87 August 2, 2024 00:45

jchiang87 approved these changes Aug 2, 2024

View reviewed changes

one last bit of clean-up

01a5309

JoanneBogart merged commit 192da85 into main Aug 2, 2024
1 check passed

JoanneBogart deleted the u/jrbogart/config_reorg branch August 2, 2024 18:25

This was referenced Aug 7, 2024

Keep most config information (input to skyCatalogs API) partitioned into separate files by source #76

Closed

Config schema version #68

Closed



		def get_file_metadata(fpath, key='provenance'):
		class ConfigWriter():

U/jrbogart/config reorg #112

U/jrbogart/config reorg #112

Conversation

JoanneBogart commented Jul 15, 2024 • edited Loading

jchiang87 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JoanneBogart Jul 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jchiang87 Jul 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JoanneBogart commented Jul 17, 2024

jchiang87 commented Jul 17, 2024

JoanneBogart commented Jul 18, 2024

jchiang87 commented Jul 22, 2024

JoanneBogart commented Jul 22, 2024

JoanneBogart commented Jul 24, 2024

jchiang87 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JoanneBogart Jul 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jchiang87 Jul 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jchiang87 Jul 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jchiang87 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JoanneBogart commented Jul 15, 2024 •

edited

Loading

JoanneBogart Jul 17, 2024 •

edited

Loading

jchiang87 Jul 16, 2024 •

edited

Loading

JoanneBogart Jul 31, 2024 •

edited

Loading

jchiang87 Jul 31, 2024 •

edited

Loading

jchiang87 Jul 31, 2024 •

edited

Loading