-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
U/jrbogart/config reorg #112
Conversation
Add utilities for each source type which uses its template and user arguments to generate an appropriate config fragment.
to enable new scheme where fragment of config for an object type is written following creation of its main binary file(s) Modify catalog_creator.py and sso_catalog_creator.py to use the new stuff. Rebased to current main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a few suggestions, but after getting part way through the changes, it might be better to just make a couple of general comments:
- There's a lot of code to make things backwards compatible. For both near and long term maintainability, it might be preferable not to support backwards compatibility and instead provide scripts to convert existing catalogs to the new format.
- Instead of having separate utility functions for writing the provenance for each object type, I'm wondering if each BaseObject subclass could implement its own member function to write provenance. I think that approach would avoid the code complexity that comes with the different if-elif branches in
config_utils.py
and would make it easier to add new object types in the future.
skycatalogs/utils/config_utils.py
Outdated
if self._schema_version: | ||
# cmps = self._cmps | ||
# if (cmps[0] > 1) or (cmps[0] == 1 and cmps[1] > 2): | ||
old_style = self._schema_pre_130 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
old_style
is already set on line 281. Also, is there any reason to keep the commented out lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope
skycatalogs/utils/config_utils.py
Outdated
if old_style: | ||
return self._cfg['Cosmology'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you move this if-block to the top of this function, it looks like you can omit any further references to old_style
and thereby simplify the logic for non-old_style
cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally self._schema_pre_130
wasn't pre-computed so more complicated logic was needed here. I neglected to simply here after that was changed. Will fix.
skycatalogs/utils/config_utils.py
Outdated
if 'galaxy' in self._cfg['object_types']: | ||
object_type = 'galaxy' | ||
elif 'diffsky_galaxy' in self._cfg['object_types']: | ||
object_type = 'diffsky_galaxy' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code assumes that 'galaxy'
and 'diffsky_galaxy'
types can never be used at the same time. Is that enforced somewhere? Also, there must be some cosmology assumed for SNe and AGNs. Are we not keeping track of cosmology for those object types? How, if at all, do we ensure the cosmologies are consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no AGN currently, of course. For snana-style SNe I don't see any mention of cosmology in the code. If it's needed it should probably be included in the snana section of the config (ideally to be supplied by the time domain people along with the data). Or else there should be something connecting snana to the galaxy type.
I agree something needs to be done about multiple galaxy types - either support them or consistently disallow. The strategy here is
- use caller-specified object_type if there is one
- else if
galaxy
is present, use it - else if
diffsky_galaxy
is present use it - else throw up our hands and return None since no other object_types currently have an associated cosmology
skycatalogs/utils/config_utils.py
Outdated
|
||
|
||
def get_file_metadata(fpath, key='provenance'): | ||
class ConfigWriter(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the ()
is needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved the code to create fragments to little per-object-type classes which subclass BaseConfigFragment. The subclasses are within the xxx_object.py files.
skycatalogs/utils/config_utils.py
Outdated
if not self._overwrite: | ||
try: | ||
with open(outpath, mode='x') as f: | ||
yaml.dump(input_dict, f) | ||
except FileExistsError: | ||
txt = 'write_yaml: Will not overwrite pre-existing config' | ||
self._logger.warning(txt + outpath) | ||
return | ||
else: | ||
return self.update_yaml(input_dict, outpath) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer if this function were written like this:
if self._overwrite:
return self.update_yaml(input_dict, outpath)
try:
with open(outpath, mode='x') as f:
yaml.dump(input_dict, f)
except FileExistsError:
txt = 'write_yaml: Will not overwrite pre-existing config'
self._logger.warning(txt + outpath)
return
return outpath
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed.
skycatalogs/utils/config_utils.py
Outdated
except FileExistsError: | ||
txt = 'write_yaml: Will not overwrite pre-existing config' | ||
self._logger.warning(txt + outpath) | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, the return value of write_yaml
isn't used anywhere, so it's confusing to see the exception caught here and the function returning None
when the other case are returning non-None
values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't used currently, but returning the path written seemed like it might be useful. In case of the exception nothing is written so None
is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on how a return value would be used, I would nominally expect that
try:
with open(outpath, mode='x') as f:
yaml.dump(input_dict, f)
except FileExistsError:
txt = 'write_yaml: Will not overwrite pre-existing config'
self._logger.warning(txt + outpath)
return outpath
has the desired behavior, since in both cases (i.e., with and without the exception), the path to the yaml file with the specified config is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning None is an indication (available programmatically, not just in a log message) that nothing got written
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but since nothing is using the return values, it's not clear that is or is not the right thing.
skycatalogs/utils/source_config.py
Outdated
------- | ||
A dict | ||
|
||
The resulting dict looks much like the file galaxy_template.yaml under |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to indent these lines to match the rest of the docstring.
Concerning your general comments - |
An alternative to conversion scripts would be simply to make available modern versions of the catalogs in use now. I doubt anybody aside from you and me have ever run |
I don't see how I can make use of BaseObject and subclasses as hoped. There are no such objects lying around at the time the config fragments are being written. I could make another class hierarchy for object types, but there would still need to be creation of an instance for each (or alternatively make everything static) and a mapping of name to corresponding object-type object. It wouldn't be any less code than is there now but maybe more maintainable. |
One thing that I wanted to note is that the version info for the throughputs package is recorded in the
So that info could read from that location and added to the provenance output, presumably associated with the flux files. |
Thanks for the tip about throughputs version. As I mentioned at the Thursday zoom, the plan was to include this information if possible. I had seen those version numbers in the package but didn't know which, if any, was suitable for this purpose. |
Add utility script to dump metadata from a parquet file
I've added the throughputs version to the per-file metadata as suggested. I've also added a little script to dump the provenance and updated sso file creation to include metadata (I forgot about sso in the first round). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the ConfigFragment implemetation looks much cleaner. I do have some suggestions for simplifying/clarifying further.
Regarding making sure either only diffsky or cosmoDC2 galaxies are used, but not both, or alternatively, adding a mechanism to ensure cosmologies are consistent, can we add an issue to address this so that it doesn't get dropped entirely? An issue to add cosmology info for SNe would also be good.
@@ -70,7 +74,7 @@ def load_lsst_bandpasses(): | |||
bp = bp.withZeropoint('AB') | |||
lsst_bandpasses[band] = bp | |||
|
|||
return lsst_bandpasses | |||
return lsst_bandpasses, version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll note that adding the version to the return value like this is an interface change and will break client code that calls load_lsst_bandpasses
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. load_lsst_bandpasses could instead store the version as well as the bandpasses in a separate global variable and there could be a separate public get_lsst_bandpasses_version routine.
Or I could add a keyword argument "return_version", default False, and when False keep to old interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest an internal version of the function in addition to the public interface. So change the name of the current load_lsst_bandpasses
to _load_lsst_bandpasses
, and make a new public function that just returns the dict, so something along the lines of
def load_lsst_bandpasses():
return _load_lsst_bandpasses()[0]
with open(os.path.join(bp_dir, 'version_info')) as f: | ||
version = f.read().strip() | ||
else: | ||
version = 'galsim_builtin' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in this function seems to have gotten a little unwieldy. I'd suggest simplifying it like this:
lsst_bandpasses = dict()
rubin_sim_dir = os.getenv('RUBIN_SIM_DATA_DIR', None)
bp_dir = os.path.join(rubin_sim_dir, 'throughputs', 'baseline')
if os.path.isdir(bp_dir):
with open(os.path.join(bp_dir, 'version_info')) as f:
version = f.read().strip()
else:
bp_dir = None
version = 'galsim_builtin'
<...everything below the same...>
For the special branch looking for throughputs in $HOME
, I'd suggest setting RUBIN_SIM_DATA_DIR=${HOME}/rubin_sim_data
in the user environment since this doesn't seem like a use case we need to support in general. Also, I don't see BaseObject._bp_path
used anywhere so we can just omit that line. More generally, I think there should be a warning message that the builtin galsim throughputs are being used, since silently falling over to them is probably not what users would want to happen by default. Personally, I would prefer not to have that as a option at all since those bandpasses are not kept up-to-date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concerning throughputs in $HOME
- that code has been there forever (well, at least 2 years). I don't mind getting rid of it once the installation procedure is well-documented, which is a work in progress. See https://github.com/LSSTDESC/skyCatalogs/tree/u/jrbogart/skycat_doc.
I would rather not get rid of the galsim_builtin option (all there is currently for Roman) but there should be a warning; I'll add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anybody use the $HOME
code? I've always wondered why it's there. afaik, that option isn't documented at all so I don't see why it even exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea whether anyone uses it. It's not documented in skyCatalogs documentation, but neither is anything else about finding throughputs or SEDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we definitely know that the de facto standard functionality associated with throughputs and SEDs as implemented is used. So, if it's possible to clean up this special case that no one uses, which has a preferred alternative, and which makes the code more complicated than it ultimately needs to be, I think we should do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I don't know of anyone who uses the $HOME
code, I can't be sure that no one does. I'll get rid of it after the preferred method is documented.
if self._opt_dict: | ||
opt = self._opt_dict | ||
other = dict() | ||
for key in opt: | ||
if opt[key] is not None: | ||
other[key] = opt[key] | ||
if len(other.keys()) > 0: | ||
data.update(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering why all this is necessary. First off, I think you should set self._opt_dict = {}
in the base class __init__
, since that if self._opt_dict
test will raise if the subclasses don't set that attribute. Also, self._opt_dict
is a private attribute, and you have control over what goes into it, so testing for a None
value seems unnecessary. If it's empty, i.e., if len(self._opt_dict.keys()) == 0
, then the data.update(...)
will have no effect, so that test seems unneeded. All in all, this block of code could be replaced by
data.update(self._opt_dict)
assuming you set self._opt_dict = {}
in the base class __init__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I can just use update
, at least not as the rest of the code stands. self._opt_dict
will usually include keys like
{'area_partition': None, 'data_file_type': None}
The None
should not override what's already there. So each of the subclass __init__
routines would need to exclude those things from self._opt_dict
. I chose to do it in generic_create
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell, all of the subclasses have area_partition
and data_file_type
as keyword options in their __init__
functions, so it seems those could just be moved to the base class and the logic to exclude None
put there.
for the object type. | ||
Must be implemented by subclass | ||
''' | ||
raise NotImplementedError("Must be implemented by subclass") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like you could put a default implementation here instead of requiring subclasses to implement, i.e.,
def make_fragment(self):
return self.generic_create(self)
or even put the body of generic_create
here and have subclasses that re-implement use the base class version before adding entries.
|
||
self._opt_dict = {'area_partition': area_partition, | ||
'data_file_type': data_file_type} | ||
self._cosmology = cosmology |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not include a Cosmology
entry in self._opt_dict
? i.e.,
self._opt_dict = {'area_partition': area_partition,
'data_file_type': data_file_type,
'Cosmology': cosmology}
I don't see self._cosmology
from this class used anywhere else. Then this class could use the default implementation of make_fragment
that I suggested be implemented in the base class.
skycatalogs/objects/galaxy_object.py
Outdated
self._opt_dict = {'area_partition': area_partition, | ||
'data_file_type': data_file_type} | ||
self._cosmology = cosmology | ||
self._tophat_bins = tophat_bins | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my comment in DiffskyConfigFragment
, why not just add Cosmology
and tophat
entries to self._opt_dict
:
self._opt_dict = {'area_partition': area_partition,
'data_file_type': data_file_type,
'Cosmology': cosmology,
'tophat': {'bins': tophat_bins}}
Along with the suggestion for DiffskyConfigFragment
, this would enable all subclasses to use the same make_fragment
implementation in the base class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add the 'Cosmology'
entry to self._opt_dict
but not 'tophat'
because its value in the template is dict with a couple keys other than 'bins'
which would get lost with the update.
skycatalogs/objects/star_object.py
Outdated
@@ -1,10 +1,11 @@ | |||
import os | |||
import numpy as np | |||
# import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete?
@@ -25,6 +24,9 @@ | |||
from .utils.creator_utils import make_MW_extinction_av, make_MW_extinction_rv | |||
from .objects.base_object import LSST_BANDS | |||
from .objects.base_object import ROMAN_BANDS | |||
from .objects.star_object import StarConfigFragment | |||
from .objects.galaxy_object import GalaxyConfigFragment | |||
from .objects.diffsky_object import DiffskyConfigFragment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SnanaConfigFragment
and GaiaConfigFragment
aren't used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's because skyCatalogs is not responsible for making those catalogs. Those routines are provided in case someone (or some thing) that does produce them wants to also produce a suitable yaml fragment.
object_types: iterable of (pointsource) object type names | ||
inpath string path to file | ||
silent boolean if file not found and silent is true, | ||
return None. Else raise exception |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring entry for resolve_include
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok, but there is one item from previous comments that I suggested could be cleaned up, aside from the $HOME/bp_dir
code.
if len(other.keys()) > 0: | ||
data.update(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this could just be
data.update(other)
i.e., there is no need to test that other
is not empty. fwiw, I think a more standard way to do that would be
if other:
Reorganize config file so that almost everything is stored per object type. The top level file only includes schema version and the information necessary to locate the config and data directory and a mapping of object types to the yaml files containing details about them. Since catalog files for different object types are in general made at different times, using different run options and possibly different versions of the code, provenance is stored per object type. Cosmology (for galaxy-like types) and SED information is also now stored within the per-object block.
The old object types used to represent components of galaxies (e.g.
bulge_basic
) are no longer full-fledged object types. Their information is in the parent file.The creation code handles writing of the top-level config file and the file for the object type data being generated. If the top-level file already exists because other object type data have already been created, the creation code will update the top file with an entry for the new object type. Entries for data not handled by the creation code (e.g. gaia stars; snana) must be added by hand. Examples of suitable per-object configs may be found under
skycatalogs/data/cfg_templates
The SkyCatalog class can handle old-style configs as well as those in the new (schema version 1.3.0) format.
Some notes on details of and motivation for this update can be found here.