Add generic Aggregator processor #462
base: main
Conversation
Yes, I will still add tests but wanted to hear any comments about the feature design first.

As per bilateral discussion, I'd say this is a good way forward.

Makes sense to me!
Force-pushed from bf621cb to 2e9a6c5
Ready for review @phackstock, I have a few questions for your pedantic wisdom. Apart from that, I'm not sure about a good way to include the file name in a validation warning, but maybe we can leave that for a future clean-up and standardization effort across the package; we are probably not 100% consistent. There are also a few overlaps with the region-processing feature, but I wanted to focus on this PR for now. I will also add a "rename" config alternative to work with what @lisahligono implemented in IAMconsortium/common-definitions#272. I will add documentation in a follow-up PR once #470 is merged.
I would say that there's potential to simplify and shorten the code. I have left some comments inline showing where exactly to do so.
```python
def validate_with_definition(self, dsd: DataStructureDefinition) -> None:
    error = None
    # check for codes that are not defined in the codelists
    codelist = getattr(dsd, self.dimension, None)
```
I'd let the error occur right here if the dimension is not found:

```diff
-codelist = getattr(dsd, self.dimension, None)
+codelist = getattr(dsd, self.dimension)
```
```python
    return _codes


def validate_with_definition(self, dsd: DataStructureDefinition) -> None:
    error = None
```
If you raise the error below directly, you don't need this:

```diff
-error = None
```
```python
if codelist is None:
    error = f"Dimension '{self.dimension}' not found in DataStructureDefinition"
elif invalid := codelist.validate_items(self.codes):
    error = (
        f"The following {self.dimension}s are not defined in the "
        "DataStructureDefinition:\n - " + "\n - ".join(invalid)
    )
if error:
```

```diff
-error = (
-    f"The following {self.dimension}s are not defined in the "
-    "DataStructureDefinition:\n - " + "\n - ".join(invalid)
-)
+raise ValueError(
+    f"The following {self.dimension}s are not defined in the "
+    "DataStructureDefinition:\n - " + "\n - ".join(invalid)
+    + "\nin " + str(self.file)
+)
```
If the error is raised directly, you don't need this anymore:

```diff
-if error:
```
```python
        "DataStructureDefinition:\n - " + "\n - ".join(invalid)
    )
if error:
    raise ValueError(error + "\nin " + str(self.file) + "")
```
```diff
-raise ValueError(error + "\nin " + str(self.file) + "")
```
```python
def _validate_items(items, info, _type):
    duplicates = [item for item, count in Counter(items).items() if count > 1]
    if duplicates:
        raise PydanticCustomError(
```
Even though I introduced them, I'd actually say we should move away from using custom errors. The way I see it, they don't really provide any additional benefit and just make the code harder to read. We don't catch specific errors at any higher level, and even if we did, they get wrapped in some pydantic error anyway.
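A hedged sketch of the point above, assuming pydantic v2: a plain `ValueError` raised inside a validator gets wrapped in a pydantic `ValidationError` anyway, so a custom error class adds little. The model and field names here are hypothetical, not the PR's actual code.

```python
from pydantic import BaseModel, ValidationError, field_validator


class CodeList(BaseModel):
    codes: list[str]

    @field_validator("codes")
    @classmethod
    def no_duplicates(cls, codes: list[str]) -> list[str]:
        if len(set(codes)) != len(codes):
            # a plain ValueError is enough here
            raise ValueError("duplicate codes")
        return codes


try:
    CodeList(codes=["a", "a"])
except ValidationError as exc:
    # the plain ValueError surfaces inside a pydantic ValidationError either way
    pass
```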
```python
"""Aggregation or renaming of an IamDataFrame on a `dimension`"""

file: FilePath
dimension: str
mapping: list[AggregationItem]
```
I'd suggest moving to the pattern of calling the keyword in the yaml file and the attribute in the python class the same thing:

```diff
-mapping: list[AggregationItem]
+aggregate: list[AggregationItem]
```

This is fully compatible with allowing "rename" as a keyword in the future with the use of a pydantic field alias.
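A minimal sketch of that field-alias pattern, assuming pydantic v2; the class name, field type, and example value are hypothetical, not the PR's actual code.

```python
from pydantic import AliasChoices, BaseModel, Field


class AggregatorConfig(BaseModel):
    # the python attribute matches the yaml keyword "aggregate"; "rename"
    # is accepted as an alternative keyword via a validation alias
    aggregate: list[str] = Field(validation_alias=AliasChoices("aggregate", "rename"))


# both spellings of the keyword parse into the same attribute
cfg = AggregatorConfig.model_validate({"rename": ["Emissions|CO2"]})
```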
```python
here = Path(__file__).parent.absolute()


class AggregationItem(BaseModel):
```
I'm not sure this class is actually needed. I think you could just use `mapping: dict[str, list[str]]`. This way you also don't need the `rename_mapping` property below, since you can use `mapping` directly.
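A hedged sketch of that suggested simplification, assuming pydantic v2; the class name and example codes are hypothetical. The inverse source-to-target view is shown only to illustrate that, if still needed, it is a one-liner over the dict.

```python
from pydantic import BaseModel


class Aggregator(BaseModel):
    # target code -> list of source codes, replacing a list of AggregationItem
    mapping: dict[str, list[str]]

    @property
    def rename_mapping(self) -> dict[str, str]:
        # inverse view: each source code points to its aggregate target
        return {src: tgt for tgt, srcs in self.mapping.items() for src in srcs}


agg = Aggregator.model_validate({"mapping": {"Total": ["CO2", "CH4"]}})
```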
```python
    raise ValueError(error + "\nin " + str(self.file) + "")


@classmethod
def from_file(cls, file: Path | str):
```
If you rename the `mapping` attribute to `aggregate` and use `dict[str, list[str]]` in favor of `AggregationItem`, I think you should be able to remove this function almost entirely. That might not be possible; I'll play around with the code a bit more to see.
Before I forget it, I'd suggest renaming the module to `aggregator.py`.
This PR is in relation to a request by @jkikstra for a CEDS emissions processor that renames and aggregates emissions variables from IAM results to match the CEDS schema; see the related PR at IAMconsortium/common-definitions#255.
The idea is to have a mapping, defined in a yaml file, that can then be called to apply the aggregation-mapping onto an IamDataFrame `df`. The method uses the pyam `rename()` method because that actually works (with summation) on all dimensions. This could be extended later for different aggregation methods, etc. This could be useful for @lisahligono as part of the scenario-transfer routine, where the rename-mapping for regions is implemented as an external yaml file and then just called as part of the processing-and-transfer.
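A stdlib-only sketch of the rename-with-summation semantics described above; the mapping and timeseries values are hypothetical, and the real processor would go through pyam's `rename()` instead.

```python
from collections import defaultdict

# hypothetical aggregation mapping: target variable -> source variables
mapping = {"Emissions|Total": ["Emissions|CO2", "Emissions|CH4"]}

# invert to the source -> target form a rename call expects
rename = {src: tgt for tgt, srcs in mapping.items() for src in srcs}

# toy timeseries: (variable, year) -> value
data = {("Emissions|CO2", 2020): 10.0, ("Emissions|CH4", 2020): 2.5}

# renaming several sources to one target sums the colliding rows
aggregated: dict[tuple[str, int], float] = defaultdict(float)
for (variable, year), value in data.items():
    aggregated[(rename.get(variable, variable), year)] += value
```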