Add generic Aggregator processor #462

Open · danielhuppmann wants to merge 14 commits into main from feature/generic-aggregator
Conversation

@danielhuppmann (Member) commented on Jan 22, 2025

This PR relates to a request by @jkikstra for a CEDS emissions processor that renames and aggregates emissions variables from IAM results to match the CEDS schema; see the related PR IAMconsortium/common-definitions#255.

The idea is to have a mapping like

```yaml
dimension: variable
aggregate:
  - Target Code:
      - Variable A
      - Variable B
```

that can then be called like

```python
ceds_mapping = AggregationMapping.from_yaml("ceds-mapping/ceds-variable-mapping.yaml")
ceds_aggregator = Aggregator(mapping=ceds_mapping)

ceds_aggregator.apply(df)
```

to apply the aggregation mapping to an IamDataFrame `df`.

The method uses the pyam `rename()` method because that actually works (with summation of duplicate rows) on all dimensions. This could later be extended to other aggregation methods, etc.
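For illustration, the summation-by-renaming behavior looks roughly like this (a sketch with made-up data; `check_duplicates=False` lets pyam merge the renamed rows by summation):

```python
import pandas as pd
import pyam

# hypothetical IamDataFrame with two variables that should be aggregated
# into a single target code (all names are illustrative only)
df = pyam.IamDataFrame(
    pd.DataFrame(
        [
            ["model_a", "scen_a", "World", "Variable A", "Mt CO2/yr", 1.0, 2.0],
            ["model_a", "scen_a", "World", "Variable B", "Mt CO2/yr", 3.0, 4.0],
        ],
        columns=["model", "scenario", "region", "variable", "unit", 2020, 2030],
    )
)

# renaming both variables to the same target merges the duplicate rows
# by summation when `check_duplicates=False`
aggregated = df.rename(
    variable={"Variable A": "Target Code", "Variable B": "Target Code"},
    check_duplicates=False,
)
```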

This could also be useful for @lisahligono as part of the scenario-transfer routine, where the rename mapping for regions is implemented as an external yaml file and then simply called as shown above as part of the processing and transfer.

@danielhuppmann (Member, Author)

Yes, I will still add tests, but I wanted to hear comments on the feature design first...

@phackstock (Contributor)

As per bilateral discussion, I'd say this is a good way forward.
I'd include the dimension attribute in the yaml file.

@dc-almeida (Collaborator)

Makes sense to me!

@danielhuppmann force-pushed the feature/generic-aggregator branch from bf621cb to 2e9a6c5 on February 7, 2025
@danielhuppmann marked this pull request as ready for review on February 11, 2025
@danielhuppmann (Member, Author)

Ready for review @phackstock, I have a few questions for your pedantic wisdom:
I defined an attribute `codes` for the `validate_with_definition()` method, but it seems I cannot use `codes` in the field validators (even with after-mode)?

Apart from that, I'm not sure about a good way to include the file name in a validation warning. Maybe we can leave that for a future clean-up and standardization effort across the package; we are probably not 100% consistent. There are also a few overlaps with the region-processing feature, but I wanted to keep the focus on this PR for now.

I will also add a "rename"-config alternative to work with what @lisahligono implemented in IAMconsortium/common-definitions#272.

I will add documentation in a follow-up PR once #470 is merged.

@phackstock (Contributor) left a comment

I would say that there's potential to simplify and shorten the code; I've left inline comments below indicating exactly where.

```python
def validate_with_definition(self, dsd: DataStructureDefinition) -> None:
    error = None
    # check for codes that are not defined in the codelists
    codelist = getattr(dsd, self.dimension, None)
```

@phackstock (Contributor): I'd let the error occur right here if the dimension is not found:

Suggested change:

```diff
-    codelist = getattr(dsd, self.dimension, None)
+    codelist = getattr(dsd, self.dimension)
```
```python
    return _codes

def validate_with_definition(self, dsd: DataStructureDefinition) -> None:
    error = None
```

@phackstock (Contributor): If you raise the error below directly, you don't need this:

Suggested change:

```diff
-    error = None
```

```python
    if codelist is None:
        error = f"Dimension '{self.dimension}' not found in DataStructureDefinition"
    elif invalid := codelist.validate_items(self.codes):
        error = (
```

@phackstock (Contributor):

Suggested change:

```diff
-        error = (
+        raise ValueError(
+            f"The following {self.dimension}s are not defined in the "
+            "DataStructureDefinition:\n - " + "\n - ".join(invalid) + "\nin " + str(self.file)
+        )
```

```python
            f"The following {self.dimension}s are not defined in the "
            "DataStructureDefinition:\n - " + "\n - ".join(invalid)
        )
    if error:
```

@phackstock (Contributor): If the error is raised directly, you don't need this anymore:

Suggested change:

```diff
-    if error:
```

```python
            "DataStructureDefinition:\n - " + "\n - ".join(invalid)
        )
    if error:
        raise ValueError(error + "\nin " + str(self.file) + "")
```

@phackstock (Contributor):

Suggested change:

```diff
-        raise ValueError(error + "\nin " + str(self.file) + "")
```
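Taken together, these suggestions would reduce the method to something like this (a sketch combining the suggested changes, not code from the PR):

```python
def validate_with_definition(self, dsd: DataStructureDefinition) -> None:
    # raises AttributeError directly if the dimension does not exist
    codelist = getattr(dsd, self.dimension)
    # raise immediately instead of collecting into an `error` variable first
    if invalid := codelist.validate_items(self.codes):
        raise ValueError(
            f"The following {self.dimension}s are not defined in the "
            "DataStructureDefinition:\n - " + "\n - ".join(invalid) + "\nin " + str(self.file)
        )
```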

```python
def _validate_items(items, info, _type):
    duplicates = [item for item, count in Counter(items).items() if count > 1]
    if duplicates:
        raise PydanticCustomError(
```

@phackstock (Contributor): Even though I introduced them, I'd actually say we should move away from using custom errors. The way I see it, they don't provide any additional benefit and just make the code harder to read. We don't catch specific errors at any higher level, and even if we did, they get wrapped in a pydantic error anyway.
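A plain ValueError would cover the same case (a sketch of that suggestion; pydantic wraps the error into a ValidationError at the model level either way):

```python
from collections import Counter

def _validate_items(items, info, _type):
    # same duplicate check as above, raising a built-in error
    # instead of a PydanticCustomError
    duplicates = [item for item, count in Counter(items).items() if count > 1]
    if duplicates:
        raise ValueError(f"Duplicate {_type}: {duplicates}")
    return items
```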

"""Aggregation or renaming of an IamDataFrame on a `dimension`"""
file: FilePath
dimension: str
mapping: list[AggregationItem]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to move to the pattern to call the keyword in the yaml file and the attribute in the python class the same thing:

Suggested change
mapping: list[AggregationItem]
aggregate: list[AggregationItem]

This is fully compatible with allowing "rename" as a keyword in the future with the use of pydantic field alias.
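In pydantic v2, that would look roughly like this (a sketch; the dict form anticipates the simplification suggested below, the alias handling is the point here):

```python
from pydantic import AliasChoices, BaseModel, Field

class AggregationMapping(BaseModel):
    dimension: str
    # accept either `aggregate` or `rename` as the keyword in the yaml file
    # while keeping a single attribute name in the Python class
    aggregate: dict[str, list[str]] = Field(
        validation_alias=AliasChoices("aggregate", "rename")
    )
```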

```python
here = Path(__file__).parent.absolute()


class AggregationItem(BaseModel):
```

@phackstock (Contributor): I'm not sure this class is actually needed. I think you could just use `mapping: dict[str, list[str]]`. That way you also don't need the `rename_mapping` property below, since you can use `mapping` directly.
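With a plain dict, the inversion to the `{source: target}` form used for renaming stays a one-liner (a sketch; `rename_mapping` is the property name mentioned above):

```python
# the dict-based field suggested above, e.g. parsed from the yaml file
mapping: dict[str, list[str]] = {"Target Code": ["Variable A", "Variable B"]}

# invert to the {source: target} form used by pyam's rename()
rename_mapping = {src: target for target, sources in mapping.items() for src in sources}
# -> {"Variable A": "Target Code", "Variable B": "Target Code"}
```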

```python
        raise ValueError(error + "\nin " + str(self.file) + "")

    @classmethod
    def from_file(cls, file: Path | str):
```

@phackstock (Contributor): If you rename the `mapping` attribute to `aggregate` and use `dict[str, list[str]]` in favor of `AggregationItem`, I think you should be able to remove this function almost entirely.
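For instance, if the yaml file has top-level `dimension` and `aggregate` keys, the parsed content could be passed to the model constructor almost directly (a sketch under that assumption, not the PR's implementation):

```python
from pathlib import Path

import yaml
from pydantic import BaseModel, FilePath

class AggregationMapping(BaseModel):
    file: FilePath
    dimension: str
    aggregate: dict[str, list[str]]

    @classmethod
    def from_file(cls, file: Path | str):
        # with `aggregate: dict[str, list[str]]`, the parsed yaml
        # maps one-to-one onto the model fields
        with open(file, "r", encoding="utf-8") as f:
            return cls(file=file, **yaml.safe_load(f))
```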

@phackstock (Contributor)

> I defined an attribute `codes` for the `validate_with_definition()` method, but it seems I cannot use `codes` in the field validators (even with after-mode)?

That might not be possible; I'll play around with the code a bit more to see.
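As far as I understand pydantic v2, a field validator only sees the field value plus previously validated fields via `info.data`, so a derived attribute like `codes` is out of reach there; a `model_validator(mode="after")` runs on the fully constructed instance and can use it (a sketch with hypothetical names):

```python
from pydantic import BaseModel, model_validator

class AggregationMapping(BaseModel):
    dimension: str
    aggregate: dict[str, list[str]]

    @property
    def codes(self) -> list[str]:
        # all source codes across the aggregation targets
        return [code for codes in self.aggregate.values() for code in codes]

    @model_validator(mode="after")
    def _check_codes(self):
        # `self.codes` is available here, unlike in a field validator
        if len(set(self.codes)) != len(self.codes):
            raise ValueError("Duplicate codes in aggregation mapping")
        return self
```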

@phackstock (Contributor) commented on the diff:

Before I forget: I'd suggest renaming the file to aggregator.py.
