Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ECMWF: add realtime and restructure #37

Draft
wants to merge 40 commits into
base: develop
Choose a base branch
from

Conversation

Tinkaa
Copy link
Contributor

@Tinkaa Tinkaa commented Dec 9, 2021

I added functionality to process the seasonal forecasts with us that are shared in realtime and combine this with the API data.

In that process I also ended up restructuring the API code and renaming files.

It is not fully finished (some todos and questions are marked) but a step in a good direction imo.

@Tinkaa Tinkaa requested a review from turnerm December 9, 2021 08:28
@Tinkaa Tinkaa marked this pull request as draft December 9, 2021 15:48
@Tinkaa Tinkaa changed the base branch from feature/ecmwf-seas-download-monica to develop December 9, 2021 16:19
@Tinkaa
Copy link
Contributor Author

Tinkaa commented Dec 14, 2021

@turnerm it is now ready for review :) Attempted to implement the class approach but struggling a bit with how to combine the data from the realtime and api the best. So added a few questions and that is why I left it in draft for now (plus tests still have to be added).

Copy link
Member

@turnerm turnerm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, I will say that it was a joy to review this PR. Such clear and beautiful code, amazing work Tinka!

There are lots of comments but a good chunk are pretty nit-picky. The three major comments are:

  • Using pipeline, especially for the realtime class that has Malawi specific paremeters which should go in the country file
  • The combination class needs some rework, and I'm wondering if it belongs as a class or just a separate method
  • Unit tests: I just want to get in the habit of including tests with the code rather than waiting until later where they might be forgotten. Not sure how urgent this code is to get up and running though!

src/aatoolbox/utils/geoboundingbox.py Show resolved Hide resolved
src/aatoolbox/datasources/ecmwf/api_seas5_monthly.py Outdated Show resolved Hide resolved
src/aatoolbox/datasources/ecmwf/api_seas5_monthly.py Outdated Show resolved Hide resolved
src/aatoolbox/datasources/ecmwf/api_seas5_monthly.py Outdated Show resolved Hide resolved
src/aatoolbox/datasources/ecmwf/api_seas5_monthly.py Outdated Show resolved Hide resolved
src/aatoolbox/datasources/ecmwf/realtime_seas5_monthly.py Outdated Show resolved Hide resolved
@@ -0,0 +1,103 @@
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering: Should this module be in the datasources class, or in our (perhaps to be differently named) aa-toolbox.analysis subpackage (where all postprocessing methods will go).

Is this something that we will likely want to do for many different countries?

How long does it take to run? Does the output file really need to be saved? Could it be saved in exploration instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why it is important and needs to be saved is because in pa-anticipatory-action we use it again. But there we also have the CDS source so the user can choose whether to use this data or the cds data. If choosing to use directly the ecmwf data it makes sense imo that you get one dataset that contains the combined api and realtime data. Would you agree?

It doesn't take super long to run so maybe somehow we don't have to save it if we port everything to aa-toolbox but not sure if that will work

Would probably run it for each country that uses the ecmwf data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get one dataset that contains the combined api and realtime data

I guess my hesitation is that it doesn't really fit into the paradigm of what datasource is supposed to be, i.e. the download, cleaning, and loading of data from a single source. The combination of sources feels more like post-processing to me. Do you think the results could be saved in exploration?

Uff these edge cases are always the hardest 😅 I will think about it a bit more.

Path to processed NetCDF file

"""
ecmwf_rt = EcmwfRealtime(iso3=self._iso3, geobb=self._geobb)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not call the classes here, but rather pass their outputs as input paremeters. This can be handled by the pipeline

self._geobb = GeoBoundingBox(north=90, south=-90, east=0, west=360)
# round coordinates to correspond with the grid ecmwf publishes
# its data on
self._geobb.round_coords(round_val=GRID_RESOLUTION)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also not set this but rather pass it in from the other ECMWF class

GRID_RESOLUTION = 0.4 # degrees


# question: should Ecmwf be a super class of EcmwfApi and EcmwfRealtime?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because most of the methods would just override each other from the two classes since they all have the same names (almost). Multiple inheritance is more useful (I think) if you have two very different classes.

As mentioned above, I'm not sure this makese sense as a class, but I'd be interested to see how it looks once you've made the changes I suggested below.

@@ -89,6 +89,7 @@ def round_boundingbox_coords(
elif direction in ("south", "west"):
function = np.floor.__call__ # needed for mypy
offset_factor = -1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arghh the rounded_coord has to deal with floating inprecision. So we get outputs like 32.80000000000000182076576039 while we would like just 32.8. This is problematic cause when round_coords() is then called twice, it rounds up again which we wouldn't want.

@turnerm Do you know a solution for this? The internet seems to suggest using Decimal() but wondering if you have a better idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes I think using decimal would make a lot of sense here. Furthermore, it might be good to make it only possible to run round_coords once (how is it getting run twice?). But this could be moved to an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 it is being run twice due to the lovely combination class but even then I think it would be good to have a check that it isn't accidentally ran twice in some way. Do you know how we could do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could make an instance attributed called _is_rounded that is set to True once the coords have been rounded, and then when running round_coords just check this attribute and exit out if it's True

@Tinkaa Tinkaa mentioned this pull request Mar 1, 2022

def download(
self,
min_date: Union[str, date] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
min_date: Union[str, date] = None,
min_date: Union[str, date, None] = None,

def download(
self,
min_date: Union[str, date] = None,
max_date: Union[str, date] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
max_date: Union[str, date] = None,
max_date: Union[str, date, None] = None,

"""
return xr.load_dataset(self._get_processed_path())

def _get_raw_path(self, date_forecast: Union[pd.Timestamp, None]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _get_raw_path(self, date_forecast: Union[pd.Timestamp, None]):
def _get_raw_path(self, date_forecast: Optional[pd.Timestamp]):

import logging
from datetime import date
from pathlib import Path
from typing import Union
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from typing import Union
from typing import Optional, Union

)
# the geo_bounding_box indicates the boundaries for which data is
# downloaded and processed
if type(geo_bounding_box) == gpd.GeoDataFrame:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if type(geo_bounding_box) == gpd.GeoDataFrame:
if isinstance(geo_bounding_box, gpd.GeoDataFrame):

@@ -101,3 +105,101 @@ def _load_codab(self, layer_name: str, clobber: bool):
clobber=clobber,
)
return self._codab.load_admin_layer(layer_name=layer_name)

class Coordinates(TypedDict):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also use a NamedTuple here

@@ -1,8 +1,12 @@
"""Pipeline initializer."""
from typing import TypedDict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from typing import TypedDict
from typing import Optional, TypedDict

self,
from_codab: bool = True,
from_config: bool = False,
coordinates: Coordinates = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
coordinates: Coordinates = None,
coordinates: Optional[Coordinates] = None,

from_codab: bool = True,
from_config: bool = False,
coordinates: Coordinates = None,
):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
):
) -> GeoBoundingBox:

"None of the options given to retrieve a geoboundingbox."
)

def load_ecmwf_realtime(self, process: bool = False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def load_ecmwf_realtime(self, process: bool = False):
def load_ecmwf_realtime(self, process: bool = False) -> xr.Dataset:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants