-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ECMWF: add realtime and restructure #37
base: develop
Are you sure you want to change the base?
Conversation
…DAP/pa-aa-tools into feature/ecmwf-seas-download-monica
@turnerm it is now ready for review :) Attempted to implement the class approach but struggling a bit with how to combine the data from the realtime and api the best. So added a few questions and that is why I left it in draft for now (plus tests still have to be added). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, I will say that it was a joy to review this PR. Such clear and beautiful code, amazing work Tinka!
There are lots of comments but a good chunk are pretty nit-picky. The three major comments are:
- Using
pipeline
, especially for the realtime class that has Malawi specific paremeters which should go in the country file - The combination class needs some rework, and I'm wondering if it belongs as a class or just a separate method
- Unit tests: I just want to get in the habit of including tests with the code rather than waiting until later where they might be forgotten. Not sure how urgent this code is to get up and running though!
@@ -0,0 +1,103 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering: Should this module be in the datasources class, or in our (perhaps to be differently named) aa-toolbox.analysis
subpackage (where all postprocessing methods will go).
Is this something that we will likely want to do for many different countries?
How long does it take to run? Does the output file really need to be saved? Could it be saved in exploration instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why it is important and needs to be saved is because in pa-anticipatory-action
we use it again. But there we also have the CDS
source so the user can choose whether to use this data or the cds data. If choosing to use directly the ecmwf data it makes sense imo that you get one dataset that contains the combined api and realtime data. Would you agree?
It doesn't take super long to run so maybe somehow we don't have to save it if we port everything to aa-toolbox
but not sure if that will work
Would probably run it for each country that uses the ecmwf data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get one dataset that contains the combined api and realtime data
I guess my hesitation is that it doesn't really fit into the paradigm of what datasource
is supposed to be, i.e. the download, cleaning, and loading of data from a single source. The combination of sources feels more like post-processing to me. Do you think the results could be saved in exploration?
Uff these edge cases are always the hardest 😅 I will think about it a bit more.
Path to processed NetCDF file | ||
|
||
""" | ||
ecmwf_rt = EcmwfRealtime(iso3=self._iso3, geobb=self._geobb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not call the classes here, but rather pass their outputs as input paremeters. This can be handled by the pipeline
self._geobb = GeoBoundingBox(north=90, south=-90, east=0, west=360) | ||
# round coordinates to correspond with the grid ecmwf publishes | ||
# its data on | ||
self._geobb.round_coords(round_val=GRID_RESOLUTION) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also not set this but rather pass it in from the other ECMWF class
GRID_RESOLUTION = 0.4 # degrees | ||
|
||
|
||
# question: should Ecmwf be a super class of EcmwfApi and EcmwfRealtime? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No because most of the methods would just override each other from the two classes since they all have the same names (almost). Multiple inheritance is more useful (I think) if you have two very different classes.
As mentioned above, I'm not sure this makese sense as a class, but I'd be interested to see how it looks once you've made the changes I suggested below.
@@ -89,6 +89,7 @@ def round_boundingbox_coords( | |||
elif direction in ("south", "west"): | |||
function = np.floor.__call__ # needed for mypy | |||
offset_factor = -1 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arghh the rounded_coord
has to deal with floating inprecision. So we get outputs like 32.80000000000000182076576039
while we would like just 32.8
. This is problematic cause when round_coords()
is then called twice, it rounds up again which we wouldn't want.
@turnerm Do you know a solution for this? The internet seems to suggest using Decimal()
but wondering if you have a better idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes I think using decimal would make a lot of sense here. Furthermore, it might be good to make it only possible to run round_coords
once (how is it getting run twice?). But this could be moved to an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 it is being run twice due to the lovely combination class but even then I think it would be good to have a check that it isn't accidentally ran twice in some way. Do you know how we could do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could make an instance attributed called _is_rounded
that is set to True
once the coords have been rounded, and then when running round_coords
just check this attribute and exit out if it's True
|
||
def download( | ||
self, | ||
min_date: Union[str, date] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min_date: Union[str, date] = None, | |
min_date: Union[str, date, None] = None, |
def download( | ||
self, | ||
min_date: Union[str, date] = None, | ||
max_date: Union[str, date] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
max_date: Union[str, date] = None, | |
max_date: Union[str, date, None] = None, |
""" | ||
return xr.load_dataset(self._get_processed_path()) | ||
|
||
def _get_raw_path(self, date_forecast: Union[pd.Timestamp, None]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _get_raw_path(self, date_forecast: Union[pd.Timestamp, None]): | |
def _get_raw_path(self, date_forecast: Optional[pd.Timestamp]): |
import logging | ||
from datetime import date | ||
from pathlib import Path | ||
from typing import Union |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from typing import Union | |
from typing import Optional, Union |
) | ||
# the geo_bounding_box indicates the boundaries for which data is | ||
# downloaded and processed | ||
if type(geo_bounding_box) == gpd.GeoDataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if type(geo_bounding_box) == gpd.GeoDataFrame: | |
if isinstance(geo_bounding_box, gpd.GeoDataFrame): |
@@ -101,3 +105,101 @@ def _load_codab(self, layer_name: str, clobber: bool): | |||
clobber=clobber, | |||
) | |||
return self._codab.load_admin_layer(layer_name=layer_name) | |||
|
|||
class Coordinates(TypedDict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also use a NamedTuple
here
@@ -1,8 +1,12 @@ | |||
"""Pipeline initializer.""" | |||
from typing import TypedDict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from typing import TypedDict | |
from typing import Optional, TypedDict |
self, | ||
from_codab: bool = True, | ||
from_config: bool = False, | ||
coordinates: Coordinates = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coordinates: Coordinates = None, | |
coordinates: Optional[Coordinates] = None, |
from_codab: bool = True, | ||
from_config: bool = False, | ||
coordinates: Coordinates = None, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
): | |
) -> GeoBoundingBox: |
"None of the options given to retrieve a geoboundingbox." | ||
) | ||
|
||
def load_ecmwf_realtime(self, process: bool = False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def load_ecmwf_realtime(self, process: bool = False): | |
def load_ecmwf_realtime(self, process: bool = False) -> xr.Dataset: |
I added functionality to process the seasonal forecasts with us that are shared in realtime and combine this with the API data.
In that process I also ended up restructuring the API code and renaming files.
It is not fully finished (some todos and questions are marked) but a step in a good direction imo.