create to_xarray function #85

Closed · 4 tasks done
veenstrajelmer opened this issue Mar 14, 2024 · 0 comments · Fixed by #86

veenstrajelmer (Collaborator) commented Mar 14, 2024

measurements.to_xarray() is possible, but it gives a huge dataset, which is a bit inconvenient. ddlpy.simplify_dataframe(measurements).to_xarray() is better, but it still gives all Code/Omschrijving columns as separate variables. For BAALHK 1980-2024 this results in a dataset of >1GB, mainly due to the WaardeBepalingsmethode.Omschrijving variable.

Data variables:
    Status                                            (time) <U15 156MB ...
    WaarnemingMetadata.OpdrachtgevendeInstantieLijst  (time) <U12 125MB ...
    QC                                                (time) <U2 21MB ...
    AquoMetadata_MessageID                            (time) int64 21MB ...
    WaardeBepalingsmethode.Code                       (time) <U10 104MB ...
    WaardeBepalingsmethode.Omschrijving               (time) <U65 675MB ...
    waterlevel                                        (time) float64 21MB ...
    station_name                                      |S64 64B ...
    station_id                                        |S64 64B ...
    station_x_coordinate                              float64 8B ...
    station_y_coordinate                              float64 8B ...
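
As a quick check of where the memory goes, the per-variable sizes can be printed from the dataset itself; a minimal sketch, assuming ds is the dataset shown above:

for name, da in ds.data_vars.items():
    # dtype and size per data variable
    print(f"{name:50s} {str(da.dtype):6s} {da.nbytes/1024**2:10.1f} MB")
print(f"total: {ds.nbytes/1024**3:.2f} GB")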

Todo:

  • It would be great to set these as an attribute table on WaardeBepalingsmethode.Code, but showing all available values (Get information about possible values for different metadata keys #28). >> no
  • Also consider combining ds.attrs by merging the Code/Omschrijving attrs as f"{Code} ({Omschrijving})". >> no, instead add a list of code/omschrijving keys/values to the variables
  • Furthermore, consider automatically converting strings that contain only numeric values to integers. This would probably reduce the size of WaardeBepalingsmethode.Code significantly. It might be wise to do this earlier in the process already, although that might break some pipelines. >> not generic
  • Consider always keeping some columns as variables (like kwaliteitswaardecode, status, maybe others), since that would make concatenating datasets much easier (see the sketch after this list).
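
A minimal sketch of why the last point matters, assuming measurements_a and measurements_b are measurement dataframes for two stations or periods (hypothetical names): concatenation along time only works smoothly if both datasets contain the same variables, which is guaranteed when columns like status and the quality code are always kept as variables.

import xarray as xr
import ddlpy

ds_a = ddlpy.simplify_dataframe(measurements_a).to_xarray()
ds_b = ddlpy.simplify_dataframe(measurements_b).to_xarray()
# works directly if both datasets ended up with the same set of variables
ds = xr.concat([ds_a, ds_b], dim="time")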

Example code that retrieves VLISSGN data with multiple WaardeBepalingsmethode values (the method changed on 07-09-1993):

import ddlpy
locations = ddlpy.locations()
bool_hoedanigheid = locations['Hoedanigheid.Code'].isin(['NAP'])
bool_stations = locations.index.isin(['VLISSGN'])
bool_grootheid = locations['Grootheid.Code'].isin(['WATHTE'])
bool_groepering = locations['Groepering.Code'].isin(['NVT'])
selected = locations.loc[bool_grootheid & bool_hoedanigheid & bool_groepering & bool_stations]

tstart_dt = "1993-08-25 09:47:00" #VLISSGN got new Waardebepalingsmethode in this year
tstop_dt = "1994-11-30 09:50:00"

measurements = ddlpy.measurements(selected.iloc[0], tstart_dt, tstop_dt)
simple = ddlpy.simplify_dataframe(measurements)
list_cols = ['WaardeBepalingsmethode.Code','WaardeBepalingsmethode.Omschrijving']
measurements[list_cols].drop_duplicates()

Alternative example (already works):

import ddlpy
locations = ddlpy.locations()
bool_hoedanigheid = locations['Hoedanigheid.Code'].isin(['NAP'])
bool_stations = locations.index.isin(['BAALHK'])
bool_grootheid = locations['Grootheid.Code'].isin(['WATHTE'])
bool_groepering = locations['Groepering.Code'].isin(['NVT'])
selected = locations.loc[bool_grootheid & bool_hoedanigheid & bool_groepering & bool_stations]

date_start = "1990-02-15"
date_end = "1990-03-15"
measurements = ddlpy.measurements(selected.iloc[0], date_start, date_end)
print(measurements['WaardeBepalingsmethode.Code'].drop_duplicates())

simple = ddlpy.simplify_dataframe(measurements)
ds1 = measurements.to_xarray()
ds2 = simple.to_xarray()
print(f'full [MB]: {ds1.nbytes/1024**2:.3f}')
print(f'simple [MB]: {ds2.nbytes/1024**2:.3f}')

Prints:

time
1990-02-27 01:00:00+01:00    other:F039
1990-03-01 00:00:00+01:00    other:F027
Name: WaardeBepalingsmethode.Code, dtype: object

full [MB]: 1.801
simple [MB]: 0.267
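
For the third todo item above (converting numeric-only code strings to integers, marked as not generic): the codes printed here (other:F039, other:F027) are not purely numeric, so such a conversion would have to be conditional per column. A rough sketch of that idea, for illustration only:

def maybe_to_int(series):
    # convert a string column to integers only if every value is purely numeric,
    # otherwise return the column unchanged
    if series.str.fullmatch(r"\d+", na=False).all():
        return series.astype("int64")
    return series

measurements["WaardeBepalingsmethode.Code"] = maybe_to_int(measurements["WaardeBepalingsmethode.Code"])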

Update 27-3-2024: With the catalog_filter argument added to ddlpy.locations(), or the private ddlpy.ddlpy.catalog() function added in #87, retrieving the extended catalog is easy. We can also subset the catalog dataframe directly:

import pandas as pd
import ddlpy

catalog_filter = ['Compartimenten','Eenheden','Grootheden',
                  'Hoedanigheden','Groeperingen','MeetApparaten',
                  'Typeringen','WaardeBepalingsmethoden','WaardeBewerkingsmethoden','Parameters']

if 1:
    # retrieve the extended catalog directly via the private catalog function
    result = ddlpy.ddlpy.catalog(catalog_filter=catalog_filter)
    cat_aquometadatalijst = pd.json_normalize(result["AquoMetadataLijst"])
else:
    # or use the extended locations dataframe, optionally subsetted
    locations = ddlpy.locations(catalog_filter=catalog_filter)
    cat_aquometadatalijst = locations
    # bool_grootheid = locations["Grootheid.Code"].isin(["WATHTE"])
    # bool_groepering = locations["Groepering.Code"].isin(["NVT"])
    # selected = locations.loc[bool_grootheid & bool_groepering]
    # cat_aquometadatalijst = selected

key_list_full = [x.replace(".Code","") for x in cat_aquometadatalijst.columns if x.endswith(".Code")]
# >> ['Compartiment', 'Eenheid', 'Grootheid', 'Hoedanigheid', 'MeetApparaat', 'Parameter', 'Typering', 'Groepering', 'WaardeBepalingsmethode']
for key in key_list_full:
    # unique Code/Omschrijving combinations for this metadata key
    df_meta_unique = cat_aquometadatalijst[[f"{key}.Code",f"{key}.Omschrijving"]].drop_duplicates()
    print(f'\n{len(df_meta_unique)} unique {key} available in requested subset')
    print(df_meta_unique)

    # mapping of Code -> Omschrijving, e.g. for use as variable attrs
    dict_meta_unique = df_meta_unique.set_index(f"{key}.Code")[f"{key}.Omschrijving"].to_dict()

This can be used to add metadata attrs, but the full sets of WaardeBepalingsmethode and Parameter values are way too large to add:

7 unique Compartiment available in requested subset
51 unique Eenheid available in requested subset
104 unique Grootheid available in requested subset
71 unique Hoedanigheid available in requested subset
34 unique MeetApparaat available in requested subset
895 unique Parameter available in requested subset
6 unique Typering available in requested subset
5 unique Groepering available in requested subset
510 unique WaardeBepalingsmethode available in requested subset
6 unique WaardeBewerkingsmethode available in requested subset

Instead, the unique values from the subsetted extended locations dataframe can be used. This is better than using the unique values from the measurements dataframe, since those will differ per station and time range.
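
A minimal sketch of that approach, assuming ds is a dataset created with ddlpy.simplify_dataframe(measurements).to_xarray() and selected is a subset of the extended locations dataframe retrieved with catalog_filter (as in the commented-out branch above):

key = "WaardeBepalingsmethode"
# unique Code -> Omschrijving pairs from the subsetted extended locations dataframe
df_meta_unique = selected[[f"{key}.Code", f"{key}.Omschrijving"]].drop_duplicates()
dict_meta_unique = df_meta_unique.set_index(f"{key}.Code")[f"{key}.Omschrijving"].to_dict()
# attach them as attrs on the corresponding Code variable, if it is present
if f"{key}.Code" in ds:
    ds[f"{key}.Code"].attrs.update(dict_meta_unique)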
