create to_xarray function #85

Closed · 4 tasks done
veenstrajelmer opened this issue Mar 14, 2024 · 0 comments · Fixed by #86

veenstrajelmer (Collaborator) commented Mar 14, 2024

measurements.to_xarray() is possible, but it gives a huge dataset, which is a bit inconvenient. ddlpy.simplify_dataframe(measurements).to_xarray() is better, but it still gives all Code/Omschrijving columns as separate variables. For BAALHK 1980-2024 this results in a dataset of >1GB, mainly due to the WaardeBepalingsmethode.Omschrijving variable.

Data variables:
    Status                                            (time) <U15 156MB ...
    WaarnemingMetadata.OpdrachtgevendeInstantieLijst  (time) <U12 125MB ...
    QC                                                (time) <U2 21MB ...
    AquoMetadata_MessageID                            (time) int64 21MB ...
    WaardeBepalingsmethode.Code                       (time) <U10 104MB ...
    WaardeBepalingsmethode.Omschrijving               (time) <U65 675MB ...
    waterlevel                                        (time) float64 21MB ...
    station_name                                      |S64 64B ...
    station_id                                        |S64 64B ...
    station_x_coordinate                              float64 8B ...
    station_y_coordinate                              float64 8B ...
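
As a quick check of where the memory goes, the per-variable sizes can be printed from the dataset itself; a minimal sketch, assuming ds is the dataset shown above:

for name, da in ds.data_vars.items():
    # dtype and size per data variable
    print(f"{name:50s} {str(da.dtype):6s} {da.nbytes/1024**2:10.1f} MB")
print(f"total: {ds.nbytes/1024**3:.2f} GB")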

Todo:

  • It would be great to set these as an attribute table on WaardeBepalingsmethode.Code, but showing all available values (Get information about possible values for different metadata keys #28). >> no
  • Also consider combining ds.attrs by merging the Code/Omschrijving attrs as f"{Code} ({Omschrijving})". >> no, instead add a list of code/omschrijving keys/values to the variables
  • Furthermore, consider automatically converting strings that contain only numeric values to integers. This would probably reduce the size of WaardeBepalingsmethode.Code significantly. It might be wise to do this earlier in the process already, although that might break some pipelines. >> not generic
  • Consider always keeping some columns as variables (like kwaliteitswaardecode, status, maybe others), since that would make concatenating datasets much easier (see the sketch after this list).
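
A minimal sketch of why the last point matters, assuming measurements_a and measurements_b are measurement dataframes for two stations or periods (hypothetical names): concatenation along time only works smoothly if both datasets contain the same variables, which is guaranteed when columns like status and the quality code are always kept as variables.

import xarray as xr
import ddlpy

ds_a = ddlpy.simplify_dataframe(measurements_a).to_xarray()
ds_b = ddlpy.simplify_dataframe(measurements_b).to_xarray()
# works directly if both datasets ended up with the same set of variables
ds = xr.concat([ds_a, ds_b], dim="time")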

Example code that retrieves VLISSGN data with multiple WaardeBepalingsmethode values (the method changed on 07-09-1993):

import ddlpy
locations = ddlpy.locations()
bool_hoedanigheid = locations['Hoedanigheid.Code'].isin(['NAP'])
bool_stations = locations.index.isin(['VLISSGN'])
bool_grootheid = locations['Grootheid.Code'].isin(['WATHTE'])
bool_groepering = locations['Groepering.Code'].isin(['NVT'])
selected = locations.loc[bool_grootheid & bool_hoedanigheid & bool_groepering & bool_stations]

tstart_dt = "1993-08-25 09:47:00" #VLISSGN got new Waardebepalingsmethode in this year
tstop_dt = "1994-11-30 09:50:00"

measurements = ddlpy.measurements(selected.iloc[0], tstart_dt, tstop_dt)
simple = ddlpy.simplify_dataframe(measurements)
list_cols = ['WaardeBepalingsmethode.Code','WaardeBepalingsmethode.Omschrijving']
measurements[list_cols].drop_duplicates()

Alternative example (already works):

import ddlpy
locations = ddlpy.locations()
bool_hoedanigheid = locations['Hoedanigheid.Code'].isin(['NAP'])
bool_stations = locations.index.isin(['BAALHK'])
bool_grootheid = locations['Grootheid.Code'].isin(['WATHTE'])
bool_groepering = locations['Groepering.Code'].isin(['NVT'])
selected = locations.loc[bool_grootheid & bool_hoedanigheid & bool_groepering & bool_stations]

date_start = "1990-02-15"
date_end = "1990-03-15"
measurements = ddlpy.measurements(selected.iloc[0], date_start, date_end)
print(measurements['WaardeBepalingsmethode.Code'].drop_duplicates())

simple = ddlpy.simplify_dataframe(measurements)
ds1 = measurements.to_xarray()
ds2 = simple.to_xarray()
print(f'full [MB]: {ds1.nbytes/1024**2:.3f}')
print(f'simple [MB]: {ds2.nbytes/1024**2:.3f}')

Prints:

time
1990-02-27 01:00:00+01:00    other:F039
1990-03-01 00:00:00+01:00    other:F027
Name: WaardeBepalingsmethode.Code, dtype: object

full [MB]: 1.801
simple [MB]: 0.267
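
For the third todo item above (converting numeric-only code strings to integers, marked as not generic): the codes printed here (other:F039, other:F027) are not purely numeric, so such a conversion would have to be conditional per column. A rough sketch of that idea, for illustration only:

def maybe_to_int(series):
    # convert a string column to integers only if every value is purely numeric,
    # otherwise return the column unchanged
    if series.str.fullmatch(r"\d+", na=False).all():
        return series.astype("int64")
    return series

measurements["WaardeBepalingsmethode.Code"] = maybe_to_int(measurements["WaardeBepalingsmethode.Code"])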

Update 27-3-2024: With the catalog_filter argument added to ddlpy.locations(), or the private ddlpy.ddlpy.catalog() function added in #87, retrieving the extended catalog is easy. We can also subset the catalog dataframe directly:

import pandas as pd
import ddlpy

catalog_filter = ['Compartimenten','Eenheden','Grootheden',
                  'Hoedanigheden','Groeperingen','MeetApparaten',
                  'Typeringen','WaardeBepalingsmethoden','WaardeBewerkingsmethoden','Parameters']

if 1:
    # retrieve the extended catalog directly via the private catalog function
    result = ddlpy.ddlpy.catalog(catalog_filter=catalog_filter)
    cat_aquometadatalijst = pd.json_normalize(result["AquoMetadataLijst"])
else:
    # or use the extended locations dataframe, optionally subsetted
    locations = ddlpy.locations(catalog_filter=catalog_filter)
    cat_aquometadatalijst = locations
    # bool_grootheid = locations["Grootheid.Code"].isin(["WATHTE"])
    # bool_groepering = locations["Groepering.Code"].isin(["NVT"])
    # selected = locations.loc[bool_grootheid & bool_groepering]
    # cat_aquometadatalijst = selected

key_list_full = [x.replace(".Code","") for x in cat_aquometadatalijst.columns if x.endswith(".Code")]
# >> ['Compartiment', 'Eenheid', 'Grootheid', 'Hoedanigheid', 'MeetApparaat', 'Parameter', 'Typering', 'Groepering', 'WaardeBepalingsmethode']
for key in key_list_full:
    # unique Code/Omschrijving combinations for this metadata key
    df_meta_unique = cat_aquometadatalijst[[f"{key}.Code",f"{key}.Omschrijving"]].drop_duplicates()
    print(f'\n{len(df_meta_unique)} unique {key} available in requested subset')
    print(df_meta_unique)

    # mapping of Code -> Omschrijving, e.g. for use as variable attrs
    dict_meta_unique = df_meta_unique.set_index(f"{key}.Code")[f"{key}.Omschrijving"].to_dict()

This can be used to add metadata attrs, but the full sets of WaardeBepalingsmethode and Parameter values are way too large to add:

7 unique Compartiment available in requested subset
51 unique Eenheid available in requested subset
104 unique Grootheid available in requested subset
71 unique Hoedanigheid available in requested subset
34 unique MeetApparaat available in requested subset
895 unique Parameter available in requested subset
6 unique Typering available in requested subset
5 unique Groepering available in requested subset
510 unique WaardeBepalingsmethode available in requested subset
6 unique WaardeBewerkingsmethode available in requested subset

Instead, the unique values from the subsetted extended locations dataframe can be used. This is better than using the unique values from the measurements dataframe, since those will differ per station and time range.
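
A minimal sketch of that approach, assuming ds is a dataset created with ddlpy.simplify_dataframe(measurements).to_xarray() and selected is a subset of the extended locations dataframe retrieved with catalog_filter (as in the commented-out branch above):

key = "WaardeBepalingsmethode"
# unique Code -> Omschrijving pairs from the subsetted extended locations dataframe
df_meta_unique = selected[[f"{key}.Code", f"{key}.Omschrijving"]].drop_duplicates()
dict_meta_unique = df_meta_unique.set_index(f"{key}.Code")[f"{key}.Omschrijving"].to_dict()
# attach them as attrs on the corresponding Code variable, if it is present
if f"{key}.Code" in ds:
    ds[f"{key}.Code"].attrs.update(dict_meta_unique)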
