
Velocity aggregated time series product #92

Merged: 41 commits merged into master, Feb 25, 2020

Conversation

@diodon (Contributor) commented Feb 2, 2020:

This product flattens UCUR, VCUR and WCUR and references the values to their TIME and absolute DEPTH. The values are aggregated from all deployments at one site into an indexed ragged array structure, with OBSERVATION and INSTRUMENT as the sole dimensions.

@diodon diodon requested review from ocehugo and mhidas February 2, 2020 22:18
@diodon diodon self-assigned this Feb 2, 2020
with nc4.Dataset(file, 'r') as ds:
    time_start.append(np.datetime64(ds.time_deployment_start))
tuples = sorted(zip(time_start, files_to_agg))
return [t[1] for t in tuples]
Contributor:

  1. Change files_to_agg to files, file_list or file_str_list - there is no aggregation happening here, and the name should convey the type.
  2. Change file to filestr in the loop - again, convey the type.
  3. return [file for _, file in tuples] is clearer.
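
A minimal sketch of the helper with these renames applied (the function is named sort_files further down in the diff; everything else here is illustrative):

import numpy as np
import netCDF4 as nc4

def sort_files(file_list):
    """Sort netCDF file paths chronologically by their time_deployment_start attribute."""
    time_start = []
    for filestr in file_list:
        with nc4.Dataset(filestr, 'r') as ds:
            time_start.append(np.datetime64(ds.time_deployment_start))
    tuples = sorted(zip(time_start, file_list))
    return [file for _, file in tuples]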

Contributor Author:

changed

allowed_dimensions = ['TIME', 'LATITUDE', 'LONGITUDE', 'HEIGHT_ABOVE_SENSOR']
required_variables = ['UCUR', 'VCUR', 'WCUR']
error_list = []

Contributor:

Wouldn't allowed_dimensions and required_variables be better as global constants (outside the function scope)?
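
A minimal sketch of that promotion (the constant names are illustrative):

# module-level constants, defined once outside the function
ALLOWED_DIMENSIONS = ['TIME', 'LATITUDE', 'LONGITUDE', 'HEIGHT_ABOVE_SENSOR']
REQUIRED_VARIABLES = ['UCUR', 'VCUR', 'WCUR']

def check_file(nc, site_code):
    error_list = []
    ...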

required_variables = ['UCUR', 'VCUR', 'WCUR']
error_list = []

nc_site_code = nc.site_code
Contributor:

I think using "nc.site_code" directly in the if is better - you are not adding any information or scope by naming it again.

Contributor Author:

ok

if nc_site_code != site_code:
    error_list.append('Wrong site_code: ' + nc_site_code)

nc_file_version = nc.file_version
Contributor:

same as above

Contributor Author:

changed

return error_list


def get_nvalues(nc):
Contributor:

nvalues means nothing here, so the intent is not clear.

e.g. get_number_of_vertical_cells is much better - I don't even need to read the docstring.

Contributor:

Actually, why not just return the HEIGHT and TIME variables? The intent would be clearer in the function call - the name would be "get_dim_len" or similar.

Contributor Author:

I've renamed the function and variables for clarity. The objective is to return the number of values that result from flattening the grid.

Contributor:

see comment below about this function - it's not necessary.

return nvalues, nbins


def get_varvalues(nc, varname):
Contributor:

def flat_variable is clearer IMO

Contributor Author:

Agree, changed

return '; '.join([nc.deployment_code, nc.instrument, nc.instrument_serial_number])


def in_water(nc):
@ocehugo (Contributor) commented Feb 4, 2020:

You don't need to take the "nc" object as an argument; you just need an array and two strings. Hence, if it is defined as def get_in_water(time, start_str, end_str), the calling code would be:

time = nc['TIME']
in_water_ind = get_in_water(time, nc['time_coverage_start'], nc['time_coverage_end'])

Contributor:

I just noticed that the variable name here is confusing - you use nc to refer to an xarray dataset. Just rename the variable to xrobj or xobj. I made the comment above expecting the arguments to be netCDF4 objects.

This is a case where type hints shine.
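
For instance, a type-hinted signature (just a sketch, assuming the argument is an xarray Dataset) would make the expectation obvious at the call site:

import xarray as xr

def in_water(xobj: xr.Dataset) -> xr.Dataset:
    """cut data to in-water only timestamps, dropping the out-of-water records."""
    ...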

Contributor Author:

nc for xarray and ds for netCDF4 datasets

Contributor:

Also, you can now just import this function from aggregated_timeseries (and eventually from the utils/common module)


def in_water(nc):
"""
cut data to in-water only timestamps, dropping the out-of-water records.
Contributor:

"cut the entire dataset to"

Contributor Author:

added

:return: name of the resulting file, list of rejected files
"""

varlist = ['UCUR', 'VCUR', 'WCUR', 'DEPTH']
Contributor:

This block of vars can be promoted to keyword arguments and assigned dynamically, if required:

agg_options = {'varlist': [...], 'time_units': ...}  # or even a global dict
aggregate_velocity(..., varlist=[...], time_units=...)  # or **agg_options

Contributor Author:

could be, but we don't expect to aggregate anything other than those variables.

Contributor:

anyway, this is good practice both code-wise and intent-wise. For example, when reading, why does the "aggregate velocity" function not provide the option to select which velocities/variables I want to aggregate?

Code-wise, "default" options should rarely be defined in the scope of the function. Also, if you expose them as kwargs and need to change the parameters, you just change the function call instead of changing the function code!
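
A sketch of what that promotion could look like (the signature and the default time_units value are illustrative assumptions, not the PR's actual code):

def velocity_aggregated(files_to_agg, site_code,
                        varlist=('UCUR', 'VCUR', 'WCUR', 'DEPTH'),
                        time_units='days since 1950-01-01 00:00:00 UTC'):
    # callers can now override the defaults without editing the function body
    ...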

rejected_files = []

# default name for temporary file. It will be renamed at the end
outfile = 'Velocity_agg_tmp.nc'
Contributor:

use random names - if more than one thread calls this at the same time, things can get ugly.

Contributor Author:

changed to UUID name
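
One way to build such a name (the exact pattern used in the PR is not shown here, so this is illustrative):

import uuid

# a unique temporary name avoids collisions between concurrent runs
outfile = 'Velocity_agg_{}_tmp.nc'.format(uuid.uuid4().hex)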

outfile = 'Velocity_agg_tmp.nc'

## sort the file list in chronological order
files_to_agg = sort_files(files_to_agg)
Contributor:

sorted_files is better.

Contributor Author:

changed

for file in files_to_agg:
    with xr.open_dataset(file) as nc:
        ## clip to in water data only
        nc = in_water(nc)
Contributor:

in_water_only = in_water(nc)

Contributor:

or similar

Contributor:

don't hide the state mutation. In other words, do not reuse the name if the state changed - there are no copies here, only views.

Contributor:

Are you sure it's only a view? The documentation (of Dataset.where, used by the in_water function) is not quite clear, but I reckon it returns a new Dataset object. That's quite a waste if you just want to count the number of data values.

Maybe it's not a big deal in terms of execution time, but in any case you could move the clipping into the if not error_list: clause so you're not applying it to files you then reject.
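
Putting the two suggestions together, the loop body could be reordered along these lines (a sketch assembled from the hunks quoted in this thread):

for file in files_to_agg:
    with xr.open_dataset(file) as nc:
        error_list = check_file(nc, site_code)
        if not error_list:
            nc_in_water = in_water(nc)  # clip only the files that pass the checks
            varlen_list.append(get_nvalues(nc_in_water)[0])
        else:
            bad_files.append([file, error_list])
            rejected_files.append(file)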

## clip to in water data only
nc = in_water(nc)

varlen_file.append(get_nvalues(nc))
Contributor:

You actually don't need a 15-line get_nvalues function to execute two lines:

DIM_X = 1 if 'HEIGHT_ABOVE_SENSOR' not in nc else nc['HEIGHT_ABOVE_SENSOR'].size
file_size = DIM_X * nc['TIME'].size

nc = in_water(nc)

varlen_file.append(get_nvalues(nc))
error_list = check_file(nc, site_code)
Contributor:

check first - move this line and the if statement to the top

    varlen_list.append(get_nvalues(nc)[0])
else:
    bad_files.append([file, error_list])
    rejected_files.append(file)
Contributor:

you can create a "good_files" list instead and get rid of the loop in line 175. You can print the bad files, or just derive them when reporting: [x for x in files_to_agg if x not in good_files]
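
A compact sketch of that pattern (names taken from the surrounding hunks):

good_files = []
for file in files_to_agg:
    with xr.open_dataset(file) as nc:
        if not check_file(nc, site_code):  # an empty error list means the file passed
            good_files.append(file)
rejected_files = [f for f in files_to_agg if f not in good_files]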

files_to_agg.remove(file[0])


varlen_list = [0] + varlen_list
Contributor:

check emptiness instead of doing this.

Contributor Author:

zero is needed as the start index in the main loop.

Contributor:

After going down and back up again, I understand what this variable is. Hence, you should rename it - it's not a varlen_list, it's an index list.

Also, you could pre-compute the start/end indexes (lines 214/215) here instead and use them in the for loop (also easier to debug the index ranges).
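
A sketch of that pre-computation, using np.cumsum to build the index boundaries once (the variable name is illustrative, and varlen_list is assumed to hold the per-file value counts without the leading zero):

index_bounds = np.cumsum([0] + varlen_list)  # e.g. [0, n0, n0+n1, ...]
for index, file in enumerate(files_to_agg):
    start, end = index_bounds[index], index_bounds[index + 1]
    ...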

ds = nc4.Dataset(os.path.join(base_path, outfile), 'w')
OBSERVATION = ds.createDimension('OBSERVATION', size=varlen_total)
INSTRUMENT = ds.createDimension('INSTRUMENT', size=n_files)

Contributor:

create a dictionary with the options:

obs_float_template = {'datatype': 'float', 'zlib': True, 'dimensions': ('OBSERVATION',), 'fill_value': 99999.0}
obs_byte_template = {'datatype': 'byte', 'zlib': True, 'dimensions': ('OBSERVATION',), 'fill_value': 99}
...
vdefs['UCUR'] = {'varname': 'UCUR', **obs_float_template}
vdefs['UCURqc'] = {'varname': 'UCURqc', **obs_byte_template}
...
for vname, vopts in vdefs.items():
    variables[vname] = ds.createVariable(**vopts)

if 'WCUR' in nc.data_vars:
    WCUR[start:end] = get_varvalues(nc, 'WCUR')
    WCURqc[start:end] = get_varvalues(nc, 'WCUR_quality_control')
else:
Contributor:

There can be a case where WCUR is only found in some of the files... I would imagine that, if I asked for WCUR in varlist, I would get a partially filled vector where the variable is missing from some files.


start = sum(varlen_list[:index + 1])
end = sum(varlen_list[:index + 2])
n_cells = get_nvalues(nc)[1]
Contributor:

Again - it's not easy to guess what the second value returned by get_nvalues is, because the function name is not informative (you have to dig into the docstring).

Contributor Author:

solved

DEPTH[start:end] = nc.DEPTH.values
DEPTHqc[start:end] = nc.DEPTH_quality_control.values
## set TIME and instrument index
TIME[start:end] = (np.repeat(get_varvalues(nc, 'TIME'), n_cells) - epoch) / one_day
Contributor:

The expression inside the parentheses is a bit long. Since this is an assignment operation, I would do without the big parentheses:

TIME[start:end] = np.repeat(...)/one_day - epoch/one_day

The intent of the normalization reads much better (and it should be slightly quicker).
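
Spelled out against the original line from the hunk above, that would read:

TIME[start:end] = np.repeat(get_varvalues(nc, 'TIME'), n_cells) / one_day - epoch / one_day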

## get and store deployment metadata
LATITUDE[index] = nc.LATITUDE.values
LONGITUDE[index] = nc.LONGITUDE.values
NOMINAL_DEPTH[index] = TStools.get_nominal_depth(nc)
Contributor:

I don't understand the use of this package - TStools. You are only getting netCDF attributes; do you really need to import a package to do that?

Contributor Author:

the package contains several utility functions; I use three of them in this code. In the future, a refactored version of it will be a package of functions used by all the product-generating scripts.

Contributor:

IMO it's not required -> it's just noise for very little gain

Contributor:

get_nominal_depth returns the value of the NOMINAL_DEPTH variable, if it exists, and the global attribute instrument_nominal_depth otherwise, and this is used in all of the products, so it makes sense to have it in a common module.

ds[var].setncatts(variable_attribute_dictionary[var])

## set global attrs
timeformat = '%Y-%m-%dT%H:%M:%SZ'
Contributor:

another two options that should be kwargs.

timeformat = '%Y-%m-%dT%H:%M:%SZ'
file_timeformat = '%Y%m%d'

time_start = nc4.num2date(np.min(TIME[:]), time_units, time_calendar).strftime(timeformat)
Contributor:

turn this block into a function:

set_global_attrs(ds, time_units, time_calendar, timeformat, ...)

or

options = {...}
set_global_attrs(ds, **options)

ds.close()


## create the output file name and rename the tmp file
Contributor:

another case for a function: output_name = rename_file(oldname, newname). Also, you need to check for overwrites (renaming to an already existing file).

Personally, I don't think you need to use "tmp" files or to rename at all - just do the kind of checks you are doing here and above before processing/concatenating the files (reading attributes, etc.). If you think of these steps (check stuff, create a valid name) as a validation step, then you only spend time on valid inputs.

For example, what is the use of an entire concatenated array in memory if you can't find the site_code, facility_code, contributors, etc.? Check the easy things first...
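
A sketch of such a helper with an overwrite guard (hypothetical; the PR's actual renaming code is not shown in this thread):

import os

def rename_file(oldname, newname):
    # refuse to clobber an existing product file
    if os.path.exists(newname):
        raise FileExistsError('refusing to overwrite ' + newname)
    os.rename(oldname, newname)
    return newname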

@@ -0,0 +1,169 @@
{
Contributor:

I imagine you probably have a lot of templates at the moment - maybe it's time for some refactoring?

For example, a lot of fields are probably repeated across templates - maybe one template to rule them all, with small templates that only overwrite specific bits. This would also avoid spreading typos and slightly different entries, and would help maintenance (imagine something needs to change - you would have to remember to change all the templates...).
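
A sketch of the one-base-template idea (the file names are hypothetical):

import json

with open('base_template.json') as f:
    template = json.load(f)
with open('velocity_overrides.json') as f:
    template.update(json.load(f))  # product-specific entries overwrite the base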

Contributor Author:

it is in the plan

@ocehugo (Contributor) commented Feb 4, 2020:

@diodon, don't forget to fix the aggregate_timeseries part - tests are not passing and there are some warnings about duplicated fill_values in the ncwriter.

@mhidas (Contributor) left a comment:

👍 Looks pretty good. Just a few more comments from me :)

@mhidas (Contributor) left a comment:

Added a few comments on the documentation.


## Objective

This product provides aggregated U, V, and W velocity time-series files for each mooring site, without any interpolation or filtering, except for the exclusion of the out-of-water data. For the ADCP instruments, the absulte depth of the measuring cell is calculated using the `DEPTH` measured at the instrument and the `HEIGHT_ABOVE_SENSOR`, The Quality Control (QC) flags are preserved. All the (python) code used for the generation of the products is openly available on GitHub.
Contributor:

Suggested change:
- This product provides aggregated U, V, and W velocity time-series files for each mooring site, without any interpolation or filtering, except for the exclusion of the out-of-water data. For the ADCP instruments, the absulte depth of the measuring cell is calculated using the `DEPTH` measured at the instrument and the `HEIGHT_ABOVE_SENSOR`, The Quality Control (QC) flags are preserved. All the (python) code used for the generation of the products is openly available on GitHub.
+ This product provides aggregated U, V, and W velocity time-series files for each mooring site, without any interpolation or filtering, except for the exclusion of the out-of-water data. For the profiling (ADCP) instruments, the absolute depth of the measuring cell is calculated using the `DEPTH` measured at the instrument and the `HEIGHT_ABOVE_SENSOR`. The Quality Control (QC) flags are preserved.

Contributor Author:

changed

@mhidas (Contributor) commented Feb 25, 2020:

You applied my documentation suggestions to the wrong file (aggregated_timeseries.md instead of velocity_aggregated_timeseries.md)!

},
"CELL_INDEX": {
    "long_name": "index of the corresponding measuring cell",
    "comment": "Cell index is included for reference only and cannot be used to extract values at constant depth. The number and vertical spacing of cells can vary by instrument and deployment. The vertical spacing also varies with time during a deployment. The closest cell to the sensor has index 0."
Contributor:

@diodon @ggalibert Do you think this is clear enough?

@mhidas (Contributor) commented Feb 25, 2020:

Ok, I think this is good enough for a first go!

@mhidas mhidas merged commit 181a342 into master Feb 25, 2020
@mhidas mhidas deleted the velocity_aggregated branch February 25, 2020 05:16