
Velocity aggregated time series product #92

Merged: 41 commits merged into master, Feb 25, 2020

Conversation

@diodon (Contributor) commented Feb 2, 2020:

This product flattens UCUR, VCUR and WCUR and references the values to their TIME and absolute DEPTH. The values are aggregated from all deployments at one site into an indexed ragged array structure, with OBSERVATION and INSTRUMENT as the sole dimensions.

@diodon diodon requested review from ocehugo and mhidas February 2, 2020 22:18
@diodon diodon self-assigned this Feb 2, 2020
with nc4.Dataset(file, 'r') as ds:
    time_start.append(np.datetime64(ds.time_deployment_start))
tuples = sorted(zip(time_start, files_to_agg))
return [t[1] for t in tuples]
Contributor:

  1. Change files_to_agg to files, file_list or file_str_list - there is no aggregation happening here, and the name should convey the type.
  2. Change file to filestr in the loop - again, convey the type.
  3. return [file for _, file in tuples] is clearer.
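
A minimal sketch of the helper with these renames applied (the function is named sort_files further down in the diff; everything else here is illustrative):

import numpy as np
import netCDF4 as nc4

def sort_files(file_list):
    """Sort netCDF file paths chronologically by their time_deployment_start attribute."""
    time_start = []
    for filestr in file_list:
        with nc4.Dataset(filestr, 'r') as ds:
            time_start.append(np.datetime64(ds.time_deployment_start))
    tuples = sorted(zip(time_start, file_list))
    return [file for _, file in tuples]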

Contributor Author:

changed

allowed_dimensions = ['TIME', 'LATITUDE', 'LONGITUDE', 'HEIGHT_ABOVE_SENSOR']
required_variables = ['UCUR', 'VCUR', 'WCUR']
error_list = []

Contributor:

Wouldn't allowed_dimensions and required_variables be better as global constants (outside the function scope)?
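
A minimal sketch of that promotion (the constant names are illustrative):

# module-level constants, defined once outside the function
ALLOWED_DIMENSIONS = ['TIME', 'LATITUDE', 'LONGITUDE', 'HEIGHT_ABOVE_SENSOR']
REQUIRED_VARIABLES = ['UCUR', 'VCUR', 'WCUR']

def check_file(nc, site_code):
    error_list = []
    ...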

required_variables = ['UCUR', 'VCUR', 'WCUR']
error_list = []

nc_site_code = nc.site_code
Contributor:

I think using "nc.site_code" directly in the if is better - you are not adding any information or scope by naming it again.

Contributor Author:

ok

if nc_site_code != site_code:
    error_list.append('Wrong site_code: ' + nc_site_code)

nc_file_version = nc.file_version
Contributor:

same as above

Contributor Author:

changed

return error_list


def get_nvalues(nc):
Contributor:

nvalues means nothing here, so the intent is not clear.

e.g. get_number_of_vertical_cells is much better - I don't even need to read the docstring.

Contributor:

Actually, why not just return the HEIGHT and TIME variables? The intent would be clearer in the function call - the name would be "get_dim_len" or similar.

Contributor Author:

I've renamed the function and variables for clarity. The objective is to return the number of values that result from flattening the grid.

Contributor:

see comment below about this function - it's not necessary.

return nvalues, nbins


def get_varvalues(nc, varname):
Contributor:

def flat_variable is clearer IMO

Contributor Author:

Agree, changed

return '; '.join([nc.deployment_code, nc.instrument, nc.instrument_serial_number])


def in_water(nc):
@ocehugo (Contributor) commented Feb 4, 2020:

You don't need to take the "nc" object as an argument; you just need an array and two strings. Hence, if it is defined as def get_in_water(time, start_str, end_str), the calling code would be:

time = nc['TIME']
in_water_ind = get_in_water(time, nc['time_coverage_start'], nc['time_coverage_end'])

Contributor:

I just noticed that the variable name here is confusing - you use nc to refer to an xarray dataset. Just rename the variable to xrobj or xobj. I made the comment above expecting the arguments to be netCDF4 objects.

This is a case where type hints shine.
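
For instance, a type-hinted signature (just a sketch, assuming the argument is an xarray Dataset) would make the expectation obvious at the call site:

import xarray as xr

def in_water(xobj: xr.Dataset) -> xr.Dataset:
    """cut data to in-water only timestamps, dropping the out-of-water records."""
    ...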

Contributor Author:

nc for xarray and ds for netCDF4 datasets

Contributor:

Also, you can now just import this function from aggregated_timeseries (and eventually from the utils/common module)


def in_water(nc):
"""
cut data to in-water only timestamps, dropping the out-of-water records.
Contributor:

"cut the entire dataset to"

Contributor Author:

added

:return: name of the resulting file, list of rejected files
"""

varlist = ['UCUR', 'VCUR', 'WCUR', 'DEPTH']
Contributor:

This block of vars can be promoted to keyword arguments and assigned dynamically, if required:

agg_options = {'varlist': [...], 'time_units': ...}  # or even a global dict
aggregate_velocity(..., varlist=[...], time_units=...)  # or **agg_options

Contributor Author:

could be, but we don't expect to aggregate anything other than those variables.

Contributor:

anyway, this is good practice both code-wise and intent-wise. For example, when reading, why does the "aggregate velocity" function not provide the option to select which velocities/variables I want to aggregate?

Code-wise, "default" options should rarely be defined in the scope of the function. Also, if you expose them as kwargs and need to change the parameters, you just change the function call instead of changing the function code!
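
A sketch of what that promotion could look like (the signature and the default time_units value are illustrative assumptions, not the PR's actual code):

def velocity_aggregated(files_to_agg, site_code,
                        varlist=('UCUR', 'VCUR', 'WCUR', 'DEPTH'),
                        time_units='days since 1950-01-01 00:00:00 UTC'):
    # callers can now override the defaults without editing the function body
    ...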

rejected_files = []

# default name for temporary file. It will be renamed at the end
outfile = 'Velocity_agg_tmp.nc'
Contributor:

use random names - if more than one thread calls this at the same time, things can get ugly.

Contributor Author:

changed to UUID name
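
One way to build such a name (the exact pattern used in the PR is not shown here, so this is illustrative):

import uuid

# a unique temporary name avoids collisions between concurrent runs
outfile = 'Velocity_agg_{}_tmp.nc'.format(uuid.uuid4().hex)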

outfile = 'Velocity_agg_tmp.nc'

## sort the file list in chronological order
files_to_agg = sort_files(files_to_agg)
Contributor:

sorted_files is better.

Contributor Author:

changed

for file in files_to_agg:
    with xr.open_dataset(file) as nc:
        ## clip to in water data only
        nc = in_water(nc)
Contributor:

in_water_only = in_water(nc)

Contributor:

or similar

Contributor:

don't hide the state mutation. In other words, do not reuse the name if the state changed - there are no copies here, only views.

Contributor:

Are you sure it's only a view? The documentation (of Dataset.where, used by the in_water function) is not quite clear, but I reckon it returns a new Dataset object. That's quite a waste if you just want to count the number of data values.

Maybe it's not a big deal in terms of execution time, but in any case you could move the clipping into the if not error_list: clause so you're not applying it to files you then reject.
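
Putting the two suggestions together, the loop body could be reordered along these lines (a sketch assembled from the hunks quoted in this thread):

for file in files_to_agg:
    with xr.open_dataset(file) as nc:
        error_list = check_file(nc, site_code)
        if not error_list:
            nc_in_water = in_water(nc)  # clip only the files that pass the checks
            varlen_list.append(get_nvalues(nc_in_water)[0])
        else:
            bad_files.append([file, error_list])
            rejected_files.append(file)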

## clip to in water data only
nc = in_water(nc)

varlen_file.append(get_nvalues(nc))
Contributor:

You actually don't need a 15-line get_nvalues function to execute two lines:

DIM_X = 1 if 'HEIGHT_ABOVE_SENSOR' not in nc else nc['HEIGHT_ABOVE_SENSOR'].size
file_size = DIM_X * nc['TIME'].size

nc = in_water(nc)

varlen_file.append(get_nvalues(nc))
error_list = check_file(nc, site_code)
Contributor:

check first - move this line and the if statement to the top

    varlen_list.append(get_nvalues(nc)[0])
else:
    bad_files.append([file, error_list])
    rejected_files.append(file)
Contributor:

you can create a "good_files" list instead and get rid of the loop in line 175. You can print the bad files, or just derive them when reporting: [x for x in files_to_agg if x not in good_files]
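
A compact sketch of that pattern (names taken from the surrounding hunks):

good_files = []
for file in files_to_agg:
    with xr.open_dataset(file) as nc:
        if not check_file(nc, site_code):  # an empty error list means the file passed
            good_files.append(file)
rejected_files = [f for f in files_to_agg if f not in good_files]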

files_to_agg.remove(file[0])


varlen_list = [0] + varlen_list
Contributor:

check emptiness instead of doing this.

Contributor Author:

zero is needed as the start index in the main loop.

Contributor:

After going down and back up again, I understand what this variable is. Hence, you should rename it - it's not a varlen_list, it's an index list.

Also, you could pre-compute the start/end indexes (lines 214/215) here instead and use them in the for loop (also easier to debug the index ranges).
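
A sketch of that pre-computation, using np.cumsum to build the index boundaries once (the variable name is illustrative, and varlen_list is assumed to hold the per-file value counts without the leading zero):

index_bounds = np.cumsum([0] + varlen_list)  # e.g. [0, n0, n0+n1, ...]
for index, file in enumerate(files_to_agg):
    start, end = index_bounds[index], index_bounds[index + 1]
    ...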

ds = nc4.Dataset(os.path.join(base_path, outfile), 'w')
OBSERVATION = ds.createDimension('OBSERVATION', size=varlen_total)
INSTRUMENT = ds.createDimension('INSTRUMENT', size=n_files)

Contributor:

create a dictionary with the options:

obs_float_template = {'datatype': 'float', 'zlib': True, 'dimensions': ('OBSERVATION',), 'fill_value': 99999.0}
obs_byte_template = {'datatype': 'byte', 'zlib': True, 'dimensions': ('OBSERVATION',), 'fill_value': 99}
...
vdefs['UCUR'] = {'varname': 'UCUR', **obs_float_template}
vdefs['UCURqc'] = {'varname': 'UCURqc', **obs_byte_template}
...
for vname, vopts in vdefs.items():
    variables[vname] = ds.createVariable(**vopts)

if 'WCUR' in nc.data_vars:
    WCUR[start:end] = get_varvalues(nc, 'WCUR')
    WCURqc[start:end] = get_varvalues(nc, 'WCUR_quality_control')
else:
Contributor:

There can be a case where WCUR is only found in some of the files... I would imagine that, if I asked for WCUR in varlist, I would get a partially filled vector where the variable is missing from some files.


start = sum(varlen_list[:index + 1])
end = sum(varlen_list[:index + 2])
n_cells = get_nvalues(nc)[1]
Contributor:

Again - it's not easy to guess what the second value returned by get_nvalues is, because the function name is not informative (you have to dig into the docstring).

Contributor Author:

solved

DEPTH[start:end] = nc.DEPTH.values
DEPTHqc[start:end] = nc.DEPTH_quality_control.values
## set TIME and instrument index
TIME[start:end] = (np.repeat(get_varvalues(nc, 'TIME'), n_cells) - epoch) / one_day
Contributor:

The expression inside the parentheses is a bit long. Since this is an assignment operation, I would do without the big parentheses:

TIME[start:end] = np.repeat(...)/one_day - epoch/one_day

The intent of the normalization reads much better (and it should be slightly quicker).
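
Spelled out against the original line from the hunk above, that would read:

TIME[start:end] = np.repeat(get_varvalues(nc, 'TIME'), n_cells) / one_day - epoch / one_day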

## get and store deployment metadata
LATITUDE[index] = nc.LATITUDE.values
LONGITUDE[index] = nc.LONGITUDE.values
NOMINAL_DEPTH[index] = TStools.get_nominal_depth(nc)
Contributor:

I don't understand the use of this package - TStools. You are only getting netCDF attributes; do you really need to import a package to do that?

Contributor Author:

the package contains several utility functions; I use three of them in this code. In the future, a refactored version of it will be a package of functions used by all the product-generating scripts.

Contributor:

IMO it's not required -> it's just noise for very little gain

Contributor:

get_nominal_depth returns the value of the NOMINAL_DEPTH variable, if it exists, and the global attribute instrument_nominal_depth otherwise, and this is used in all of the products, so it makes sense to have it in a common module.

ds[var].setncatts(variable_attribute_dictionary[var])

## set global attrs
timeformat = '%Y-%m-%dT%H:%M:%SZ'
Contributor:

another two options that should be kwargs.

timeformat = '%Y-%m-%dT%H:%M:%SZ'
file_timeformat = '%Y%m%d'

time_start = nc4.num2date(np.min(TIME[:]), time_units, time_calendar).strftime(timeformat)
Contributor:

turn this block into a function:

set_global_attrs(ds, time_units, time_calendar, timeformat, ...)

or

options = {...}
set_global_attrs(ds, **options)

ds.close()


## create the output file name and rename the tmp file
Contributor:

another case for a function: output_name = rename_file(oldname, newname). Also, you need to check for overwrites (renaming to an already existing file).

Personally, I don't think you need to use "tmp" files or to rename at all - just do the kind of checks you are doing here and above before processing/concatenating the files (reading attributes, etc.). If you think of these steps (check stuff, create a valid name) as a validation step, then you only spend time on valid inputs.

For example, what is the use of an entire concatenated array in memory if you can't find the site_code, facility_code, contributors, etc.? Check the easy things first...
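
A sketch of such a helper with an overwrite guard (hypothetical; the PR's actual renaming code is not shown in this thread):

import os

def rename_file(oldname, newname):
    # refuse to clobber an existing product file
    if os.path.exists(newname):
        raise FileExistsError('refusing to overwrite ' + newname)
    os.rename(oldname, newname)
    return newname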

@@ -0,0 +1,169 @@
{
Contributor:

I imagine you probably have a lot of templates at the moment - maybe it's time for some refactoring?

For example, a lot of fields are probably repeated across templates - maybe one template to rule them all, with small templates that only overwrite specific bits. This would also avoid spreading typos and slightly different entries, and would help maintenance (imagine something needs to change - you would have to remember to change all the templates...).
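
A sketch of the one-base-template idea (the file names are hypothetical):

import json

with open('base_template.json') as f:
    template = json.load(f)
with open('velocity_overrides.json') as f:
    template.update(json.load(f))  # product-specific entries overwrite the base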

Contributor Author:

it is in the plan

@ocehugo (Contributor) commented Feb 4, 2020:

@diodon, don't forget to fix the aggregate_timeseries part - tests are not passing and there are some warnings about duplicated fill_values in the ncwriter.

@mhidas (Contributor) left a comment:

👍 Looks pretty good. Just a few more comments from me :)

@mhidas (Contributor) left a comment:

Added a few comments on the documentation.


## Objective

This product provides aggregated U, V, and W velocity time-series files for each mooring site, without any interpolation or filtering, except for the exclusion of the out-of-water data. For the ADCP instruments, the absulte depth of the measuring cell is calculated using the `DEPTH` measured at the instrument and the `HEIGHT_ABOVE_SENSOR`, The Quality Control (QC) flags are preserved. All the (python) code used for the generation of the products is openly available on GitHub.
Contributor:

Suggested change:
- This product provides aggregated U, V, and W velocity time-series files for each mooring site, without any interpolation or filtering, except for the exclusion of the out-of-water data. For the ADCP instruments, the absulte depth of the measuring cell is calculated using the `DEPTH` measured at the instrument and the `HEIGHT_ABOVE_SENSOR`, The Quality Control (QC) flags are preserved. All the (python) code used for the generation of the products is openly available on GitHub.
+ This product provides aggregated U, V, and W velocity time-series files for each mooring site, without any interpolation or filtering, except for the exclusion of the out-of-water data. For the profiling (ADCP) instruments, the absolute depth of the measuring cell is calculated using the `DEPTH` measured at the instrument and the `HEIGHT_ABOVE_SENSOR`. The Quality Control (QC) flags are preserved.

Contributor Author:

changed

@mhidas (Contributor) commented Feb 25, 2020:

You applied my documentation suggestions to the wrong file (aggregated_timeseries.md instead of velocity_aggregated_timeseries.md)!

},
"CELL_INDEX": {
    "long_name": "index of the corresponding measuring cell",
    "comment": "Cell index is included for reference only and cannot be used to extract values at constant depth. The number and vertical spacing of cells can vary by instrument and deployment. The vertical spacing also varies with time during a deployment. The closest cell to the sensor has index 0."
Contributor:

@diodon @ggalibert Do you think this is clear enough?

@mhidas (Contributor) commented Feb 25, 2020:

Ok, I think this is good enough for a first go!

@mhidas mhidas merged commit 181a342 into master Feb 25, 2020
@mhidas mhidas deleted the velocity_aggregated branch February 25, 2020 05:16