Velocity aggregated time series product #92
Conversation
with nc4.Dataset(file, 'r') as ds:
    time_start.append(np.datetime64(ds.time_deployment_start))
tuples = sorted(zip(time_start, files_to_agg))
return [t[1] for t in tuples]
- Change files_to_agg to files, file_list or file_str_list: there is no aggregation here, and the name should convey the type intention. Similarly, change file to filestr in the loop.
- return [file for _, file in tuples] is clearer.
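The suggested renames and comprehension could be combined into a sketch like this. The get_start callable is a hypothetical stand-in for reading time_deployment_start with netCDF4, so the sort logic can run without real files:

```python
import numpy as np

def sort_files(file_list, get_start):
    """Return file_list sorted chronologically by deployment start time.

    get_start maps a file path to its start timestamp. In the real code it
    would open the file with netCDF4 and read time_deployment_start;
    factoring it out keeps the sort logic testable without files.
    """
    pairs = sorted(zip((get_start(f) for f in file_list), file_list))
    return [filestr for _, filestr in pairs]

# usage with hypothetical in-memory timestamps
starts = {'b.nc': np.datetime64('2020-03-01'),
          'a.nc': np.datetime64('2019-06-15')}
print(sort_files(['b.nc', 'a.nc'], starts.get))  # ['a.nc', 'b.nc']
```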
changed
allowed_dimensions = ['TIME', 'LATITUDE', 'LONGITUDE', 'HEIGHT_ABOVE_SENSOR']
required_variables = ['UCUR', 'VCUR', 'WCUR']
error_list = []
Wouldn't allowed_dimensions and required_variables be better as global constants (outside the function scope)?
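A minimal sketch of the module-level constants idea; check_file_dimensions is a hypothetical helper, not the reviewed code:

```python
# Module-level constants: ALL_CAPS names signal fixed configuration that
# every function in the module can share without redefining per call.
ALLOWED_DIMENSIONS = ['TIME', 'LATITUDE', 'LONGITUDE', 'HEIGHT_ABOVE_SENSOR']
REQUIRED_VARIABLES = ['UCUR', 'VCUR', 'WCUR']

def check_file_dimensions(dims):
    """Return the dimensions that are not in the allowed set (hypothetical helper)."""
    return [d for d in dims if d not in ALLOWED_DIMENSIONS]

print(check_file_dimensions(['TIME', 'DEPTH']))  # ['DEPTH']
```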
required_variables = ['UCUR', 'VCUR', 'WCUR']
error_list = []

nc_site_code = nc.site_code
I think using "nc.site_code" directly in the if is better - you are not adding any information or scope by naming it again.
ok
if nc_site_code != site_code:
    error_list.append('Wrong site_code: ' + nc_site_code)

nc_file_version = nc.file_version
same as above
changed
return error_list


def get_nvalues(nc):
nvalues means nothing here, so the intent is not clear. For example, get_number_of_vertical_cells is much better: I don't even need to read the docstring.
Actually, why not just return the HEIGHT and TIME variables? The intent would be clearer at the call site. The name would be get_dim_len or similar.
I've renamed the function and variables for clarity. The objective is to return the number of values that result from flattening the grid.
see comment below about this function - it's not necessary.
return nvalues, nbins


def get_varvalues(nc, varname):
def flat_variable is clearer IMO.
Agree, changed
return '; '.join([nc.deployment_code, nc.instrument, nc.instrument_serial_number])


def in_water(nc):
You don't need to pass the "nc" object as an argument; you just need an array and two strings. Hence, if it is defined as def get_in_water(time, start_str, end_str), the call would be:

time = nc['TIME']
in_water_ind = get_in_water(time, nc['time_coverage_start'], nc['time_coverage_end'])
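The proposed signature could look like this minimal sketch; the numpy datetime64 inputs and the boolean-mask return are assumptions, not the reviewed implementation:

```python
import numpy as np

def get_in_water(time, start_str, end_str):
    """Boolean mask of timestamps within [start, end].

    time      : numpy array of datetime64 values
    start_str : ISO 8601 string, e.g. the time_coverage_start attribute
    end_str   : ISO 8601 string, e.g. the time_coverage_end attribute
    """
    start = np.datetime64(start_str)
    end = np.datetime64(end_str)
    return (time >= start) & (time <= end)

time = np.array(['2019-01-01', '2019-06-01', '2020-01-01'], dtype='datetime64[D]')
mask = get_in_water(time, '2019-02-01', '2019-12-31')
print(mask)  # [False  True False]
```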
I just noticed that the variable name here is confusing: you use nc to refer to an xarray dataset. Just rename the variable to xrobj or xobj. I made the comment above expecting the arguments to be netCDF4 objects. This is a case where type hints shine.
nc for xarray and ds for netCDF4 Dataset
Also, you can now just import this function from aggregated_timeseries (and eventually from the utils/common module).
def in_water(nc):
    """
    cut data to in-water only timestamps, dropping the out-of-water records.
"cut the entire dataset to"
added
:return: name of the resulting file, list of rejected files
"""

varlist = ['UCUR', 'VCUR', 'WCUR', 'DEPTH']
This block of vars can be promoted to keyword arguments and assigned dynamically, if required:

agg_options = {'varlist': [...], 'time_units': ...}  # or even a global dict
aggregate_velocity(..., varlist=[...], time_units=...)  # or **agg_options
Could be, but we don't expect to aggregate anything apart from those variables.
Anyway, this is good practice both code-wise and intent-wise. For example, a reader may ask why the function "aggregate_velocity" does not provide the option to select which velocity variables to aggregate.
Code-wise, default options should rarely be defined in the scope of the function. Also, if you expose them as kwargs and need to change the parameters, you just change the function call instead of changing the function code!
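A sketch of promoting the defaults to keyword arguments; aggregate_velocity here is a simplified stand-in that just echoes its configuration, and the option values are illustrative:

```python
# Defaults live at module level instead of inside the function body.
DEFAULT_AGG_OPTIONS = {
    'varlist': ['UCUR', 'VCUR', 'WCUR', 'DEPTH'],
    'time_units': 'days since 1950-01-01 00:00:00 UTC',
}

def aggregate_velocity(files, site_code,
                       varlist=DEFAULT_AGG_OPTIONS['varlist'],
                       time_units=DEFAULT_AGG_OPTIONS['time_units']):
    # the real implementation would aggregate here; we just echo the config
    return {'files': files, 'site': site_code,
            'varlist': varlist, 'time_units': time_units}

# callers can override without touching the function code
result = aggregate_velocity(['a.nc'], 'NRSROT', varlist=['UCUR', 'VCUR'])
print(result['varlist'])  # ['UCUR', 'VCUR']
```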
rejected_files = []

# default name for temporary file. It will be renamed at the end
outfile = 'Velocity_agg_tmp.nc'
Use random names - if more than one thread runs at the same time, things can get ugly.
changed to UUID name
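The UUID-based temporary name could look like this; make_temp_filename is a hypothetical helper name:

```python
import uuid

def make_temp_filename(prefix='Velocity_agg'):
    """Return a unique temporary netCDF file name, safe for concurrent runs."""
    return '{}_{}.nc'.format(prefix, uuid.uuid4().hex)

name = make_temp_filename()
print(name)  # e.g. Velocity_agg_<32 hex chars>.nc, unique per call
```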
outfile = 'Velocity_agg_tmp.nc'

## sort the file list in chronological order
files_to_agg = sort_files(files_to_agg)
sorted_files is better.
changed
for file in files_to_agg:
    with xr.open_dataset(file) as nc:
        ## clip to in water data only
        nc = in_water(nc)
in_water_only = in_water(nc)
or similar
Don't hide the state mutation. In other words, do not reuse the name if the state changed. There are no copies here, only views.
Are you sure it's only a view? The documentation (of Dataset.where, used by the in_water function) is not quite clear, but I reckon it returns a new Dataset object. This is quite a waste if you just want to count the number of data values. Maybe it's not a big deal in terms of execution time, but in any case you could move the clipping into the if not error_list: clause so you're not applying it to files you then reject.
## clip to in water data only
nc = in_water(nc)

varlen_file.append(get_nvalues(nc))
You actually don't need a 15-line get_nvalues function to execute two lines:

DIM_X = 1 if 'HEIGHT_ABOVE_SENSOR' not in nc else nc['HEIGHT_ABOVE_SENSOR'].size
file_size = DIM_X * nc['TIME'].size
nc = in_water(nc)

varlen_file.append(get_nvalues(nc))
error_list = check_file(nc, site_code)
Check first: move this line and the if statements to the top.
    varlen_list.append(get_nvalues(nc)[0])
else:
    bad_files.append([file, error_list])
    rejected_files.append(file)
You can create a good_files list instead and get rid of the loop in line 175. You can print the bad files, or just derive them when reporting: [x for x in files_to_agg if x not in good_files]
files_to_agg.remove(file[0])


varlen_list = [0] + varlen_list
check emptiness instead of doing this.
zero is needed as the start index in the main loop.
After going down and back again I understand what this variable is. Hence, you should rename it: it's not a varlen_list, it's an index list.
Also, you could pre-compute the start/end indexes (lines 214/215) here instead and use them in the for loop (also easier to debug the index ranges).
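The pre-computed start/end indexes could be derived once with a cumulative sum; the per-file sizes shown are hypothetical:

```python
import numpy as np

# Given the number of flattened values contributed by each file,
# np.cumsum yields the start/end offsets into the aggregated
# OBSERVATION axis directly.
varlen_list = [3, 5, 2]                 # hypothetical per-file sizes
offsets = np.cumsum([0] + varlen_list)  # [0, 3, 8, 10]
index_ranges = [(int(a), int(b)) for a, b in zip(offsets[:-1], offsets[1:])]
print(index_ranges)  # [(0, 3), (3, 8), (8, 10)]
```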
ds = nc4.Dataset(os.path.join(base_path, outfile), 'w')
OBSERVATION = ds.createDimension('OBSERVATION', size=varlen_total)
INSTRUMENT = ds.createDimension('INSTRUMENT', size=n_files)
Create a dictionary with options:

obs_float_template = {'datatype': 'float', 'zlib': True, 'dimensions': ('OBSERVATION',), 'fill_value': 99999.0}
obs_byte_template = {'datatype': 'byte', 'zlib': True, 'dimensions': ('OBSERVATION',), 'fill_value': 99}
...
vdefs['UCUR'] = {'varname': 'UCUR', **obs_float_template}
vdefs['UCURqc'] = {'varname': 'UCURqc', **obs_byte_template}
...
for vname, vopts in vdefs.items():
    vars[vname] = ds.createVariable(**vopts)
if 'WCUR' in nc.data_vars:
    WCUR[start:end] = get_varvalues(nc, 'WCUR')
    WCURqc[start:end] = get_varvalues(nc, 'WCUR_quality_control')
else:
There can be a case where WCUR is only found in some files...I would imagine that, if I asked for WCUR in varlist, I would have a partially filled vector in the case of missing variables in some files.
start = sum(varlen_list[:index + 1])
end = sum(varlen_list[:index + 2])
n_cells = get_nvalues(nc)[1]
Again: it's not easy to guess what the second value returned by get_nvalues is, because the function name is not useful (you have to dig into the docstring).
solved
DEPTH[start:end] = nc.DEPTH.values
DEPTHqc[start:end] = nc.DEPTH_quality_control.values
## set TIME and instrument index
TIME[start:end] = (np.repeat(get_varvalues(nc, 'TIME'), n_cells) - epoch) / one_day
The expression inside the parentheses is a bit long. Since this is an assignment operation, I would do without the big parentheses:

TIME[start:end] = np.repeat(...)/one_day - epoch/one_day

The intent of the normalization is much clearer (and it should be slightly quicker).
## get and store deployment metadata
LATITUDE[index] = nc.LATITUDE.values
LONGITUDE[index] = nc.LONGITUDE.values
NOMINAL_DEPTH[index] = TStools.get_nominal_depth(nc)
I don't understand the use of this package (TStools). You are only getting netCDF attributes; do you really need to import this package to do that?
the package contains several utility functions. I use three of them in the code. In the future a refactored version of it will be a package with functions used by all product-generating scripts.
IMO it's not required -> it's just noise for very little gain
get_nominal_depth returns the value of the NOMINAL_DEPTH variable, if it exists, and the global attribute instrument_nominal_depth otherwise. This is used in all of the products, so it makes sense to have it in a common module.
ds[var].setncatts(variable_attribute_dictionary[var])

## set global attrs
timeformat = '%Y-%m-%dT%H:%M:%SZ'
Another two options that should be kwargs.
timeformat = '%Y-%m-%dT%H:%M:%SZ'
file_timeformat = '%Y%m%d'

time_start = nc4.num2date(np.min(TIME[:]), time_units, time_calendar).strftime(timeformat)
Make this block a function:

set_global_attrs(ds, time_units, time_calendar, time_format, ...)

or with an options dict:

options = {...}
set_global_attrs(ds, **options)
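One possible shape for the suggested helper, using the **options pattern; the attribute names, the site code, and the _Empty stand-in are illustrative:

```python
def set_global_attrs(ds, **options):
    """Write each keyword option as a global attribute on the dataset."""
    for name, value in options.items():
        setattr(ds, name, value)
    return ds

class _Empty:
    """Stand-in for a writable netCDF4.Dataset, for illustration only."""
    pass

ds = set_global_attrs(_Empty(), site_code='NRSROT',
                      time_coverage_start='2019-01-01T00:00:00Z')
print(ds.site_code)  # NRSROT
```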
ds.close()


## create the output file name and rename the tmp file
Another case for a function: output_name = rename_file(oldname, newname). Also, you need to check for overwrites (renaming to an already existing name).
Personally, I don't think you need to use "tmp" files or to rename at all - just do the kind of checks you are doing here and above before processing/concatenating the files (reading attributes, etc). If you think of these steps (check stuff, create a valid name) as a validation step, then you only spend time on valid inputs.
For example, what is the use of an entire concatenated array in memory if you can't find the site_code, facility_code, contributors, etc? Check the easy things first...
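A sketch of the suggested rename_file helper with an overwrite check; the refusal-to-overwrite policy and the file names in the demo are assumptions:

```python
import os
import tempfile

def rename_file(oldname, newname):
    """Rename oldname to newname, refusing to overwrite an existing file."""
    if os.path.exists(newname):
        raise FileExistsError('refusing to overwrite ' + newname)
    os.rename(oldname, newname)
    return newname

# demo on a throwaway temp directory
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'Velocity_agg_tmp.nc')
open(src, 'w').close()
dst = os.path.join(tmpdir, 'velocity_aggregated.nc')
print(rename_file(src, dst) == dst)  # True
```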
@@ -0,0 +1,169 @@
{
I imagine you probably have a lot of templates at the moment - maybe time for refactoring?
For example, a lot of fields are probably repeated across templates - maybe one template to rule them all, while small templates only contain the bits to overwrite. This would also avoid spreading typos and slightly different entries, and help maintenance (imagine something needs to change - you have to remember to change all the templates...).
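The base-template-plus-overrides idea might look like this; the keys and values are illustrative, not the real template contents:

```python
# One base template holds the shared fields; each product template only
# states what differs, so a shared fix happens in exactly one place.
BASE_TEMPLATE = {
    'project': 'Integrated Marine Observing System (IMOS)',
    'naming_authority': 'IMOS',
    'file_version': 'Level 1 - Quality Controlled Data',
}

def make_template(**overrides):
    """Return the base template with product-specific fields merged in."""
    return {**BASE_TEMPLATE, **overrides}

velocity_template = make_template(title='Aggregated velocity time series')
print(velocity_template['naming_authority'])  # IMOS
print(velocity_template['title'])             # Aggregated velocity time series
```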
it is in the plan
@diodon, don't forget to fix the aggregate_timeseries part - tests are not passing and there are some warnings about duplicated fill_values in the ncwriter.
👍 Looks pretty good. Just a few more comments from me :)
Co-Authored-By: Marty Hidas <[email protected]>
Added a few comments on the documentation.
## Objective

This product provides aggregated U, V, and W velocity time-series files for each mooring site, without any interpolation or filtering, except for the exclusion of the out-of-water data. For the ADCP instruments, the absulte depth of the measuring cell is calculated using the `DEPTH` measured at the instrument and the `HEIGHT_ABOVE_SENSOR`, The Quality Control (QC) flags are preserved. All the (python) code used for the generation of the products is openly available on GitHub.
Suggested change:

This product provides aggregated U, V, and W velocity time-series files for each mooring site, without any interpolation or filtering, except for the exclusion of the out-of-water data. For the profiling (ADCP) instruments, the absolute depth of the measuring cell is calculated using the `DEPTH` measured at the instrument and the `HEIGHT_ABOVE_SENSOR`. The Quality Control (QC) flags are preserved.
changed
You applied my documentation suggestions to the wrong file (aggregated_timeseries.md instead of velocity_aggregated_timeseries.md)!
},
"CELL_INDEX": {
    "long_name": "index of the corresponding measuring cell",
    "comment": "Cell index is included for reference only and cannot be used to extract values at constant depth. The number and vertical spacing of cells can vary by instrument and deployment. The vertical spacing also varies with time during a deployment. The closest cell to the sensor has index 0."
@diodon @ggalibert Do you think this is clear enough?
Ok, I think this is good enough for a first go!
This product flattens UCUR, VCUR and WCUR and references the values to their TIME and absolute DEPTH. The values are aggregated from all deployments at one site into an indexed ragged array structure, with OBSERVATION and INSTRUMENT as the sole dimensions.