Add new station names to raw data files #43

peterdesmet · 2015-09-02T12:33:33Z

1. For all raw data files, populate StationName with the correct values.
2. After manual verification, move the files to the verified folder.

@bartaelterman, what would be the best approach to do step one with a script?

peterdesmet · 2015-09-02T13:12:41Z

file	old name	new name
VR2C69_450114_20150728_1.csv	B05 Belwind	bpns-B05BELWIND
VR2W_110779_20150626_1.csv	VG-2	bpns-VG2
VR2W_110783_20150626_1.csv	S4	bpns-S4
VR2W_110784_20150626_1.csv	WK12	bpns-WK12
VR2W_112295_20150615_1.csv	s-8-1	NOT FOUND
VR2W_113521_20150615_1.csv	s-4-1	NOT FOUND
VR2W_113528_20150615_1.csv	s-9-1	NOT FOUND
VR2W_115428_20150625_1.csv	O 6	NOT FOUND
VR2W_115428_20150625_2.csv	O 6	NOT FOUND
VR2W_115430_20150615_1.csv	s-5-1	NOT FOUND
VR2W_115441_20150615_1.csv	s-6-1	NOT FOUND
VR2W_115442_20150615_1.csv	s-7-1	NOT FOUND
VR2W_119047_20150609_1.csv		ma-8
VR2W_119048_20150609_1.csv		ma-6
VR2W_119049_20150609_1.csv		ma-9
VR2W_119052_20150609_1.csv		ma-7
VR2W_119056_20150609_1.csv		ma-5
VR2W_119057_20150609_1.csv		ma-1
VR2W_120092_20150609_1.csv	120092	ma-4
VR2W_120092_20150609_2.csv	120092	ma-4
VR2W_120092_20150609_3.csv	120092	ma-4
VR2W_120095_20150609_1.csv	120095	ma-2
VR2W_120873_20150901_1.csv	ws-pvtss	ws-PVTSS
VR2W_122325_20150615_1.csv	S-3-1	NOT FOUND
VR2W_122339_20150616_1.csv	s-4c-1	NOT FOUND
VR2W_122363_20150615_1.csv	S-4A-1	NOT FOUND
VR2W_122367_20150616_1.csv	s-4b-1	NOT FOUND
VR2W_123823_20150626_1.csv	WK14	bpns-WK14
VR2W_123824_20150626_1.csv	W1	bpns-W1
VR2W_123826_20150626_1.csv	WZ	bpns-WZ
VR2W_123829_20150626_1.csv	S7	bpns-S7
VR2W_126194_20150520_1.csv	126194	ak-42
VR2W_126194_20150824_1.csv	126194	ak-42
VR2W_126195_20150520_1.csv	126195	ak-41
VR2W_126196_20150824_1.csv	126196	ak-44
VR2W_126197_20150824_1.csv	126197	ak-45

peterdesmet · 2015-09-02T13:13:39Z

Files marked NOT FOUND: I couldn't find the station name directly. I will look those up with the receiver code.

peterdesmet · 2015-09-02T13:14:48Z

I noticed that the NOT FOUND ones are dependent on the date. @PieterjanVerhelst, maybe it is better if you map those.

PieterjanVerhelst · 2015-09-02T13:56:23Z

I filled in the new names (last column):
file	old name	new name
VR2W_112295_20150615_1.csv	s-8-1	s-8
VR2W_113521_20150615_1.csv	s-4-1	s-4
VR2W_113528_20150615_1.csv	s-9-1	s-9
VR2W_115428_20150625_1.csv	O 6	bpns-OH6
VR2W_115428_20150625_2.csv	O 6	bpns-OH6
VR2W_115430_20150615_1.csv	s-5-1	s-5
VR2W_115441_20150615_1.csv	s-6-1	s-6
VR2W_115442_20150615_1.csv	s-7-1	s-7
VR2W_122325_20150615_1.csv	S-3-1	s-3
VR2W_122339_20150616_1.csv	s-4c-1	s-4c
VR2W_122363_20150615_1.csv	S-4A-1	s-4a
VR2W_122367_20150616_1.csv	s-4b-1	s-4b

bartaelterman · 2015-09-03T09:28:01Z

I have some code to read raw input files. It detects the format based on the headers, so it knows in which columns the values need to be changed. I also have a command line script that does the aggregation. So I would suggest updating these scripts so that they change the station names too and write the resulting files to the verified folder. I'll leave the files in the Raw folder, so after manual validation, you can remove them.

bartaelterman · 2015-09-03T10:19:43Z

@peterdesmet what do you mean by "they are dependent on the date"?
I cannot map empty old names to the new names ma-8, ma-6, ma-9, etc. What do those names mean?

PieterjanVerhelst · 2015-09-03T21:45:01Z

@bartaelterman: let me know when the raw files have been verified. Afterwards I will check them in the raw folder and delete them.
Regarding the Meuse data (ma-8, ma-6, etc.), probably no station name was given to the receiver when it was deployed (I don't have that data on my pc, but I'll check on Monday at INBO). Is it possible to add the new station name based on receiver ID for those files?

bartaelterman · 2015-09-04T06:34:50Z

Yes @PieterjanVerhelst, I can add the new station name based on receiver id. The receiver id is:

take the file name
split on "_"
take the first two fields
join them with a "-"

?
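
The four steps proposed above can be sketched in Python as follows. This is only an illustration of the proposal, not the actual script; the function name `receiver_id_from_filename` is hypothetical, and stripping the `.csv` extension first is an added assumption that does not affect the first two fields.

```python
from pathlib import Path


def receiver_id_from_filename(filename: str) -> str:
    """Build a receiver id such as 'VR2W-119047' from a raw file name."""
    stem = Path(filename).stem      # drop any directory and the .csv extension
    fields = stem.split("_")        # split on "_"
    if len(fields) < 2:
        raise ValueError(f"Unexpected file name: {filename!r}")
    return "-".join(fields[:2])     # join the first two fields with "-"


print(receiver_id_from_filename("VR2W_119047_20150609_1.csv"))  # VR2W-119047
```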

PieterjanVerhelst · 2015-09-08T19:17:36Z

There are some errors in the metadata about the Meuse receivers (ma-x); I am trying to correct them by the end of the week.
I am reconsidering my recommendation to add a new station name based on receiver id. As the station_name reflects a location, I think it would be better to add the new station name based on the coordinates (there are also some wrong coordinates in the metadata; as soon as they are corrected, I'll post them here).

PieterjanVerhelst · 2015-09-11T08:00:32Z

Here are the receiver ids with the matching station names. Some receivers were removed and replaced by another receiver for the same location and got deploy number '2'.

receiver_id is_active station_name deploy_number
VR2W-119047 FALSE ma-8 1
VR2W-119048 FALSE ma-6 1
VR2W-119049 FALSE ma-9 1
VR2W-119052 FALSE ma-7 1
VR2W-119056 FALSE ma-5 1
VR2W-119057 FALSE ma-1 1
VR2W-120092 FALSE ma-4 1
VR2W-120095 FALSE ma-2 1
VR2W-122324 FALSE ma-3 1
VR2W-124065 TRUE ma-5 2
VR2W-124066 TRUE ma-4 2
VR2W-124076 TRUE ma-2 2
VR2W-124078 TRUE ma-1 2
VR2W-122324 TRUE ma-3 2

bartaelterman · 2015-09-11T08:04:31Z

Some receivers were removed and replaced by another receiver for the same location and got deploy number '2'

Did the opposite happen too? A receiver was redeployed on a different location?

PieterjanVerhelst · 2015-09-11T08:08:31Z

It was removed and deployed again, but still at the same location (ma-3)

bartaelterman · 2015-09-11T08:19:56Z

Ok. So the station code stays the same then.

I'll write a script to substitute the station names and will send you a file this afternoon.

bartaelterman · 2015-09-15T09:41:12Z

I am wondering...

Replacing the old station names by new ones is ok. After some iterations, no old station names will be found in the input anymore, and this step will eventually become obsolete.

Setting the station name based on the receiver id works differently since the receiver id always stays the same. So this action will always remain active. If at some point in the future we do move one receiver from one station to another, we are in trouble. Are we absolutely sure this is how we want to process the raw data?

PieterjanVerhelst · 2015-09-15T09:57:28Z

In the future, the correct station name will come with the csv file, so this step will be unnecessary. I would not set the station name based on the receiver ID, for the above-mentioned reason: receivers will be relocated in the future (old projects end, new ones arise). The only exception is the Meuse receivers (see above): no station name was given to those receivers, so the only info we have in the csv file is the receiver ID.

bartaelterman · 2015-09-15T11:08:05Z

That's exactly my point. I cannot implement this exceptional case in a script. Setting the station name of these receivers will need to be done manually.

PieterjanVerhelst · 2015-09-15T11:35:08Z

I will change this in the csv files and drop them in verified folder. Afterwards, I will delete them from the Raw folder.

PieterjanVerhelst · 2015-09-15T18:28:47Z

The station names for the Meuse receivers were added and the files are in the Raw folder (as well as the original files). If ok for you, I will delete the old files.

peterdesmet · 2015-09-24T12:44:06Z

Where are we with this step? Anything I need to do?

peterdesmet · 2015-09-24T12:45:30Z

I also noticed duplicates in the raw folder: the csv file and a Google spreadsheet of the same csv file. Can the Google spreadsheets be removed?

PieterjanVerhelst · 2015-09-24T12:59:31Z

Indeed, the Google spreadsheets can be removed.

peterdesmet · 2015-09-24T13:01:22Z

Now removed.

bartaelterman · 2015-09-25T13:22:42Z

Almost there.

I have a couple of files that don't contain a station name, only a receiver id. For at least the following ids, I would need a new station name.

VR2W-119047
VR2W-119048
VR2W-119049
VR2W-119052
VR2W-119056
VR2W-119057

These can be added to the station names file in the receiver_id and new_name columns. @peterdesmet can you add these?
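
As a rough sketch of how such a station names file could be used: the column names `receiver_id` and `new_name` come from the comment above, while the `old_name` column, the sample rows' layout, and the function names are assumptions for illustration. The receiver id is only consulted when a file carries no old station name, which matches the exceptional case discussed earlier in this thread.

```python
import csv
import io

# Hypothetical excerpt of station_names.csv; the real file layout may differ.
SAMPLE = """old_name,receiver_id,new_name
WK12,,bpns-WK12
s-8-1,,s-8
,VR2W-119047,ma-8
,VR2W-119048,ma-6
"""


def load_mappings(handle):
    """Read the mapping file into two dicts: by old name and by receiver id."""
    by_old_name, by_receiver = {}, {}
    for row in csv.DictReader(handle):
        if row["old_name"]:
            by_old_name[row["old_name"]] = row["new_name"]
        if row["receiver_id"]:
            by_receiver[row["receiver_id"]] = row["new_name"]
    return by_old_name, by_receiver


def resolve_station(old_name, receiver_id, by_old_name, by_receiver):
    """Map on the old station name; use the receiver id only if the name is empty."""
    if old_name:
        if old_name in by_old_name:
            return by_old_name[old_name]
    elif receiver_id in by_receiver:
        return by_receiver[receiver_id]
    raise KeyError(f"No mapping for station {old_name!r} / receiver {receiver_id!r}")


by_old_name, by_receiver = load_mappings(io.StringIO(SAMPLE))
print(resolve_station("WK12", "VR2W-110784", by_old_name, by_receiver))
print(resolve_station("", "VR2W-119047", by_old_name, by_receiver))
```

Raising an error for anything unmapped keeps unknown stations from slipping through silently.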

peterdesmet · 2015-09-25T13:31:35Z

Done. See commit above.

bartaelterman · 2015-09-25T14:06:59Z

With that, I can validate all data in the Raw folder. How shall we go from here:

1. Is the 2. Verified folder still needed?
2. @PieterjanVerhelst asked for a merged but not aggregated file. Shall we give these temporary dumps a permanent place? Where? I made a folder 2a. Verified+Consolidated and will give the file a name including the current timestamp. Is that ok?
3. Shall I run a new aggregation job and update the data in 3. Aggregated?

peterdesmet · 2015-09-25T14:13:33Z

1. No longer needed, I think.
2. I wouldn't give them a permanent place. Maybe a tmp folder? The file name you propose seems OK to me.
3. Yep, you can update the aggregated data. I would also drop them in tmp.

If @PieterjanVerhelst agrees with the above, I would just have:

raw
tmp

PieterjanVerhelst · 2015-09-27T09:49:38Z

@bartaelterman: do you have the new station names for the above-mentioned receivers, now that they are in the file station_names.csv?

Some receivers still do not have the correct station name assigned (those from the Meuse and Albert Canal). Therefore, it would be better to keep the raw and verified folders.
Indeed, I would like to have a file with all detections. Maybe we can order the folders in the following way:
1. Raw
2. Verified
3. Consolidated
4. Aggregated
  I am thinking of doing the aggregation in a pre-analysis phase instead of post-processing (the aggregation can then be performed depending on the research question). So this folder could become unnecessary in the future.
Ok

bartaelterman · 2015-09-28T08:07:04Z

@PieterjanVerhelst Jep, I have the station names for the above mentioned receivers.

1. @PieterjanVerhelst how would you suggest doing the mapping then? Would you manually edit certain files? I would rather keep everything in the script. Then the cleaning all happens in one place and everything is reproducible. So can you document exactly how the mapping should be updated?
2. @peterdesmet you would not give these files a permanent place? So you disagree with @PieterjanVerhelst's suggestion?
3. OK, we'll see where I drop them, depending on 2.

PieterjanVerhelst · 2015-09-28T18:43:00Z

This can be done manually, as there are only a few receivers. We could remove the 'Raw' folder, but then there is a chance that a file without the correct station name and coordinates ends up in the 'Verified' folder. However, if the script can raise an error and indicate which file/receiver has an incorrect station name, the problem can be solved quickly.
I discussed this today with Jan Reubens and we both think that the aggregation should be part of the pre-analysis of the data. However, for now we would like to have both consolidated and aggregated data to find out what works best.

bartaelterman · 2015-09-29T07:03:13Z

If it can be done automatically (by the script) I'd rather do it that way. The script currently checks the station name. It needs to be filled in and be of the form text-textOrNumbers (or in regular expression syntax: ^[a-zA-Z]+-[0-9a-zA-Z]+$). If something does not match those criteria, an error is raised. I can add more checks if you like, if that could confidently get us to a situation where we don't have to touch files manually.
Yes, it's no problem for me where the aggregation happens. For now, let's stick with the directories 1. Raw, 2. Consolidated, 3. Aggregated. (I'm dropping Verified: since everything happens automatically, there is no manual verification in the process anymore.)
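
For illustration, a minimal sketch of the station name check described above, using the regular expression quoted in this comment. The function name `check_station_name` and the `source` parameter are hypothetical; the actual script may report errors differently.

```python
import re

# Pattern quoted in the comment above: text, a hyphen, then text or numbers.
STATION_NAME_PATTERN = re.compile(r"^[a-zA-Z]+-[0-9a-zA-Z]+$")


def check_station_name(name: str, source: str = "<unknown file>") -> str:
    """Return the name unchanged if it is valid, otherwise raise an error."""
    # fullmatch also rejects a name that ends in a stray newline
    if not name or not STATION_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"{source}: invalid station name {name!r}")
    return name


check_station_name("bpns-WK12", "VR2W_110784_20150626_1.csv")  # passes
check_station_name("ma-8", "VR2W_119047_20150609_1.csv")       # passes
```

Old-style names from this thread such as `O 6`, `s-8-1` or `120092` would be rejected by this pattern, as would an empty station name.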

peterdesmet · 2015-09-29T07:14:50Z

Agree with @bartaelterman: let's try to cover everything with a script if we can.

Also agree to drop Verified, as it is the output of raw + script, so it can always be done again.

bartaelterman · 2015-09-29T07:17:24Z

So @PieterjanVerhelst can you add the mapping of the missing receivers to station_names.csv?

peterdesmet · 2015-09-29T07:24:43Z

To clarify, those that you mentioned in:

Still some receivers do not have the correct station name applied to the receiver (those from the Meuse and Albert Channel).

PieterjanVerhelst · 2015-09-29T09:12:28Z

Agreed, we can use the script. I added the receivers with the correct station name. When such a file is uploaded, neither a station name nor coordinates come with it; only the receiver ID. Therefore the mapping can only happen on receiver ID in these cases.
Regarding the VLIZ database, we would put the files located in the Verified folder in the VLIZ database. Maybe these can be moved to the raw folder, where all the separate csv files are present?

bartaelterman · 2015-09-29T10:41:18Z

Ok
Wouldn't it be more interesting to put the files coming from the 2. Consolidated folder in the VLIZ database? Then they only have 1 format to worry about. All dates will be in the same format and all data is checked.

PieterjanVerhelst · 2015-09-29T10:48:27Z

Consolidated would contain the separate csv files? In that case ok. I thought the concatenated file without aggregation would be dropped in that folder.

bartaelterman · 2015-09-29T10:56:47Z

Consolidated would contain 1 file with all records in 1 format. Not aggregated, but not separate files.

Why is it important to have separate files?

PieterjanVerhelst · 2015-09-29T11:32:19Z

I don't know. I will check this with VLIZ.

PieterjanVerhelst · 2015-10-02T13:08:22Z

It would be important to have the separate verified files, so we can check whether data is missing from the consolidated file.
I just spoke to Robin, who is building the VLIZ database. The database is ready, so if the files in the Raw folder could be transformed into verified files, they can be loaded into the database as a test.
Regarding data flow, the final system would work as follows: 'raw' csv files are dropped on an online interface. Each file is then processed into a 'verified' file (by coupling data with metadata; no script needed), which is loaded into the database. As such, no consolidated file is needed.
I think we should have a second meeting with Robin after he has tested the database with the verified files from the Drive.

bartaelterman mentioned this issue Sep 25, 2015

Verify script mapping of historical data #44

Closed

peterdesmet added a commit that referenced this issue Sep 25, 2015

Add mapping for specific receivers

94189da

See #43 (comment) 3219761 There are additional receivers in the Google Spreadsheet without a station, but didn’t add them (yet).
