Support timezones when reading and writing datetime columns #123

Open
theroggy opened this issue Jun 7, 2022 · 10 comments

@theroggy
Member

theroggy commented Jun 7, 2022

Timezone information is ignored at the moment, both for reading and for writing.

See https://github.com/Toblerity/Fiona/blob/791785b813fce97e5c3d60bd725fa2190645d2e3/fiona/ogrext.pyx#L401-L404 and https://github.com/Toblerity/Fiona/blob/791785b813fce97e5c3d60bd725fa2190645d2e3/fiona/ogrext.pyx#L431-L435 for how Fiona handles this (and https://gdal.org/development/rfc/rfc56_millisecond_precision.html?highlight=TZFlag#core-changes for a very brief description of the GDAL side).

The PR that introduced this in fiona: Toblerity/Fiona#915
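
For context, GDAL's RFC 56 encodes the timezone in a single TZFlag integer: 0 means unknown, 1 means local time, 100 means GMT, and other values are offsets from GMT in 15-minute steps. A minimal sketch of that encoding in Python (the helper names are mine, not GDAL API):

    from datetime import timedelta

    # GDAL TZFlag convention (RFC 56): 0 = unknown, 1 = local time,
    # 100 = GMT/UTC, 100 + n = UTC offset of n * 15 minutes.
    def tzflag_to_offset(tzflag):
        """Decode a GDAL TZFlag into a UTC offset (None if unknown or local)."""
        if tzflag in (0, 1):
            return None
        return timedelta(minutes=(tzflag - 100) * 15)

    def offset_to_tzflag(offset):
        """Encode a UTC offset (a multiple of 15 minutes) as a GDAL TZFlag."""
        return 100 + int(offset / timedelta(minutes=15))

    assert offset_to_tzflag(timedelta(hours=5, minutes=30)) == 122
    assert tzflag_to_offset(122) == timedelta(hours=5, minutes=30)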

@theroggy changed the title from "Support timezones when writing (and reading?) timezones" to "Support timezones when reading and writing datetime columns" on Jun 7, 2022
@m-richards
Member

Hey @theroggy have you done any work on this? If not I might have a look if that's okay.

@theroggy
Member Author

Hey @theroggy have you done any work on this? If not I might have a look if that's okay.

@m-richards No I haven't... feel free!

@m-richards
Member

m-richards commented Aug 22, 2022

So, having a look at this in a bit more detail, there's more complexity than I originally realised (unless I'm missing something): numpy datetime64 doesn't support UTC offsets, which seems to make numpy arrays a little tricky to use as the intermediate between dataframes and OGR.

Some thoughts:

  • Create a custom representation of a UTC offset field as a datetime field plus a second field encoding the offset - bad for interop with other tools, so not ideal. Edit: this could just be a coupled set of numpy arrays that then get recombined at the OGR write step, which would avoid the issue of inconsistent formats (see the sketch below).
  • Convert timezone-aware fields to UTC on write and read them back as UTC - simpler, but not a fully faithful round-trip; we'd still need to handle naive datetimes and distinguish them from UTC ones.
  • The last thought I have seems rather complex, but would it make sense to use the pandas datetime64[ns, tz] extension array dtype and feed that (or the underlying representation) down to ogr_write?

(Apologies if I'm missing something obvious - I thought this might be an alright point to get familiar with some of the pyogrio internals, but I'm still very much learning at this point.)
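
To make the first option concrete: a tz-aware pandas column could be split into a naive datetime64 array plus a per-row offset array before the OGR write step. A rough sketch under that assumption (not pyogrio code):

    import numpy as np
    import pandas as pd

    # Australia/Sydney observes DST, so the UTC offset varies per row.
    ser = pd.Series(
        pd.to_datetime(["2022-01-15 12:00", "2022-06-15 12:00"]).tz_localize(
            "Australia/Sydney"
        )
    )

    # local wall-clock times as a plain (tz-less) datetime64 array
    naive = ser.dt.tz_localize(None).to_numpy()

    # companion array of UTC offsets in minutes, one per row
    offsets = np.array(
        [ts.utcoffset() // pd.Timedelta(minutes=1) for ts in ser], dtype="int64"
    )
    print(offsets)  # [660 600] -> +11:00 (AEDT) and +10:00 (AEST)

The two arrays would then be recombined when setting each OGR field.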

@theroggy
Member Author

Hmm... indeed... when writing, it seems the datetimes are at the moment already implicitly converted to UTC during the conversion of the pandas series to numpy arrays:
https://pandas.pydata.org/docs/reference/api/pandas.Series.values.html
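
A quick illustration of that implicit conversion (the tz information itself is dropped; only the UTC instants survive):

    import pandas as pd

    ser = pd.Series(pd.to_datetime(["2022-08-22 12:00"]).tz_localize("Europe/Brussels"))
    print(ser.dtype)   # datetime64[ns, Europe/Brussels]
    print(ser.values)  # ['2022-08-22T10:00:00.000000000'] -- converted to UTC, tz dropped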

@theroggy
Member Author

theroggy commented Aug 22, 2022

At the time of adding write support I noticed that the conversion to python datetime took quite some time... so I wondered if it would be faster to convert the datetime columns to strings and use OGRParseDate to let GDAL parse them back to datetimes.

Not a super elegant solution, but it might be an option here as well...
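
For illustration, numpy can render a whole datetime64 column to ISO strings in one vectorized call, avoiding a Python datetime object per row; GDAL could then parse those strings back with OGRParseDate. A sketch of the idea only, not pyogrio code:

    import numpy as np

    arr = np.array(["2022-08-22T12:00:00.123", "2022-08-23T08:30:00.000"], dtype="datetime64[ms]")
    # vectorized ISO 8601 formatting of the whole column at once
    strings = np.datetime_as_string(arr, unit="ms")
    print(strings)  # ['2022-08-22T12:00:00.123' '2022-08-23T08:30:00.000']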

@jorisvandenbossche
Member

so I wondered if it would be faster to convert the datetime columns to strings and use OGRParseDate to let GDAL parse them back to datetimes.

That could indeed be something to test and time.
Unfortunately, I don't see any API for the reverse (so that we could also read the datetimes through a string representation and parse those on our side)? That could also have been a solution for carrying the timezone information in the string.

It seems that OGRParseDate ignores milliseconds though, so if we go this route it would have to be conditional on the presence of milliseconds, making it a less attractive option...

@jorisvandenbossche
Member

Convert timezone aware fields to UTC on write, read them back as utc - simpler, but not a fully faithful round-trip, still need to handle naive datetimes and distinguishing them from utc ones.

I think getting back the datetimes in UTC would already be an improvement, and maybe the best we can do in the short term.

@jorisvandenbossche
Member

I was looking into this a bit, some observations:

  • For reading, we could use OGR_F_GetFieldAsString on a datetime field (instead of OGR_F_GetFieldAsDateTimeEx) to get a string formatted as "%Y/%m/%d %H:%M:%S(%z)" (so ISO formatting, except for the use of / instead of - in the date part). That way we could delay the parsing to datetime64 dtype, and for read_dataframe use pandas' to_datetime parser, which can handle strings with tz offsets and produce a UTC datetime64 dtype (see the sketch after this list). This is also faster than the current implementation (data[i] = datetime.datetime(..).isoformat() in the reading loop).
    One possible way to do this would be a datetime_as_string=False/True keyword added to read(), which can be propagated to get_fields/process_fields in _io.pyx. Alternatively, we could always do this for datetime fields in _io.pyx, and only convert the strings to a datetime64[ms]-dtyped array after reading all features.

  • An alternative solution for reading would be to account for the tz offset when creating the datetime.datetime object (so we get the UTC values), but then we still need a way to signal to read_dataframe / pandas that the datetime64[ms] array is tz-aware (as UTC) and not tz-naive, so we can use the correct dtype for the column in the resulting geodataframe.
    We could somehow keep track of this and then return those flags from ogr_read (in an extra key in the meta dict?), but doing this separately from returning the data also sounds quite ugly (the way this is currently implemented, we are limited to returning just numpy arrays, with no extra information about the original data type).

  • For writing, we have a similar problem: we are passing in a numpy datetime64 array with UTC values (which we already do), but inside ogr_write we just see the numpy array, without knowing whether it was originally tz-aware (in which case we need to indicate that it is UTC when setting the field) or tz-naive. So here too we would need to pass down some kind of per-column flag to carry this extra information about the dtype. Or maybe, now that I am writing this, we could also pass down the original pandas dtype per column, in addition to the numpy arrays of field_data. That's maybe more generic, and that information could then be used to check whether it's tz-aware or not in _io.pyx.

  • Also for writing, we could check whether going through a string repr is faster than creating a datetime.datetime object (to access all the datetime fields). As mentioned above, OGRParseDate (which GDAL will use under the hood when calling OGR_F_SetFieldString on a datetime field) is documented to ignore milliseconds, but that's something we should verify, as I am not sure that is actually the case based on a quick look at the implementation.
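
A sketch of the read side described in the first bullet, assuming strings formatted roughly as OGR_F_GetFieldAsString returns them; pd.to_datetime(..., utc=True) folds the per-value offsets into a single datetime64[ns, UTC] dtype:

    import pandas as pd

    # example strings in OGR's "%Y/%m/%d %H:%M:%S%z" shape
    raw = ["2020/01/01 10:00:00+0100", "2020/06/01 10:00:00+0200"]
    parsed = pd.to_datetime(raw, utc=True)
    print(parsed)
    # DatetimeIndex(['2020-01-01 09:00:00+00:00', '2020-06-01 08:00:00+00:00'],
    #               dtype='datetime64[ns, UTC]', freq=None)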

@m-richards
Member

I've been looking at this again too and have opened a very rough proof of concept at #253, to try to get us moving forward on this (Joris, you already have detailed notes; I don't know if you have a prototype as well - more than happy for mine to be superseded). Some things I've learnt as part of doing that:

  • I opted to read datetimes as strings, rather than look at accounting for the tz offset when creating datetime.datetime objects inside ogr_read. Basically the string parsing just seemed simpler as a first draft.
    (Separately, the PR currently overlooks the case of the numpy API reading a spatial file with timezones without converting to strings; this probably should be added.)

  • For writing,

    inside ogr_write we just see the numpy array, without knowing whether it was originally tz-aware

    we could also pass down the original pandas dtype per column, in addition to the numpy arrays of field_data

    On both of these: GDAL is expecting values in local time, not UTC. That's fine, we can convert a pandas datetime column to naive and pass that down in the numpy array. However, we then need to recover the timezone offset for each row. Originally I thought we could pass down the timezone information per column, but this doesn't work if the original pandas data was e.g. in timezone Australia/Sydney, which has summer daylight saving. In these cases the offset can be row dependent, so we'd need to track it for each row. If you pass the dtypes down, to actually use these to recover a timezone offset you have to re-localize the datetime objects, something like:

     # field_value is a naive datetime64 value; .item() gives a datetime.datetime
     dt = field_value.astype("datetime64[ms]").item()
     # re-localize to tz and express the UTC offset in GDAL's 15-minute units
     gdal_tz = int(dt.replace(tzinfo=tz).utcoffset() / datetime.timedelta(minutes=15)) + 100

at least as far as I can tell (I might be missing something). At this point I decided to park this and look at writing strings instead; I can't imagine the above is very performant (but if we decide it's worth benchmarking explicitly, I can go back and have a look).

@jorisvandenbossche
Member

Thanks Matt!

[For writing] On both of these GDAL is expecting values in a local time, not utc. That's fine, we can convert a pandas datetime column to naïve, and then pass this down in the numpy array. However, we then need to recover the timezone offset for each row.

I think going through strings is a good alternative to explore, but in case we want to get back to using OGR_F_SetFieldDateTimeEx, I think it could also be a sufficient option to just store tz-aware data as UTC (yes, that doesn't preserve the actual offset / local time information, but at least it preserves the time points). For that, we can pass the numpy array with UTC values down to ogr_write, and then we only need a single True/False flag per column indicating whether it's local or UTC.
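
A sketch of that column-preparation step (the helper name is hypothetical, not pyogrio API):

    import pandas as pd

    def prepare_datetime_column(ser: pd.Series):
        """Return (naive datetime64 array holding UTC values, is_utc flag)."""
        if ser.dt.tz is not None:
            # tz-aware: normalize to UTC and drop the tz so numpy can hold it
            return ser.dt.tz_convert("UTC").dt.tz_localize(None).to_numpy(), True
        # tz-naive: pass through unchanged; flag as local/unknown time
        return ser.to_numpy(), False

ogr_write would then set GDAL's TZFlag to 100 (GMT) for flagged columns and leave it unset otherwise.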

At this point I decided to park this and look at writing strings, I can't imagine this is very performant (but if we decide it's worth benchmarking explicitly it I can go back and have a look).

I think it might be worth checking performance, but I don't necessarily expect it to be too bad. Currently, the code path for writing datetimes is also not great, because we need to convert each timestamp value (stored by numpy/pandas as a single integer) to a python datetime.datetime object (which stores the components), and get the year, month, day, hour, etc. from it.
