ENH: improve support for datetime columns #486

theroggy · 2024-10-17T20:02:49Z

This PR improves support for datetime columns, mainly in read_dataframe and write_dataframe:

Fix: when a GPKG was read with use_arrow, naive datetimes (no timezone) were interpreted as being UTC. So a naive time of 05:00 h was interpreted as 05:00 UTC.
Fix: when a .fgb was read with use_arrow, for datetime columns with a timezone the timezone was dropped, so 05:00+5:00 was read as 05:00.
Fix: when a file was written with use_arrow, for datetime columns with any timezone but UTC, the timezone was dropped, so 05:00+5:00 was written as 05:00 (a naive datetime), for all filetypes.
When reading datetimes with use_arrow, don't convert/represent them as being in UTC time if they have another timezone offset in the dataset.
Add support to write columns with mixed timezones. Typically the column needs to be of the object type with pandas.Timestamp or datetime objects in it, as "standard" pandas datetime64 columns don't support mixed timezone offsets in a column.
Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.
For the cases with use_arrow, the fixes typically require GDAL >= 3.11 (OGRLayer::GetArrowStream(): add a DATETIME_AS_STRING=YES/NO option OSGeo/gdal#11213).

Resolves #487

jorisvandenbossche

Thanks for diving into this and improving the test coverage!

jorisvandenbossche

@theroggy thanks for further looking into this!

I do have some doubts about how much effort we should do to cover corner cases and what the desired default behaviour should be, see my comments below.

jorisvandenbossche · 2025-01-18T14:35:56Z

pyogrio/geopandas.py

-    # if object dtype, try parse as utc instead
-    if res.dtype == "object":
-        try:
-            res = pd.to_datetime(ser, utc=True, **datetime_kwargs)
-        except Exception:
-            pass


From your top post explanation:

> Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.

First, I don't think this will work with the upcoming pandas 3.x. We are suppressing the warning above, which says that parsing mixed timezones is going to raise unless you pass utc=True, and that you have to use apply and datetime.datetime.strptime instead to get mixed offset objects. (But the tests are also passing, so maybe I am missing something.)

Second, a column of mixed offset objects is in general not particularly useful. So changing this behaviour feels like a regression to me. I understand that we might want to give the user the option to get this, but as the default, I am not sure.

theroggy · 2025-01-19 (edited)

> From your top post explanation:
>
> > Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.
>
> First, I don't think this will work with the upcoming pandas 3.x. We are suppressing the warning above, which says that parsing mixed timezones is going to raise unless you pass utc=True, and that you have to use apply and datetime.datetime.strptime instead to get mixed offset objects. (But the tests are also passing, so maybe I am missing something.)

Yes, I saw. Do you know the rationale for forcing people in pandas 3 to use a less efficient way (apply) to get to their data?

> Second, a column of mixed offset objects is in general not particularly useful. So changing this behaviour feels like a regression to me. I understand that we might want to give the user the option to get this, but as the default, I am not sure.

For starters, to be clear, this is only relevant for mixed timezone data. Data saved in naive or UTC timestamps should just stay "regular" pandas datetime columns.

For the case of mixed timezone data, it depends on what you want to do with the datetime data. If it is just to look at it/show/keep it as it is part of the table data, the Timestamps look just fine to me. If you really want to do "fancy stuff" with the datetimes it will in pandas indeed be more convenient for some things to transform them into e.g. UTC datetimes to get a datetime column instead of an object column.

Regarding default behaviour, it feels quite odd to me to transform data by default to a form where information (the original time zone) is lost. Also because when you save the data again, it will then be saved as UTC as well, so also: the timezone information will be lost.

To me, the other way around is more logical: by default you don't lose data. If you want to do "fancy stuff" with a datetime column that contains mixed timezone data, you convert it to e.g. UTC, typically in an extra column, because most likely you will want to keep the timezone information when saving again.
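
The workflow described above (keep the mixed offsets as they are, and derive a UTC column only when datetime arithmetic is needed) can be sketched in plain pandas. This is an illustrative sketch, not code from this PR; the data and the column names `when` and `when_utc` are made up for the example.

```python
import pandas as pd

# Hypothetical data: two timestamps with different UTC offsets.
# dtype=object keeps each Timestamp with its own offset.
mixed = pd.Series(
    [
        pd.Timestamp("2025-01-18T05:00:00+05:00"),
        pd.Timestamp("2025-01-18T05:00:00-03:00"),
    ],
    dtype=object,
)
df = pd.DataFrame({"when": mixed})

# Extra column for datetime operations: everything converted to UTC.
# The original column, including its offsets, is left untouched.
df["when_utc"] = pd.to_datetime(df["when"], utc=True)
```

The object column still carries the original offsets, while `when_utc` is a regular timezone-aware datetime column that supports the `.dt` accessor.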

jorisvandenbossche · 2025-01-18T14:44:49Z

pyogrio/geopandas.py

+        elif col.dtype == "object":
+            # Column of Timestamp objects, also split in naive datetime and tz offset
+            col_na = df[col.notna()][name]
+            if len(col_na) and all(isinstance(x, pd.Timestamp) for x in col_na):


I am a bit hesitant to add custom support for this. Given that it is not really supported by pandas itself, do we need to add special support for it?

Right now, if you have an object dtype column with Timestamp objects, they already get written as strings, which in the end preserves the offset information (in the string representation).
The column might read back as strings (depending on the file format), but at that point the user can handle it as they see fit.

theroggy · 2025-01-20

If they are written as strings to the output files without the proper metadata, it varies from format to format whether they will be recognized as datetimes when read. For text files they will typically be recognized as datetimes, as the data types are "guessed" when the file is read (e.g. GeoJSON). For files like .fgb and .gpkg they won't be recognized, as the file metadata will be wrong.

That's not very clean, and as it is very easy to solve I don't quite see the point of not supporting it properly?
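
For illustration, the check from the diff hunk above (an object column whose non-null values are all pandas.Timestamp) combined with ISO serialisation could look roughly like this. `timestamps_to_iso` is a hypothetical helper name, not a pyogrio function, and the sketch leaves out the Arrow field metadata that tells GDAL the strings are datetimes.

```python
import pandas as pd


def timestamps_to_iso(col: pd.Series) -> pd.Series:
    """Return ISO 8601 strings if `col` is an object column of Timestamps.

    Hypothetical helper: any other column is returned unchanged.
    """
    if col.dtype != object:
        return col
    non_null = col[col.notna()]
    if len(non_null) == 0:
        return col
    if not all(isinstance(x, pd.Timestamp) for x in non_null):
        return col
    # isoformat() keeps the per-value offset, e.g. "+05:00"
    return col.map(lambda x: None if pd.isna(x) else x.isoformat())


# Made-up example data with mixed offsets and a missing value
col = pd.Series(
    [
        pd.Timestamp("2025-01-18T05:00:00+05:00"),
        pd.Timestamp("2025-01-18T05:00:00-03:00"),
        None,
    ],
    dtype=object,
)
out = timestamps_to_iso(col)

# A column of plain strings is passed through as-is
strings = pd.Series(["a", "b"], dtype=object)
unchanged = timestamps_to_iso(strings)
```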

jorisvandenbossche · 2025-01-18T14:51:44Z

pyogrio/geopandas.py

+            elif isinstance(dtype, pd.DatetimeTZDtype):
+                # Also for regular datetime columns with timezone mixed timezones are
+                # possible when there is a difference between summer and winter time.
+                df[name] = col.apply(lambda x: None if pd.isna(x) else x.isoformat())
+                datetime_cols.append(name)


Is this needed for properly typed datetime64 columns?
What does GDAL do with those values? Write as the UTC value? And with this change it will write it, still as a datetime (because of adding the metadata?), but with offset?

FWIW, if we need to do this, you can do df[name].astype(str) to avoid the apply

theroggy · 2025-01-20 (edited)

> Is this needed for properly typed datetime64 columns? What does GDAL do with those values? Write as the UTC value?

It is actually a bit weird. Based on what I tested, when the column is in the UTC time zone, the data is written correctly to the file. If the column has another time zone, the time zone is simply dropped and the naive times are written. This is the case for both GPKG and e.g. .geojson.

Hence, naive and UTC times can be written via a native arrow datetime column, but in the other cases the data needs to be written via a detour through a string column.
There is a TIMEZONE option that can be specified in the GDAL arrow code path, but it only applies at layer level, not per column, so that's not super useful either. Based on a quick test it also didn't seem to work for timezones like "CET".

> And with this change it will write it, still as a datetime (because of adding the metadata?), but with offset?

With this change, offsets will be correctly written as timestamps, indeed because of the custom arrow metadata that is added and that GDAL interprets as of GDAL 3.11 (OSGeo/gdal#11213).

> FWIW, if we need to do this, you can do df[name].astype(str) to avoid the apply

Interesting! Note however that df[name].astype(str) outputs "... ..." instead of "...T...", so not strictly valid ISO strings. But astype gives significantly better performance (2 sec instead of 12 sec for nz buildings), and the separator is probably not that important, so I changed it. If we would rather have "...T...", we can add .str.replace(" ", "T") for GDAL < 3.11; that only adds 0.5 sec.
I used df[name].astype("string"), otherwise None/NaT is also cast to a string.
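
The trade-off discussed above can be reproduced with a small, made-up column. This is a sketch in plain pandas, not the code from the PR: it compares the `astype("string")` route (space separator, missing values kept as missing) with the element-wise `isoformat()` route ("T" separator).

```python
from datetime import timedelta, timezone

import pandas as pd

# Made-up tz-aware datetime64 column with one missing value
naive = pd.Series(pd.to_datetime(["2025-01-18 05:00:00", None]))
col = naive.dt.tz_localize(timezone(timedelta(hours=5)))

# Vectorised cast: fast, keeps the missing value, but uses a space
# between date and time
as_string = col.astype("string")

# Optional fix-up to get the ISO 8601 "T" separator
with_t = as_string.str.replace(" ", "T", regex=False)

# Element-wise alternative: slower on large columns, "T" separator directly
iso = col.apply(lambda x: None if pd.isna(x) else x.isoformat())
```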

jorisvandenbossche · 2025-01-18T15:05:27Z

pyogrio/tests/test_geopandas_io.py

+    if use_arrow and ext == ".gpkg" and __gdal_version__ < (3, 11, 0):
+        pytest.skip("Arrow datetime handling improved in GDAL >= 3.11")


What is not yet working for the case with no tz for GPKG?

theroggy · 2025-01-20

Datetimes in a GPKG without timezone are currently interpreted as being UTC. So a naive time of 05:00 h is interpreted as 05:00 UTC.

This is one of the issues listed in #487 (comment)
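
In pandas terms, the difference between the expected result and the one described above is the difference between a naive datetime64 column and a UTC-aware one. A minimal sketch with a made-up value (this only illustrates the two representations, not the reading code itself):

```python
import pandas as pd

# Expected: a naive column, no timezone attached
naive = pd.Series(pd.to_datetime(["2025-01-18 05:00:00"]))

# Problem case: the same wall-clock time, but tagged as UTC
as_utc = naive.dt.tz_localize("UTC")
```

The wall-clock value is identical in both columns, but the second one claims an instant in time that the source data never specified.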

ENH: deal properly with naive datetimes with arrow

aaf8818

theroggy marked this pull request as draft October 17, 2024 20:02

theroggy mentioned this pull request Oct 17, 2024

Differences in how datetime columns are treated with arrow=True #487

Open

Add more testcases, also for tz datetimes

3e463a1

theroggy changed the title ~~ENH: deal properly with naive datetimes with arrow~~ TST: add tests exposing some issues with datetimes with arrow? Oct 18, 2024

jorisvandenbossche reviewed Nov 6, 2024

Merge remote-tracking branch 'upstream/main' into ENH-deal-properly-with-naive-datetimes-with-arrow

afdd0c1

theroggy changed the title ~~TST: add tests exposing some issues with datetimes with arrow?~~ ENH: improve datetime support with arrow for GDAL >= 3.11 Jan 16, 2025

theroggy changed the title ~~ENH: improve datetime support with arrow for GDAL >= 3.11~~ ENH: improve read support for naive and mixed datetimes with arrow for GDAL >= 3.11 Jan 16, 2025

theroggy changed the title ~~ENH: improve read support for naive and mixed datetimes with arrow for GDAL >= 3.11~~ ENH: improve read support for datetimes with arrow for GDAL >= 3.11 Jan 16, 2025

theroggy changed the title ~~ENH: improve read support for datetimes with arrow for GDAL >= 3.11~~ ENH: improve read support for datetime columns with arrow for GDAL >= 3.11 Jan 16, 2025

theroggy added 4 commits January 17, 2025 09:09

Use datetime_as_string for reading with arrow

c18ab22

Update _io.pyx

597855f

Skip tests where appropriate

fa4b86e

Improve support for mixed and naive datetimes

0e41ae4

theroggy changed the title ~~ENH: improve read support for datetime columns with arrow for GDAL >= 3.11~~ ENH: improve support for datetime columns with mixed or naive times Jan 17, 2025

theroggy added 7 commits January 17, 2025 22:42

Skip use_arrow tests with old gdal versions

1378ace

Take in account pandas version

0f1ab27

Update test_geopandas_io.py

6f78c68

Also support columns with datetime objects

336d0d8

Rename some test functions for consistency

3035a11

Avoid warning in test

9efdc09

Improve inline comment

eb80e08

theroggy marked this pull request as ready for review January 18, 2025 08:43

theroggy requested a review from jorisvandenbossche January 18, 2025 08:43

Update CHANGES.md

d50b2d0

jorisvandenbossche reviewed Jan 18, 2025

theroggy added 3 commits January 19, 2025 08:27

Merge remote-tracking branch 'upstream/main' into ENH-deal-properly-with-naive-datetimes-with-arrow

47aa298

Simplify code

1efa5bf

Don't cast UTC data to string when writing

0032839

Various improvements to tests

9d2bfce

- Test result < GDAL 3.11 instead of skipping - Add UTC test - ...

theroggy marked this pull request as draft January 20, 2025 16:31

theroggy added 2 commits January 20, 2025 17:58

Small fixes to tests

ca9a8ae

Xfail some tests where needed

deb862c

theroggy marked this pull request as ready for review January 20, 2025 22:11

theroggy changed the title ~~ENH: improve support for datetime columns with mixed or naive times~~ ENH: improve support for datetime columns Jan 22, 2025

theroggy added 7 commits January 22, 2025 22:23

Make UTC assert more specific

e35c356

Revert "Make UTC assert more specific"

593b282

This reverts commit e35c356.

Update test_geopandas_io.py

35d8d87

Use astype("string") instead of apply

41c9da6

Needs to be astype("string") instead of astype(str) to support nan values

Improve tests

f53af87

Fix tests for older versions

a8c85b7

Update test_geopandas_io.py

40ca1a5
