-
-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: improve support for datetime columns #486
Open
theroggy
wants to merge
29
commits into
geopandas:main
Choose a base branch
from
theroggy:ENH-deal-properly-with-naive-datetimes-with-arrow
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+335
−50
Open
Changes from all commits
Commits
Show all changes
29 commits
Select commit
Hold shift + click to select a range
aaf8818
ENH: deal properly with naive datetimes with arrow
theroggy 3e463a1
Add more testcases, also for tz datetimes
theroggy afdd0c1
Merge remote-tracking branch 'upstream/main' into ENH-deal-properly-w…
theroggy c18ab22
Use datetime_as_string for reading with arrow
theroggy 597855f
Update _io.pyx
theroggy fa4b86e
Skip tests where appropriate
theroggy 0e41ae4
Improve support for mixed and naive datetimes
theroggy 1378ace
Skip use_arrow tests with old gdal versions
theroggy 0f1ab27
Take in account pandas version
theroggy 6f78c68
Update test_geopandas_io.py
theroggy 336d0d8
Also support columns with datetime objects
theroggy 3035a11
Rename some test functions for consistency
theroggy 9efdc09
Avoid warning in test
theroggy eb80e08
Improve inline comment
theroggy d50b2d0
Update CHANGES.md
theroggy 47aa298
Merge remote-tracking branch 'upstream/main' into ENH-deal-properly-w…
theroggy 1efa5bf
Symplify code
theroggy 0032839
Don't cast UTC data to string when writing
theroggy 9d2bfce
Various improvements to tests
theroggy ca9a8ae
Smal fixes to tests
theroggy deb862c
Xfail some tests where needed
theroggy e35c356
Make UTC assert more specific
theroggy 593b282
Revert "Make UTC assert more specific"
theroggy 35d8d87
Update test_geopandas_io.py
theroggy 41c9da6
Use astype("string") instead of apply
theroggy f53af87
Improve tests
theroggy a8c85b7
Fix tests for older versions
theroggy 40ca1a5
Update test_geopandas_io.py
theroggy fc53d44
Merge remote-tracking branch 'upstream/main' into ENH-deal-properly-w…
theroggy File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From your top post explanation:
First, I don't think this will work with upcoming pandas 3.x (we are suppressing the warning above about mixed timezones going to raise unless passing
utc=True
, and that you have to useapply
anddatetime.datetime.strptime
instead to get mixed offset objects)(but the tests are also passing, so maybe I am missing something)
Second, a column of mixed offset objects is in general not that particularly useful .. So changing this behaviour feels like a regression to me. I understand that we might want to provide the user the option to get this, but by default, I am not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I saw. Do you know what the rationale is that in pandas 3 people are being forced to use a more inefficient way (apply) to get to your data? I did some performance tests and especially the way it is advised in the warning is really slow:
to_datetime()
it takes 0.6 sec to convert 1.5 mio stringsapply
usingfrom_isoformat()
it takes 1 sec to convert 1.5 mio strings, but you will first need to callto_datetime()
first so it can throw an error, so up to 0.6 seconds will need to be added to the time.apply
usingstrptime()
it takes 56 sec to convert 1.5 mio stringsFor starters, to be clear, this is only relevant for mixed timezone data. Data saved in naive or UTC timestamps or if all datetimes have the same timezone offset, they should/will just stay "regular" pandas datetime columns.
Note: a column with datetimes in a timezone with daylight saving time will also typically lead to mixed timezones as they will typically have 2 different timezone offsets.
For the case of mixed timezone data, it depends on what you want to do with the datetime data. If it is just to look at it/show/keep it as it is part of the table data, the Timestamps look just fine to me. If you really want to do "fancy stuff" with the datetimes it will in pandas indeed be more convenient for some things to transform them into e.g. UTC datetimes to get a datetime column instead of an object column.
Regarding default behaviour, it feels quite odd to me to transform data by default to a form where information (the original time zone) is lost. Also because when you save the data again, it will then be saved as UTC as well, so also: the timezone information will be lost.
To me, the other way around is more logical: by default you don't loose data. If you want to do "fancy stuff" with a datetime column that contains mixed timezone data, you convert it to e.g. UTC, typically in an extra column, because most likely you will want to keep the original timezone information again when saving.