`DataFrame(::PyPandasDataFrame)` converts date & datetime to bytes #293

tomdstone · 2023-04-13T18:51:11Z

When doing some work involving dataframes in python via PythonCall, it seems like DataFrame(PyTable(p)) where p is a pandas data table converts the date and datetime columns into byte vectors. Is this issue related to the issue #265 with milliseconds vs microseconds, or due to a missing part of the DataFrame(::PyPandasDataFrame) implementation?

Here are a few minimal examples, in a conda environment with pandas.

using PythonCall
using Dates
using DataFrames

a = DataFrame(x = [now()])  # julia dataframe
b = pytable(a)              # pandas dataframe
c = PyTable(b)              # PyPandasDataFrame
d = DataFrame(c)            # julia dataframe again

This results in:

julia> c
1×1 PyPandasDataFrame
                        x
0 2023-04-13 14:36:13.939

julia> d
1×1 DataFrame
 Row │ x
     │ PyArray…
─────┼───────────────────────────────────
   1 │ UInt8[0xc0, 0x62, 0x31, 0x8c, 0x…

The same thing happens when initially defining b as a pandas dataframe, so the microsecond issue in #265 seems to not be the problem?

julia> b = pd.DataFrame([[dt.datetime.now()]])
Python DataFrame:
                           0
0 2023-04-13 14:46:57.940077

julia> c = PyTable(b)
1×1 PyPandasDataFrame       
                           0
0 2023-04-13 14:46:57.940077

julia> d = DataFrame(c)
1×1 DataFrame
 Row │ 0
     │ PyArray…
─────┼───────────────────────────────────
   1 │ UInt8[0xc8, 0xf9, 0xa5, 0x7d, 0x…

The text was updated successfully, but these errors were encountered:

tomdstone · 2023-04-13T18:55:26Z

If you replace the now() with today() in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors

cjdoris · 2023-04-24T17:19:35Z

Just looking at this now. It all comes down to the fact that numpy arrays containing numpy.datetime64 cannot be wrapped as a Julia array yet. Therefore it's falling back on converting each element, and because numpy.datetime64 actually implements the buffer protocol with itemtype=1byte, that gets converted to a PyArray{UInt8} as you're seeing.

So there's two things that could be fixed here:

Support wrapping arrays containing datetime64
Support converting datetime64 to something more useful than PyArray{UInt8}.

cjdoris · 2023-04-24T17:19:52Z

(PS thanks for the report and the MWE)

hhaensel · 2023-06-29T10:28:22Z

Just stumbled across that problem and solved it this way:

function pytimestamp_to_datetime(t::PyArray)
    DateTime(1970) + Second(reinterpret(Int64, t)[1] / 1000000000)
end

function pytimestamp_to_datetime(v::AbstractVector{<:PyArray})
    pytimestamp_to_datetime.(v)
end

EDIT: This only holds for datetime64[ns], otherwise the factor is different. I just filed a tentative PR for a general solution

hhaensel · 2023-06-30T10:14:59Z

@cjdoris
If I eliminate .values in line 61 of PyPandasDataFrame everything runs smoothly. Would that be an option or is there some performance issue or other issue that I am not aware of?

UPDATE: That doesn't work, columns will be vectors of Py objects

hhaensel · 2023-07-01T17:49:14Z

If you replace the now() with today() in the specific example I gave, it converts dates correctly, but in my use case where I had two date columns and two datetime columns, both the date and datetime columns are converted to byte vectors

@tomdstone
Could you check if also your use case where you had difficulties with time is working with the PR in place?
Otherwise, please share an MWE.
I could try to cover your use case as well.

hhaensel · 2023-07-02T21:35:54Z

Just added support for timedelta and timedelta64 conversion to Dates.CompoundPeriod.

tomdstone · 2023-07-03T14:28:58Z

@hhaensel sorry, I won't have time to implement it due to how work is right now (we are in the middle of a move). I might follow up in a few months, but I definitely wouldn't hold your breath waiting for me. I appreciate the thought though!

hhaensel · 2023-07-03T15:17:16Z

Thanks for the reply. No need to hurry. I just filed the PR as I was in need of a solution for my use case.

hhaensel · 2023-07-04T06:54:46Z

@cjdoris
I added automatic conversion from Periods/CompoundPeriods to numpy:timedelta64 as this supports nanoseconds as well. When I use pytimedeltatype, I can only go down to microseconds. Is that a good decision? It seems that pandas can convert the timedelta64 to a timestamp.timedelta that handles nanoseconds correctly.
For conversion from timedelta to Julia I used Dates.CompoundPeriod in order to be type stable. Otherwise one could use Nanosecond, but that seemed a bit weired to me.
I know you're working for version 1 to come out soon. I hope my proposal fits in there somehow.

hhaensel · 2023-07-04T09:34:47Z

Just want to share what's working now with the PR

julia> x = Py(Second(1))
Python: numpy.timedelta64(1,'s')

julia> pyconvert(Second, x)
1 second

julia> x = Py(Second(1) + Nanosecond(1))
Python: numpy.timedelta64(1000000001,'ns')

julia> y = pyconvert(Any, x)
1 second, 1 nanosecond

julia> typeof(y)
Dates.CompoundPeriod


julia> jdf = DataFrame(x = [now() + Second(rand(1:1000)) for _ in 1:100], y = [Second(n) for n in 1:100]);
julia> pdf = pytable(jdf)
Python:
                         x               y
0  2023-07-04 11:29:27.781 0 days 00:00:01
1  2023-07-04 11:30:02.781 0 days 00:00:02
2  2023-07-04 11:40:17.781 0 days 00:00:03
3  2023-07-04 11:31:11.781 0 days 00:00:04
           ... 5 more lines ...
98 2023-07-04 11:37:21.781 0 days 00:01:39
99 2023-07-04 11:35:53.781 0 days 00:01:40

[100 rows x 2 columns]

julia> jdf2 = DataFrame(PyTable(pdf))
100×2 DataFrame
 Row │ x                        y
     │ DateTime                 Compound…   
─────┼──────────────────────────────────────
   1 │ 2023-07-04T11:29:27.781  1 second
   2 │ 2023-07-04T11:30:02.781  2 seconds
   3 │ 2023-07-04T11:40:17.781  3 seconds
  ⋮  │            ⋮                  ⋮
  99 │ 2023-07-04T11:37:21.781  99 seconds
 100 │ 2023-07-04T11:35:53.781  100 seconds
                             95 rows omitted

github-actions · 2023-08-22T01:47:36Z

This issue has been marked as stale because it has been open for 30 days with no activity. If the issue is still relevant then please leave a comment, or else it will be closed in 7 days.

hhaensel · 2023-08-22T06:01:55Z

@cjdoris I think this is still relevant

klwlevy · 2023-08-28T15:31:08Z

@cjdoris I think this is still relevant

I agree. Just stumbled on this when using pandas functionality to convert (legacy) xls data to xlsx. The dates became PyArray{UInt8}.

hhaensel · 2024-06-10T21:15:57Z

I'd be ready to adapt the my PR to the latest changes of PythonCall. Should I continue to work on it? Is the refactoring complete or should I still wait a bit?

cjdoris · 2024-06-13T07:41:39Z

Yep the refactor is done. Can you clarify what the PR will change?

hhaensel · 2024-06-14T06:57:54Z

I think I summarised everything quite nicely above (#293 (comment))

The related PR is #334, where you commented that you are refactoring.

hhaensel · 2024-06-17T06:27:23Z

@cjdoris After refactoring I decided that it is easier to submit a new PR #509

tomdstone changed the title ~~DataFrame(::PyPandasDataFrame) converts datetime to bytes~~ DataFrame(::PyPandasDataFrame) converts datetime to bytes Apr 13, 2023

tomdstone changed the title ~~DataFrame(::PyPandasDataFrame) converts datetime to bytes~~ DataFrame(::PyPandasDataFrame) converts date & datetime to bytes Apr 13, 2023

hhaensel mentioned this issue Jul 1, 2023

Add support for datetime64 #334

Open

github-actions bot added the stale Issues about to be auto-closed label Aug 22, 2023

github-actions bot removed the stale Issues about to be auto-closed label Aug 23, 2023

hhaensel linked a pull request Jun 16, 2024 that will close this issue

Add timedelta, timedelta64 and datetime64 plus respective conversions #509

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`DataFrame(::PyPandasDataFrame)` converts date & datetime to bytes #293

`DataFrame(::PyPandasDataFrame)` converts date & datetime to bytes #293

tomdstone commented Apr 13, 2023

tomdstone commented Apr 13, 2023

cjdoris commented Apr 24, 2023

cjdoris commented Apr 24, 2023

hhaensel commented Jun 29, 2023 •

edited

Loading

hhaensel commented Jun 30, 2023 •

edited

Loading

hhaensel commented Jul 1, 2023 •

edited

Loading

hhaensel commented Jul 2, 2023

tomdstone commented Jul 3, 2023

hhaensel commented Jul 3, 2023

hhaensel commented Jul 4, 2023

hhaensel commented Jul 4, 2023

github-actions bot commented Aug 22, 2023

hhaensel commented Aug 22, 2023

klwlevy commented Aug 28, 2023

hhaensel commented Jun 10, 2024

cjdoris commented Jun 13, 2024

hhaensel commented Jun 14, 2024

hhaensel commented Jun 17, 2024

DataFrame(::PyPandasDataFrame) converts date & datetime to bytes #293

DataFrame(::PyPandasDataFrame) converts date & datetime to bytes #293

Comments

tomdstone commented Apr 13, 2023

tomdstone commented Apr 13, 2023

cjdoris commented Apr 24, 2023

cjdoris commented Apr 24, 2023

hhaensel commented Jun 29, 2023 • edited Loading

hhaensel commented Jun 30, 2023 • edited Loading

hhaensel commented Jul 1, 2023 • edited Loading

hhaensel commented Jul 2, 2023

tomdstone commented Jul 3, 2023

hhaensel commented Jul 3, 2023

hhaensel commented Jul 4, 2023

hhaensel commented Jul 4, 2023

github-actions bot commented Aug 22, 2023

hhaensel commented Aug 22, 2023

klwlevy commented Aug 28, 2023

hhaensel commented Jun 10, 2024

cjdoris commented Jun 13, 2024

hhaensel commented Jun 14, 2024

hhaensel commented Jun 17, 2024

`DataFrame(::PyPandasDataFrame)` converts date & datetime to bytes #293

`DataFrame(::PyPandasDataFrame)` converts date & datetime to bytes #293

hhaensel commented Jun 29, 2023 •

edited

Loading

hhaensel commented Jun 30, 2023 •

edited

Loading

hhaensel commented Jul 1, 2023 •

edited

Loading