Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support use of Duration dtype in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix #19697

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Nov 8, 2024

(Supersedes the now-closed #19663).
Closes #7174.

New feature

  • We now support applying dt.to_string() to Duration cols, transforming to an ISO8601 duration string (see: https://en.wikipedia.org/wiki/ISO_8601#Durations for details on this format).
  • The to_string expression (for all temporal dtypes) can now take the shortcut format string "iso", which represents the dtype-specific ISO8601 format for the given column(s). If no explicit format is given, this shortcut string gets set, leading to some nice ergonomic improvements (detailed below).

(Note: the ISO duration string conversion was validated against an established package that does the same conversion; created an ad-hoc parametric test against many thousands of randomly-generated timedeltas, and everything tied-out. Included various edge-cases in the unit tests).

Ergonomics

  • If you do not supply a format string (previously not possible) or you set it to"iso", the associated dtype-specific ISO8601 format is used. This means that you can now make a single expression that will properly format all temporal columns with their dtype-appropriate ISO format, rather than have to separately associate an explicit format string with each dtype.

    Can now write...

    cs.temporal().dt.to_string()

    ...instead of:

    cs.datetime("ms").dt.to_string("%F %T.3f")
    cs.datetime("us").dt.to_string("%F %T.6f")
    cs.datetime("ns").dt.to_string("%F %T.9f")
    cs.datetime("ms","*").dt.to_string("%F %T.3f%:z")
    cs.datetime("us","*").dt.to_string("%F %T.6f%:z")
    cs.datetime("ns","*").dt.to_string("%F %T.9f%:z")
    cs.date("%F")
    cs.time("ms").dt.to_string("%T.3f")
    cs.time("us").dt.to_string("%T.6f")
    cs.time("ns").dt.to_string("%T.9f")

Performance

  • Consolidated the four duration → string functions (and the new "iso" option) into a single fmt_duration_string function, and then optimised it for better performance using itoa conversions and push, push_str calls instead of write! (>= 2x faster in local testing).

Bugfix

  • Transforming a timezone-aware column to string (via cast) did not include the expected ISO timezone offset - spotted this with some new tests and fixed it.

Examples

from datetime import datetime, date, timedelta, time
import polars as pl

df = pl.DataFrame({
    "td": [
        timedelta(days=-1, seconds=-42),
        timedelta(days=14, hours=-10, microseconds=1001),
        timedelta(seconds=0),
    ],
    "tm": [
        time(1, 2, 3, 456789),
        time(23, 59, 9, 101),
        time(0, 0, 0, 1),
    ],
    "dt": [
        date(1999, 3, 1),
        date(2020, 5, 3),
        date(2077, 7, 5),
    ],
    "dtm": [
        datetime(1980, 8, 10, 0, 10, 20),
        datetime(2010, 10, 20, 8, 25, 35),
        datetime(2040, 12, 30, 16, 40, 50),
    ],
}).with_columns(
    dtm_tz = pl.col("dtm").dt.replace_time_zone("Asia/Kathmandu")
)

Cast to dtype-specific ISO8601 string (the default, if no explicit format string given):

df.select(pl.col("td","tm","dt").dt.to_string())
# shape: (3, 3)
# ┌───────────────────┬─────────────────┬────────────┐
# │ td                ┆ tm              ┆ dt         │
# │ ---               ┆ ---             ┆ ---        │
# │ str               ┆ str             ┆ str        │
# ╞═══════════════════╪═════════════════╪════════════╡
# │ -P1DT42.S         ┆ 01:02:03.456789 ┆ 1999-03-01 │
# │ P13DT14H0.001001S ┆ 23:59:09.000101 ┆ 2020-05-03 │
# │ PT0S              ┆ 00:00:00.000001 ┆ 2077-07-05 │
# └───────────────────┴─────────────────┴────────────┘

df.select(pl.col("dtm","dtm_tz").dt.to_string())
# shape: (3, 2)
# ┌────────────────────────────┬──────────────────────────────────┐
# │ dtm                        ┆ dtm_tz                           │
# │ ---                        ┆ ---                              │
# │ str                        ┆ str                              │
# ╞════════════════════════════╪══════════════════════════════════╡
# │ 1980-08-10 00:10:20.000000 ┆ 1980-08-10 00:10:20.000000+05:30 │
# │ 2010-10-20 08:25:35.000000 ┆ 2010-10-20 08:25:35.000000+05:45 │
# │ 2040-12-30 16:40:50.000000 ┆ 2040-12-30 16:40:50.000000+05:45 │
# └────────────────────────────┴──────────────────────────────────┘

Durations do not support strftime; instead to_string recognises only "iso" or "polars" as a formatting string for durations - the latter allows for string output in the some form that we display in the frame repr:

df = pl.DataFrame(
    {
        "td1": [
            timedelta(days=-1),
            timedelta(days=0),
        ],
        "td2": [
            timedelta(weeks=4, days=-10, microseconds=999999),
            timedelta(weeks=20, days=20, hours=13, minutes=45, seconds=22, microseconds=134455),
        ],
    },
    schema={"td1": pl.Duration("ms"), "td2": pl.Duration("ns")},
)

df.select(
    pl.all().dt.to_string("iso").name.suffix("(iso)"),
    pl.all().dt.to_string("polars").name.suffix("(pl)"),
)
# shape: (2, 4)
# ┌──────────┬────────────────────────┬─────────┬───────────────────────────┐
# │ td1(iso) ┆ td2(iso)               ┆ td1(pl) ┆ td2(pl)                   │
# │ ---      ┆ ---                    ┆ ---     ┆ ---                       │
# │ str      ┆ str                    ┆ str     ┆ str                       │
# ╞══════════╪════════════════════════╪═════════╪═══════════════════════════╡
# │ -P1DT0.S ┆ P18DT0.999999S         ┆ -1d     ┆ 18d 999999µs              │
# │ PT0S     ┆ P160DT13H45M22.134455S ┆ 0ms     ┆ 160d 13h 45m 22s 134455µs │
# └──────────┴────────────────────────┴─────────┴───────────────────────────┘

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Nov 8, 2024
@alexander-beedie alexander-beedie changed the title feat: Support use of Duration dtype in to_string feat: Support use of Duration dtype in to_string, ergonomic improvement, tz-aware Datetime bugfix Nov 8, 2024
@alexander-beedie alexander-beedie changed the title feat: Support use of Duration dtype in to_string, ergonomic improvement, tz-aware Datetime bugfix feat: Support use of Duration dtype in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix Nov 8, 2024
Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice one, just left some comments!

i also notice that this does about 2-3 things together - is it possible to do any of them separately (like the tz-aware datetime bugfix)?

py-polars/polars/expr/datetime.py Show resolved Hide resolved
crates/polars-core/src/chunked_array/temporal/duration.rs Outdated Show resolved Hide resolved
crates/polars-core/src/fmt.rs Outdated Show resolved Hide resolved
crates/polars-core/src/fmt.rs Outdated Show resolved Hide resolved
@alexander-beedie alexander-beedie force-pushed the duration-to-string branch 2 times, most recently from 9f6c6f3 to 646cbf6 Compare November 8, 2024 14:42
@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Nov 8, 2024

i also notice that this does about 2-3 things together - is it possible to do any of them separately (like the tz-aware datetime bugfix)?

The bug is fixed as a side-effect of explicitly/centrally declaring the right ISO strings, and everything else is connected, so there isn't really anything to separate out 😅

@alexander-beedie alexander-beedie force-pushed the duration-to-string branch 2 times, most recently from 9949b06 to 546fb33 Compare November 8, 2024 14:51
@alexander-beedie alexander-beedie added the performance Performance issues or improvements label Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Duration => Utf8 cast; currently returns stringified physical (Int64) representation
2 participants