docs: Specify the semantics of empty Series aggregations #19739
@@ -134,3 +134,42 @@ This means that if you were to use a `lambda` or a custom Python function to app
Polars will try to parallelize the computation of the aggregating functions over the groups, so it is recommended that you avoid using `lambda`s and custom Python functions as much as possible.
Instead, try to stay within the realm of the Polars expression API.
This is not always possible, though, so if you want to learn more about using `lambda`s you can go [the user guide section on using user-defined functions](user-defined-python-functions.md).

## Behavior with empty `Series` |
Preferred:
- ## Behavior with empty `Series`
+ ## Aggregations on empty series
But this might do as well:
- ## Behavior with empty `Series`
+ ## Behavior with empty series
Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series. |
(A sentence per line is good because it makes diffs cleaner and makes it easier to review the docs.)
“Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL ”
The table shows results for aggregations computed on empty series.
What do empty series have to do with series that contain `null` values?
You used the word “semantics” 4 times in the first 3 sentences and that's quite a heavy word for a user-friendly user guide.
Here's a possible rewrite in simpler English:
When computing aggregations on empty series, Polars tries to follow set theory and Python's behaviour.
This differs from SQL for some operations: for example, `pl.Series([], pl.Int32).sum()` is equal to 0 in Polars but it is `NULL` in SQL.
Below we provide an overview of all aggregations and the return value when performed on an empty series.
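To make the contrast concrete, here is a minimal sketch using only the Python standard library. It assumes SQLite as a stand-in for "an SQL engine", and the table name `t` and column `x` are hypothetical. It shows Python's own reductions on an empty sequence next to the SQL result on an empty table.

```python
import sqlite3

# Python's built-in reductions on an empty sequence return the identity element.
empty = []
python_sum = sum(empty)  # additive identity
python_all = all(empty)  # vacuously true
python_any = any(empty)  # no element is truthy

# SQLite is used here only as an example SQL engine (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")  # hypothetical table, left empty
sql_sum, sql_count = conn.execute("SELECT SUM(x), COUNT(x) FROM t").fetchone()
conn.close()

print(python_sum, python_all, python_any)  # 0 True False
print(sql_sum, sql_count)                  # None 0
```

SQL's `NULL` arrives in Python as `None`, so the difference described above is between `0` (Python) and `None` (SQL) for the sum, while the count is `0` in both.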
| `first` | `null` |
| `last` | `null` |
| `quantile` | `null` |
| `get` | n/a |
I can't find the method `get` on a series:
>>> import polars as pl
>>> pl.Series([]).get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Series' object has no attribute 'get'. Did you mean: 'ge'?
| Aggregation | Empty Series return value | |
All of the `null`s in this table are actually `None`, aren't they?
And `true` should be `True` and `false` should be `False`.
| `get` | n/a |
| `count` | `0` |
| `len` | `0` |
| `implode` | `[ ]` |
- | `implode` | `[ ]` |
+ | `implode` | `[]` |
If you don't agree with my other subjective criticism on this paragraph, at least a couple of adjustments need to be made to fix typos and for consistency with the remainder of the docs:
(Again, one sentence / line would make it easier to review my suggested changes.)
- Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.
+ Polars tries to follow aggregation semantics that match closely [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and Python semantics. This means that we might differ from SQL for operations on empty series. For example, `pl.Series([], dtype=pl.Int32).sum()` is equal to `0` in Polars, but it would be a missing value or `NULL` in SQL. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than those that would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.
Or, “but it should be `None` if we followed SQL (semantics)”.
To build the docs just run |
Most of the Python extension packages that we are using are broken on NixOS. I have tried to set it up before and I kind of gave up.
@rodrigogiraoserrao I cannot really build the docs on my laptop. Could you do a grammar and spelling check and verify that this renders correctly?