docs: Specify the semantics of empty Series aggregations #19739

Open
wants to merge 1 commit into
base: main

Conversation

coastalwhite
Collaborator

@rodrigogiraoserrao I cannot really build the docs on my laptop. Could you do a grammar and spelling check and verify that this renders correctly?

@github-actions github-actions bot added documentation Improvements or additions to documentation python Related to Python Polars rust Related to Rust Polars labels Nov 12, 2024
@@ -134,3 +134,42 @@ This means that if you were to use a `lambda` or a custom Python function to app
Polars will try to parallelize the computation of the aggregating functions over the groups, so it is recommended that you avoid using `lambda`s and custom Python functions as much as possible.
Instead, try to stay within the realm of the Polars expression API.
This is not always possible, though, so if you want to learn more about using `lambda`s you can go [the user guide section on using user-defined functions](user-defined-python-functions.md).

## Behavior with empty `Series`
Collaborator

Preferred:

Suggested change
- ## Behavior with empty `Series`
+ ## Aggregations on empty series

But this might do as well:

Suggested change
- ## Behavior with empty `Series`
+ ## Behavior with empty series


## Behavior with empty `Series`

Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.
Collaborator

(A sentence per line is good because it makes diffs cleaner and makes it easier to review the docs.)

“Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine.”

The table shows results for aggregations computed on empty series.
What do empty series have to do with series that contain null values?
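
One possible link, offered as an assumption rather than as the author's stated intent: aggregations usually skip null values, so a group whose values are all null leaves nothing to aggregate and behaves like an empty series. The sketch below uses only the Python standard library and does not call Polars. `sqlite3` stands in for "an SQL engine", and the table `t` with columns `g` and `x` is made up for illustration.

```python
import sqlite3

# Hypothetical data: group "b" contains only NULL values.
rows = [("a", 1), ("a", 2), ("b", None)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (g TEXT, x INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows)

# SQL: SUM skips NULLs and yields NULL when no non-NULL value is left.
sql_sums = dict(con.execute("SELECT g, SUM(x) FROM t GROUP BY g ORDER BY g"))
con.close()

# Python-style: drop the missing values per group, then sum what remains.
# The sum of an empty list is 0.
values_by_group = {}
for g, x in rows:
    values_by_group.setdefault(g, [])
    if x is not None:
        values_by_group[g].append(x)
py_sums = {g: sum(xs) for g, xs in values_by_group.items()}

print(sql_sums)  # {'a': 3, 'b': None}
print(py_sums)   # {'a': 3, 'b': 0}
```

If this is the intended connection, the paragraph should probably say so explicitly.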


## Behavior with empty `Series`

Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.
Collaborator

You used the word “semantics” 4 times in the first 3 sentences and that's quite a heavy word for a user-friendly user guide.
Here's a possible rewrite in simpler English:

When computing aggregations on empty series, Polars tries to follow set theory and Python's behaviour.
This differs from SQL for some operations: for example, `pl.Series([], dtype=pl.Int32).sum()` is equal to `0` in Polars but it is `NULL` in SQL.
Below we provide an overview of all aggregations and the return value when performed on an empty series.
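
For reference, a small sketch of the two behaviours being compared, using only the Python standard library. `sqlite3` stands in for SQL here (an assumption; other engines may differ in details), and the table name `t` is made up. It shows Python's built-ins, not Polars itself.

```python
import math
import sqlite3

empty = []

# Python's built-ins on an empty list.
py_results = {
    "sum": sum(empty),                # additive identity
    "prod": math.prod(empty),         # multiplicative identity
    "len": len(empty),
    "all": all(empty),                # vacuously true
    "any": any(empty),
    "max": max(empty, default=None),  # max([]) without a default raises ValueError
}

# The same kind of aggregations over an empty table in SQLite.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
sql_sum, sql_count, sql_max = con.execute(
    "SELECT SUM(x), COUNT(x), MAX(x) FROM t"
).fetchone()
con.close()

print(py_results)
print(sql_sum, sql_count, sql_max)  # None 0 None
```

Python returns the identity element where one exists (`0` for `sum`, `1` for `prod`, `True` for `all`), while SQL returns `NULL` for `SUM` and `MAX` and only `COUNT` gives `0`.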

| `first` | `null` |
| `last` | `null` |
| `quantile` | `null` |
| `get` | n/a |
Collaborator

I can't find the method `get` on a series:

>>> import polars as pl
>>> pl.Series([]).get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Series' object has no attribute 'get'. Did you mean: 'ge'?


Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.

| Aggregation | Empty Series return value |
Collaborator

All of the `null`s in this table are actually `None`, aren't they?
And `true` should be `True` and `false` should be `False`.

| `get` | n/a |
| `count` | `0` |
| `len` | `0` |
| `implode` | `[ ]` |
Collaborator

Suggested change
- | `implode` | `[ ]` |
+ | `implode` | `[]` |


## Behavior with empty `Series`

Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.
Collaborator

If you don't agree with my other subjective criticism on this paragraph, at least a couple of adjustments need to be made to fix typos and for consistency with the remainder of the docs:
(Again, one sentence / line would make it easier to review my suggested changes.)

Suggested change
- Polars tries to follow aggregation semantics that match closely with [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and python semantics. This means that we might differ from SQL semantics for operations on operations on empty Series. For example, `pl.Series([], pl.Int32).sum()` is equal to `0` in Polars, where it would be a missing value or `NULL` when following SQL semantics. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.
+ Polars tries to follow aggregation semantics that match closely [set theory](https://en.wikipedia.org/wiki/Empty_set#Operations_on_the_empty_set) and Python semantics. This means that we might differ from SQL for operations on empty series. For example, `pl.Series([], dtype=pl.Int32).sum()` is equal to `0` in Polars, but it would be a missing value or `NULL` in SQL. Consequently, `.group_by().agg()` on columns with `null` values might result in different results than those that would be given by an SQL engine. Below, we provide an overview of all aggregations and the return value when performed on an empty series.

Or, “but it should be `None` if we followed SQL (semantics)”.

@rodrigogiraoserrao
Collaborator

To build the docs, just run `mkdocs serve` from the root of the repo.
What do you get when you run that command?

@coastalwhite
Collaborator Author

Most of the Python extension packages that we are using are broken on NixOS. I have tried to set it up before and I kind of gave up.
