Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add first_value and last_value metrics with and "ignore nulls" option for use in a window #9539

Open
1 task done
tswast opened this issue Jul 8, 2024 · 7 comments
Labels
feature Features or general enhancements

Comments

@tswast
Copy link
Collaborator

tswast commented Jul 8, 2024

Is your feature request related to a problem?

I would like to implement an efficient, pandas-compatible bfill and ffill in BigQuery DataFrames without custom operators.

What is the motivation behind your request?

To implement bfill and ffill in BigQuery DataFrames (see: #3652 for an issue requesting bfill and ffill in ibis), we add two new operations to Ibis: FirstNonNullValue and LastNonNullValue.

This is then used like FirstNonNullValue(column).over(window), which when the window is ibis.range_window(preceding=0, following=limit), implements a pandas-like bfill with limit parameter support.

Likewise, use LastNonNullValue(column).over(window) with a window of ibis.range_window(preceding=limit, following=0) to implement a pandas-like ffill with limit parameter support.

Describe the solution you'd like

I propose the following two new "metrics", which require a window to operate:

value.first(ignore_nulls=True).over(window)
value.last(ignore_nulls=True).over(window)

Related: IGNORE NULLS in general is tracked here: #7649

Previously, with string-based compilation, this was accomplished with a custom BigQuery operator like:

https://github.com/googleapis/python-bigquery-dataframes/blob/74170dabd323f1b08ad76241e37ff9f2a5b67ab5/third_party/bigframes_vendored/ibis/backends/bigquery/registry.py#L15-L17

In SQLGlot, it does appear there is a First expression with at least some IGNORE NULLS support. My work is in-progress to see if this operator is compatible with the BigQuery dialect.

What version of ibis are you running?

8.x, but working on an upgrade to 9.x right now.

What backend(s) are you using, if any?

BigQuery

Code of Conduct

  • I agree to follow this project's Code of Conduct
@tswast tswast added the feature Features or general enhancements label Jul 8, 2024
@tswast
Copy link
Collaborator Author

tswast commented Jul 8, 2024

In SQLGlot, it does appear there is a First expression with at least some IGNORE NULLS support. My work is in-progress to see if this operator is compatible with the BigQuery dialect.

I see also FirstValue which according to the comment is specifically for windowed contexts. Neither is parametrized, so I'm not sure about IGNORE NULLS support yet.

@tswast
Copy link
Collaborator Author

tswast commented Jul 8, 2024

Filed tobymao/sqlglot#3749 as I am having trouble figuring out how to create the SQLGlot expression. Other examples I found in the compiler use self.agg.avg(...), but I don't think that would account for the IGNORE NULLS part.

@tswast
Copy link
Collaborator Author

tswast commented Jul 9, 2024

Update:

    def visit_FirstNonNullValue(self, op, *, arg):
        return sge.IgnoreNulls(this=sge.FirstValue(this=arg))

This works for the bfill implementation in BigQuery DataFrames. I imagine it should work in most SQLGlot backends. Probably makes sense to make IgnoreNulls a parameter in ibis.

@deepyaman
Copy link
Contributor

Most of the team is at SciPy through the end of this week. This makes sense to me, but would love to get thoughts from @cpcloud and the other maintainers.

@cpcloud
Copy link
Member

cpcloud commented Jul 17, 2024

We should definitely add this functionality, but likely as a flag on the ops that support, which is more than just first/last, so that we don't end up with a set of classes duplicated only to allow them to set ignore_nulls.

@tswast Any chance someone from the BigFrames team wants to take a crack at it?

@jcrist
Copy link
Member

jcrist commented Jul 17, 2024

Note that recently we made first/last/collect uniformly ignore nulls in non-window expressions (previously it was backend-dependent). Another option is to make this also true when used in window expressions, which would both improve uniformity and avoid the need for a new flag.

@cpcloud
Copy link
Member

cpcloud commented Jul 17, 2024

🤔 Perhaps we only expose ignore_nulls on lag/lead which is what an implementation of bfill/ffill would be based on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Status: backlog
Development

No branches or pull requests

4 participants