Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: to_json_string (alternatively, to_json, but that could get confusing with the JSON type) #9542

Open
1 task done
tswast opened this issue Jul 9, 2024 · 4 comments
Open
1 task done
Labels
feature Features or general enhancements

Comments

@tswast
Copy link
Collaborator

tswast commented Jul 9, 2024

Is your feature request related to a problem?

Some DB engines provide a TO_JSON_STRING or TO_JSON method to get a JSON serialization of an arbitrary value.

What is the motivation behind your request?

As seen in #9470, it can be useful to get a string representation of types, where cast to string is either ambiguous or not supported.

BigQuery DataFrames uses TO_JSON_STRING for the same reason (fallback for types that don't support cast to string) as well as for interop with extensions such as passing rows to Remote Functions.

Describe the solution you'd like

Value.to_json_string() would make sense to me. I would avoid Value.to_json(), as I would expect that to return the JSON type in engines that support it.

What version of ibis are you running?

8.x, working on 9.x upgrade

What backend(s) are you using, if any?

BigQuery

Code of Conduct

  • I agree to follow this project's Code of Conduct
@tswast tswast added the feature Features or general enhancements label Jul 9, 2024
@deepyaman
Copy link
Contributor

Seems like a good thing to add that wouldn't be particularly contentious, especially given it's already supported by multiple backends!

@jcrist
Copy link
Member

jcrist commented Aug 7, 2024

Following #9788, I think we might be better served by an as_* prefix here, so as_json_string().

@jcrist
Copy link
Member

jcrist commented Aug 14, 2024

Looking more into this, IIUC semantically this should be the same as:

t.some_col.cast("json").cast("string")

If that's correct, I wonder if we could just stick with that spelling as a user-facing API. For backends that implement an optimized function (and may not handle this in their optimizer themselves?) we could always use a simple rewrite rule to compile to a specific one-call version like bigquery's TO_JSON_STRING.

@tswast
Copy link
Collaborator Author

tswast commented Sep 13, 2024

BigQuery doesn't support casting to JSON directly.

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT CAST(some_numbers AS JSON) FROM Sequences

fails with Invalid cast from ARRAY<INT64> to JSON at [7:13]

But, I can do this:

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT TO_JSON(some_numbers) FROM Sequences

Similarly, I can't cast from JSON to STRING.

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT CAST(TO_JSON(some_numbers) AS STRING) FROM Sequences

fails with Invalid cast from JSON to STRING at [7:13].

So, the SQL for t.some_col.cast("json").cast("string") without any rewrites would be

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT TO_JSON_STRING(TO_JSON(some_numbers)) FROM Sequences

Comparing the previous SQL with one that omits TO_JSON, BQ uses about 1/2 the "slots" with this SQL (eyeballing based on about 5 query runs each):

WITH
  Sequences AS (
    SELECT [0, 1, 1, 2, 3, 5] AS some_numbers UNION ALL
    SELECT [2, 4, 8, 16, 32] UNION ALL
    SELECT [5, 10]
  )
SELECT TO_JSON_STRING(some_numbers) FROM Sequences

I suspect the reason for the difference is that the JSON type in BQ is optimized for flexible schemas and lookups, so it is a more expensive type to build than just going straight to a JSON string.

That's all to say, a specific optimization for t.some_col.cast("json").cast("string") seems like it could be worthwhile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Status: backlog
Development

No branches or pull requests

3 participants