Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PECO-239] Apache Arrow support #94

Merged
merged 20 commits into from
Mar 13, 2023
Merged

Conversation

kravets-levko
Copy link
Contributor

@kravets-levko kravets-levko commented Nov 22, 2022

PECO-239 Add Apache Arrow support

  • Basic Apache arrow support
  • TSparkArrowTypes support
  • Unit tests
  • E2E tests
  • Have Arrow feature always enabled
  • Option to control if native arrow types should be used or not
  • Return Arrow result as regular javascript object (convert from internal apache-arrow classes)
    • Should be compatible with existing JSON result handler
    • If native types disabled - should parse JSON-serialized fields similarly to existing JSON result handler
    • Tests

Arrow support is always enabled. User can choose whether to represent complex types as JSON-encoded strings or as native Arrow objects by setting a options.useArrowNativeTypes option of IDBSQLSession.executeStatement (defaults to true)

TSparkArrowTypes.intervalTypesAsArrow is not currently supported by apache-arrow (see apache/arrow#21815) and therefore not exposed to user. PySQL also doesn't support this option, so at least we're consistent here

@databricks databricks deleted a comment from codecov-commenter Jan 22, 2023
Signed-off-by: Levko Kravets <[email protected]>
@andrefurlan-db
Copy link

Before merging this I'd like to review the interfaces and tests with @rcypher-databricks and Go driver.

lib/DBSQLSession.ts Outdated Show resolved Hide resolved
tests/e2e/batched_fetch.test.js Outdated Show resolved Hide resolved
@kravets-levko kravets-levko marked this pull request as draft January 25, 2023 19:51
@databricks databricks deleted a comment from codecov-commenter Feb 1, 2023
Copy link

@andrefurlan-db andrefurlan-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few surprises here.
Missing:
tests that confirms that arrow on or off do not change the result.
tests to ensure that pagination logic still works with arrow.

tests/e2e/arrow.test.js Outdated Show resolved Hide resolved
tests/e2e/arrow.test.js Outdated Show resolved Hide resolved
tests/e2e/batched_fetch.test.js Outdated Show resolved Hide resolved
Signed-off-by: Levko Kravets <[email protected]>
@databricks databricks deleted a comment from codecov-commenter Feb 12, 2023
@kravets-levko kravets-levko marked this pull request as ready for review February 12, 2023 14:19
@kravets-levko
Copy link
Contributor Author

@andrefurlan-db I updated tests so they cover most data types handling. Also, added a separate test to make sure that reponse with multiple arrow batches is also handled as it should. Please take a look. Thank you!

@databricks databricks deleted a comment from codecov-commenter Feb 12, 2023
Copy link

@andrefurlan-db andrefurlan-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there! Added a few comments.

lib/globalConfig.ts Show resolved Hide resolved
}
if (value instanceof MapRow) {
const result = value.toJSON();
// Map type consists of its key and value types. We need only value type here, key will be cast to string anyway

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure? I thought map[any]any was the type definition. So you could have a struct as key, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for compatibility with column results - maps are json-serialized as plain objects which could have only string keys. we could try to use JS built-in Map class to represent maps, but for column results it probably will have some limitations

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you run any tests with a map using complex data type as key?

Copy link
Contributor Author

@kravets-levko kravets-levko Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned it in Readme in tests/fixtures folder. In short: complex types used as map keys produce malformed json (similar to date/time values), so they could be only handled with Arrow + native types enabled. And even in this case, since JS objects cannot have complex type keys, they will remain stringified. So it will look something like this:

{ 
  '{"s": "hello"}': ..., 
  '{"s": "world"}': ..., 
}

lib/result/ArrowResult.ts Show resolved Hide resolved
lib/result/ArrowResult.ts Show resolved Hide resolved
lib/result/utils.ts Outdated Show resolved Hide resolved
lib/DBSQLSession.ts Show resolved Hide resolved
tests/fixtures/compatibility/index.js Show resolved Hide resolved

Known issues:

- with Arrow disabled _or_ with Arrow native types disabled:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rcypher-databricks , what is the other serialization issue with arrow disabled?

Copy link
Contributor Author

@kravets-levko kravets-levko Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those cases are the ones I stumbled at. there's even a thread in Slack about date serialization. if you know other similar issues - I'd like to check them as well.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other similar issue I ran into was dealing with intervals when arrow native types are turned on. Setting the intervalTypesAsArrow flag to false doesn't apply to intervals that are nested in complex types.

…lways know their timezone

Signed-off-by: Levko Kravets <[email protected]>
@kravets-levko kravets-levko changed the title (WIP) [PECO-239] Apache Arrow support [PECO-239] Apache Arrow support Mar 13, 2023
Copy link

@andrefurlan-db andrefurlan-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

Copy link

@susodapop susodapop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamp. I updated the PAT for e2e tests and kicked-off a new execution.

@databricks databricks deleted a comment from codecov-commenter Mar 13, 2023
@kravets-levko kravets-levko merged commit 0332513 into main Mar 13, 2023
@kravets-levko kravets-levko deleted the PECO-239-apache-arrow-support branch March 13, 2023 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants