[PECO-239] Apache Arrow support #94
Conversation
…build without errors Signed-off-by: Levko Kravets <[email protected]>
Before merging this I'd like to review the interfaces and tests with @rcypher-databricks and the Go driver.
…ke Arrow result handler output compatible with JSON handler Signed-off-by: Levko Kravets <[email protected]>
A few surprises here.
Missing:
- tests that confirm that Arrow on or off does not change the result (see the sketch below);
- tests to ensure that the pagination logic still works with Arrow.
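For illustration only (not code from this PR): a rough sketch of the kind of test being requested, assuming Mocha/Chai-style tests. The fetchAll helper and the enableArrow/maxRows option names are placeholders, not the driver's actual API.

import { expect } from 'chai';

// Hypothetical helper: open a session, execute `query` with the given options,
// fetch every row, and clean up. This is assumed glue code, not the real driver API.
declare function fetchAll(query: string, options: Record<string, unknown>): Promise<unknown[]>;

// Mocha globals (`describe`, `it`) are assumed to be available.
describe('Arrow result handling', () => {
  it('returns the same rows with Arrow enabled and disabled', async () => {
    const query = 'SELECT * FROM range(0, 1000)';
    const withArrow = await fetchAll(query, { enableArrow: true });
    const withoutArrow = await fetchAll(query, { enableArrow: false });
    // The result set must not depend on whether Arrow is used under the hood
    expect(withArrow).to.deep.equal(withoutArrow);
  });

  it('handles pagination across multiple Arrow batches', async () => {
    const rows = await fetchAll('SELECT * FROM range(0, 100000)', {
      enableArrow: true,
      maxRows: 1000, // small fetch size to force several round trips
    });
    expect(rows).to.have.length(100000);
  });
});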
e923815 to 077913a (Compare)
077913a to c2dab38 (Compare)
@andrefurlan-db I updated the tests so they cover handling of most data types. I also added a separate test to make sure that a response with multiple Arrow batches is handled as it should be. Please take a look. Thank you!
Almost there! Added a few comments.
}
if (value instanceof MapRow) {
  const result = value.toJSON();
  // Map type consists of its key and value types. We need only value type here, key will be cast to string anyway
Are you sure? I thought map[any]any was the type definition. So you could have a struct as a key, no?
This is for compatibility with column results: maps are JSON-serialized as plain objects, which can only have string keys. We could try to use the JS built-in Map class to represent maps, but for column results it would probably have some limitations.
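To illustrate the constraint being discussed (generic TypeScript, not code from this PR): a plain object coerces every key to a string, while the built-in Map can hold arbitrary keys but would not match the shape produced by the column-based (JSON) result handler.

// A plain object can only have string keys, so a complex map key has to be
// serialized (e.g. JSON-stringified) before it can become a property name.
const asPlainObject: Record<string, string> = {
  [JSON.stringify({ s: 'hello' })]: 'value',
};

// A built-in Map could keep the original key objects, but its shape would
// differ from what the JSON-serialized column results produce.
const asMap = new Map<{ s: string }, string>([[{ s: 'hello' }, 'value']]);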
Have you run any tests with a map using a complex data type as a key?
I mentioned it in the Readme in the tests/fixtures folder. In short: complex types used as map keys produce malformed JSON (similar to date/time values), so they can only be handled with Arrow + native types enabled. Even in that case, since JS objects cannot have complex-type keys, the keys will remain stringified. So it will look something like this:
{
'{"s": "hello"}': ...,
'{"s": "world"}': ...,
}
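For illustration (not part of this PR): if a consumer needs the original complex keys back, they can parse the stringified keys themselves. A sketch, where row.mapColumn is a hypothetical map column on a fetched row:

// Rebuild a Map with the original complex keys from their JSON-stringified form.
declare const row: { mapColumn: Record<string, unknown> };
const restored = new Map(
  Object.entries(row.mapColumn).map(([key, value]) => [JSON.parse(key), value] as const),
);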
Known issues:
- with Arrow disabled _or_ with Arrow native types disabled:
@rcypher-databricks, what is the other serialization issue with Arrow disabled?
Those cases are the ones I stumbled upon. There's even a thread in Slack about date serialization. If you know of other similar issues, I'd like to check them as well.
The only other similar issue I ran into was dealing with intervals when arrow native types are turned on. Setting the intervalTypesAsArrow flag to false doesn't apply to intervals that are nested in complex types.
…lways know their timezone Signed-off-by: Levko Kravets <[email protected]>
… of being exposed to user Signed-off-by: Levko Kravets <[email protected]>
df61b4f to c5501da (Compare)
c5501da to 74746aa (Compare)
looks good
Stamp. I updated the PAT for e2e tests and kicked off a new execution.
PECO-239 Add Apache Arrow support

- … TSparkArrowTypes support … apache-arrow … classes)
- Arrow support is always enabled. The user can choose whether to represent complex types as JSON-encoded strings or as native Arrow objects via the options.useArrowNativeTypes option of IDBSQLSession.executeStatement (defaults to true).
- TSparkArrowTypes.intervalTypesAsArrow is not currently supported by apache-arrow (see apache/arrow#21815) and is therefore not exposed to the user. PySQL also doesn't support this option, so at least we're consistent here.
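Based on the description above, usage would look roughly like this (a sketch, not an official example; connection parameters and the query are placeholders):

import { DBSQLClient } from '@databricks/sql';

async function main() {
  const client = new DBSQLClient();
  await client.connect({ host: '…', path: '…', token: '…' });
  const session = await client.openSession();

  // Complex types come back as native objects by default (useArrowNativeTypes: true);
  // set it to false to receive them as JSON-encoded strings instead.
  const operation = await session.executeStatement(
    'SELECT map("a", named_struct("s", "hello")) AS m',
    { useArrowNativeTypes: false },
  );
  const rows = await operation.fetchAll();
  console.log(rows);

  await operation.close();
  await session.close();
  await client.close();
}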