[DRAFT] [DEMO] [SPARK] Use explicit Rowset time formatters for improving RowSet generation #5815

bowenliang123 · 2023-12-04T14:41:56Z

🔍 Description

Issue References 🔗

Subtask of #5808.

Describe Your Solution 🔧

Types of changes 🔖

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Related Unit Tests

Checklists

📝 Author Self Checklist

My code follows the style guidelines of this project
I have performed a self-review
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
This patch was not authored or co-authored using Generative Tooling

📝 Committer Pre-Merge Checklist

Be nice. Be informative.

bowenliang123 · 2023-12-04T14:45:32Z

It seems that performance for the primitive data types (with skip jump, shown in the red block) is lower, while performance for the fall-through data types (shown in the green block: Decimals, Date, etc.) is improved.

cc @wForget @yaooqinn @pan3793

WDYT?

wForget · 2023-12-04T15:24:53Z

> It seems that performance for the primitive data types (with skip jump, shown in the red block) is lower, while performance for the fall-through data types (shown in the green block: Decimals, Date, etc.) is improved.

I don't see any changes related to primitive data types. Can you increase the number of rows and try a few more times?

codecov-commenter · 2023-12-04T16:27:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (25eb53d) 61.36% compared to head (2a84cbb) 61.41%.

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #5815      +/-   ##
============================================
+ Coverage     61.36%   61.41%   +0.04%     
  Complexity       23       23              
============================================
  Files           608      608              
  Lines         35941    35944       +3     
  Branches       4940     4940              
============================================
+ Hits          22056    22075      +19     
+ Misses        11494    11482      -12     
+ Partials       2391     2387       -4


pan3793 · 2023-12-05T05:41:05Z

@bowenliang123 I guess the overhead may come from always constructing the time formatters via HiveResult.getTimeFormatters.

bowenliang123 · 2023-12-05T05:54:07Z

> @bowenliang123 I guess the overhead may come from always constructing the time formatters via HiveResult.getTimeFormatters.

Yes, in most cases, as the evidence shows. But the overhead for primitive types should not come from the data types, since the time formatters are reused throughout the whole stage of TRowSet generation.
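To make the two patterns under discussion concrete, here is a minimal Java sketch contrasting a formatter constructed for every value with one built once and passed in explicitly. It is an illustration only: the class and method names are hypothetical and are not taken from Kyuubi or Spark, and it uses the standard `java.time.format.DateTimeFormatter` rather than `HiveResult.getTimeFormatters`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Kyuubi or Spark.
final class FormatterReuseSketch {

    // Per-value construction: a new formatter is built for every cell.
    static List<String> formatPerValue(List<LocalDate> dates) {
        List<String> out = new ArrayList<>(dates.size());
        for (LocalDate date : dates) {
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
            out.add(formatter.format(date));
        }
        return out;
    }

    // Explicit formatter: the caller builds it once and passes it in,
    // so it is reused for every cell of the row set.
    static List<String> formatWithShared(List<LocalDate> dates, DateTimeFormatter formatter) {
        List<String> out = new ArrayList<>(dates.size());
        for (LocalDate date : dates) {
            out.add(formatter.format(date));
        }
        return out;
    }

    public static void main(String[] args) {
        List<LocalDate> dates = List.of(LocalDate.of(2023, 12, 4), LocalDate.of(2023, 12, 5));
        DateTimeFormatter shared = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        System.out.println(formatPerValue(dates));
        System.out.println(formatWithShared(dates, shared));
    }
}
```

Both methods produce the same strings; they differ only in how often a formatter is constructed. `DateTimeFormatter` is immutable and thread-safe, so sharing one instance is safe.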

We still need more evidence from repeated tests. Please consider merging #5809 to introduce a benchmark with warm-ups and repeated rounds, as a reference for further investigation.
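As a rough illustration of what "warm-ups and repeated rounds" means in practice, the following Java sketch runs a task several times untimed before collecting timed samples and reporting a median. It is a hand-rolled assumption for illustration only: it is neither the benchmark proposed in #5809 nor Spark's benchmark utility, and all names in it are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of a warm-up plus repeated-rounds measurement loop.
final class WarmupBenchmarkSketch {

    // Runs the task warmupRounds times without timing it, then measuredRounds
    // times with timing. Returns the elapsed nanoseconds of each measured round.
    static long[] measure(Runnable task, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            task.run();
        }
        long[] nanos = new long[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    // Median of the samples (the upper middle element when the count is even).
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        Runnable task = () -> {
            sink.setLength(0);
            for (int i = 0; i < 100_000; i++) {
                sink.append(i);
            }
        };
        long[] samples = measure(task, 5, 10);
        System.out.println("median ns per round: " + median(samples));
    }
}
```

Reporting a median over several rounds reduces the influence of a single slow or fast run, which is the concern raised about drawing conclusions from one measurement.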

wForget · 2023-12-05T12:08:01Z

> We still need more evidence from repeated tests. Please consider merging #5809 to introduce a benchmark with warm-ups and repeated rounds, as a reference for further investigation.

I think this is an obvious improvement. The benchmark unit test is controversial and not required; we can remove it first and continue reviewing the PR.

bowenliang123 · 2023-12-05T14:20:48Z

Closing this PR, as there is not enough consensus on its purpose, design, changes, and approach.

bowenliang123 added 2 commits December 4, 2023 17:42

add benchmark ut for Spark TRowSet generation

3619589

explicit time formatters

f35ba84

github-actions bot added the module:spark label Dec 4, 2023

update

57285b4

use spark benchmark

9e863c2

github-actions bot added the kind:build label Dec 4, 2023

add benchmark for all types

2a84cbb

bowenliang123 closed this Dec 5, 2023
