[Improve][Connector-file-base] In large file scenarios, split the single file into multiple shards #8507
base: dev
reader.getNumberOfRows());
long rowCountPerSplit = rowCountPerSplitByUser;
if (rowCountPerSplit <= 0) {
// 按照文件大小自动分片 (automatically split by file size)
Comments written in Chinese, like this one, should be changed to English. Please check the rest of the changes for similar comments yourself.
ok
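The fallback in the fragment above (no positive user value, so split by file size) could look roughly like the following. This is only a sketch under assumptions: `SplitSizing`, `resolveRowCountPerSplit` and `targetBytesPerSplit` are hypothetical names chosen for illustration, not the PR's actual code; only `rowCountPerSplitByUser` appears in the diff.

```java
public final class SplitSizing {
    private SplitSizing() {}

    /**
     * Returns the user's row count per split when it is positive. Otherwise
     * estimates one from the file size so that each split covers roughly
     * targetBytesPerSplit bytes (hypothetical parameter).
     */
    public static long resolveRowCountPerSplit(
            long rowCountPerSplitByUser,
            long totalRows,
            long fileSizeBytes,
            long targetBytesPerSplit) {
        if (rowCountPerSplitByUser > 0) {
            return rowCountPerSplitByUser;
        }
        if (totalRows <= 0 || fileSizeBytes <= 0 || targetBytesPerSplit <= 0) {
            // Nothing to estimate from: keep the whole file in one split.
            return Math.max(totalRows, 1L);
        }
        // Ceiling division: number of splits needed to cover the file.
        long splitCount = (fileSizeBytes + targetBytesPerSplit - 1) / targetBytesPerSplit;
        // Ceiling division again so that splitCount splits cover every row.
        long rowsPerSplit = (totalRows + splitCount - 1) / splitCount;
        return Math.max(rowsPerSplit, 1L);
    }
}
```

For example, a 1024-byte file with 1000 rows and a 256-byte target gives 4 splits of 250 rows each. Real ORC files have uneven row sizes, so a size-based estimate like this is approximate.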
* @return FileSourceSplit set
*/
@Override
public Set<FileSourceSplit> getFileSourceSplits(String path) {
Hi, thanks for this great work. As you wrote, enabling this can make the data come out of order, so can we enable this feature only when the user configures the file_size_per_split/row_count_per_split parameters? If they are not set, still use the original method. We can describe this feature in the documentation.
Also, to use this feature the file format must support seeking quickly to a specified offset, as with the rows.seekToRow method.
We now support other file formats, like parquet, avro, etc. Can you help update those formats too? Thanks.
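A minimal sketch of the gating suggested here, assuming a single file is described by its total row count: when the row-count option is unset (non-positive), the planner returns one range for the whole file, which matches the original one-split-per-file behaviour. `RowRangePlanner`, `RowRange` and `plan` are hypothetical names for illustration and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

public final class RowRangePlanner {
    private RowRangePlanner() {}

    /** Half-open row range [startRow, endRow) inside one file. */
    public static final class RowRange {
        public final long startRow;
        public final long endRow;

        public RowRange(long startRow, long endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    /**
     * Plans the row ranges for one file. A non-positive rowCountPerSplit means
     * the option was not configured, so the file stays in a single range.
     */
    public static List<RowRange> plan(long totalRows, long rowCountPerSplit) {
        List<RowRange> ranges = new ArrayList<>();
        if (rowCountPerSplit <= 0 || rowCountPerSplit >= totalRows) {
            // Option unset, or the file is small: one range for the whole file.
            ranges.add(new RowRange(0, Math.max(totalRows, 0)));
            return ranges;
        }
        for (long start = 0; start < totalRows; start += rowCountPerSplit) {
            long end = Math.min(start + rowCountPerSplit, totalRows);
            ranges.add(new RowRange(start, end));
        }
        return ranges;
    }
}
```

Each range would then be read by seeking to `startRow` first, which is why the format needs a cheap seek such as the seekToRow call mentioned above. Ranges handed to parallel readers can finish in any order, which is the out-of-order concern raised in the comment.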
OK.
- I will add an option to control whether the feature is enabled.
- I think text format files are unlikely to be very large. I will look into how to add support for them later.
Thanks @sohurdc !
...main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseSourceConfigOptions.java
Outdated
.../main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java
Outdated
.../main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java
Outdated
}

@Override
public void read(FileSourceSplit split, Collector<SeaTunnelRow> output)
It would be best to do some refactoring here: the new read method has a lot in common with the old read method.
ok
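One possible shape for that refactoring, sketched under assumptions: treat the old whole-file read as a range read over every row, so both paths share one loop. `SharedReadSketch`, `SeekableRows`, `readRange`, `readAll` and `inMemory` are hypothetical names, and the in-memory reader stands in for a real format reader with a seekToRow-style call.

```java
import java.util.List;
import java.util.function.Consumer;

public final class SharedReadSketch {
    private SharedReadSketch() {}

    /** Hypothetical stand-in for a reader that can jump to a given row. */
    public interface SeekableRows<T> {
        long rowCount();

        void seekToRow(long row);

        T next();
    }

    /** Shared core: emits the rows in the half-open range [startRow, endRow). */
    public static <T> void readRange(
            SeekableRows<T> rows, long startRow, long endRow, Consumer<T> output) {
        long start = Math.max(startRow, 0);
        long end = Math.min(endRow, rows.rowCount());
        if (start >= end) {
            return; // Empty or out-of-bounds range: nothing to emit.
        }
        rows.seekToRow(start);
        for (long i = start; i < end; i++) {
            output.accept(rows.next());
        }
    }

    /** Whole-file read expressed as a range read over every row. */
    public static <T> void readAll(SeekableRows<T> rows, Consumer<T> output) {
        readRange(rows, 0, rows.rowCount(), output);
    }

    /** In-memory reader used only to exercise the sketch. */
    public static <T> SeekableRows<T> inMemory(List<T> data) {
        return new SeekableRows<T>() {
            private int position = 0;

            @Override
            public long rowCount() {
                return data.size();
            }

            @Override
            public void seekToRow(long row) {
                position = (int) row;
            }

            @Override
            public T next() {
                return data.get(position++);
            }
        };
    }
}
```

With this shape, a split-aware read only has to supply its own start and end rows, while row conversion and output stay in one place.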
* In theory, batch column reading can improve reading performance.
*/
@Override
public void read(FileSourceSplit split, Collector<SeaTunnelRow> output)
ditto
… big file read scene.
… for big file read scene.
…gle file into multiple shards
…/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseSourceConfigOptions.java Co-authored-by: Jia Fan <[email protected]>
…/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java Co-authored-by: Jia Fan <[email protected]>
…/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java Co-authored-by: Jia Fan <[email protected]>
9d61d4c to 71e9c36
In large file scenarios, split the single file into multiple shards
Purpose of this pull request
Split a single ORC file into multiple splits. Currently only the ORC file format is supported; Parquet will be added soon.
Does this PR introduce any user-facing change?
Yes, but users can keep the default options for now.
How was this patch tested?
I have added a test case.
Check list
New License Guide
release-note