Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Improve][Connector-file-base] In large file scenarios, split the single file into multiple shards #8507

Open
wants to merge 14 commits into
base: dev
Choose a base branch
from

Conversation

sohurdc
Copy link
Contributor

@sohurdc sohurdc commented Jan 13, 2025

In large file scenarios, split the single file into multiple shards

Purpose of this pull request

split a single orc file into multi splits. now only support orc fileformat, parquet will be added soon.

  • good:
  1. Speed ​​up reading of large files in parallel mode
  • bad:
  1. The time to read a single file will increase in single mode

Does this PR introduce any user-facing change?

yes, but user can use default options now

How was this patch tested?

i have added test case

Check list

reader.getNumberOfRows());
long rowCountPerSplit = rowCountPerSplitByUser;
if (rowCountPerSplit <= 0) {
// 按照文件大小自动分片
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments similar to Chinese can be changed to English, some of them can check for themselves

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

* @return FileSourceSplit set
*/
@Override
public Set<FileSourceSplit> getFileSourceSplits(String path) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for this great work, as you write when enable this it can mark data out of order, so can we enable this feature when user config file_size_per_split/row_count_per_split parameters? if not set, still use the original method. we can describe this feature in document.

And when use this feaute we need the file format can be quick seek to the specified offset, like rows.seekToRow method.
Now we has other file format support, like parquet, avro etc, can you help update other file format too, thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

  1. i will add an option to control whether to enable the feature.
  2. I think the text format file may not be too large. I will see how to add it later.

Copy link
Member

@Hisoka-X Hisoka-X left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sohurdc !

}

@Override
public void read(FileSourceSplit split, Collector<SeaTunnelRow> output)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be best to do some refactoring, I found that the new read method has a lot in common with the old read method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

* In theory, batch column reading can improve reading performance.
*/
@Override
public void read(FileSourceSplit split, Collector<SeaTunnelRow> output)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

wangbin and others added 10 commits January 23, 2025 16:02
…/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseSourceConfigOptions.java

Co-authored-by: Jia Fan <[email protected]>
…/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java

Co-authored-by: Jia Fan <[email protected]>
…/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java

Co-authored-by: Jia Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants