[Improve][Connector-file-base] In large file scenarios, split the single file into multiple shards #8507

sohurdc · 2025-01-13T02:58:36Z

In large file scenarios, split the single file into multiple shards

Purpose of this pull request

Split a single ORC file into multiple splits. Only the ORC file format is supported for now; Parquet support will be added soon.

Pros:

Reading large files is faster in parallel mode.

Cons:

Reading a single file takes longer in single (non-parallel) mode.
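
To make the idea concrete, here is a minimal sketch of row-based split planning for one file. It is not the code from this PR: the `SplitPlanner`, `RowRange`, and `planSplits` names are hypothetical, and the real change produces `FileSourceSplit` objects inside `OrcReadStrategy`. The sketch only assumes that the total row count of the file is known up front.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PR's implementation:
// cut one file into half-open row ranges that readers can process in parallel.
class SplitPlanner {

    // Half-open row range [startRow, endRow) handled by one reader.
    static final class RowRange {
        final long startRow;
        final long endRow;

        RowRange(long startRow, long endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }

        @Override
        public String toString() {
            return "[" + startRow + ", " + endRow + ")";
        }
    }

    // Returns consecutive ranges of at most rowCountPerSplit rows covering [0, totalRows).
    static List<RowRange> planSplits(long totalRows, long rowCountPerSplit) {
        if (rowCountPerSplit <= 0) {
            throw new IllegalArgumentException("rowCountPerSplit must be positive");
        }
        List<RowRange> ranges = new ArrayList<>();
        long start = 0;
        while (start < totalRows) {
            // The last range may be shorter than rowCountPerSplit.
            long end = start + Math.min(rowCountPerSplit, totalRows - start);
            ranges.add(new RowRange(start, end));
            start = end;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Prints [[0, 4), [4, 8), [8, 10)]
        System.out.println(planSplits(10, 4));
    }
}
```

Because each range is independent, rows from different ranges may reach the sink in a different order than they appear in the file.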

Does this PR introduce any user-facing change?

Yes, but users can keep the default options for now.

How was this patch tested?

I have added a test case.

Check list

If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config
Update the release-note.

corgy-w · 2025-01-13T16:04:32Z

.../main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java

+                    reader.getNumberOfRows());
+            long rowCountPerSplit = rowCountPerSplitByUser;
+            if (rowCountPerSplit <= 0) {
+                // 按照文件大小自动分片


Comments written in Chinese, like this one, should be changed to English. Please check the rest of the change for similar cases yourself.
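
The quoted branch falls back to splitting by file size when the user-supplied row count is not positive. The PR's actual calculation is not visible in this excerpt, so the following is only one plausible way to do it, under stated assumptions: the class name `AutoSplitSizing`, the method `resolveRowCountPerSplit`, and the `targetSplitBytes` parameter are all hypothetical. It derives a split count from the file size and spreads the rows evenly across those splits.

```java
// Hypothetical sketch of an automatic fallback, not the PR's actual logic.
class AutoSplitSizing {

    // Returns the user's value when it is positive; otherwise derives a row count
    // per split so that each split covers roughly targetSplitBytes of the file.
    static long resolveRowCountPerSplit(
            long rowCountPerSplitByUser,
            long fileSizeBytes,
            long totalRows,
            long targetSplitBytes) {
        if (rowCountPerSplitByUser > 0) {
            return rowCountPerSplitByUser;
        }
        if (totalRows <= 0 || fileSizeBytes <= 0 || targetSplitBytes <= 0) {
            // Nothing sensible to compute: treat the whole file as one split.
            return Math.max(totalRows, 1);
        }
        // splitCount = ceil(fileSizeBytes / targetSplitBytes)
        long splitCount = 1 + (fileSizeBytes - 1) / targetSplitBytes;
        // rowsPerSplit = ceil(totalRows / splitCount)
        return 1 + (totalRows - 1) / splitCount;
    }

    public static void main(String[] args) {
        // 1000-byte file, 100 rows, 256-byte target: 4 splits of 25 rows each.
        System.out.println(resolveRowCountPerSplit(0, 1000, 100, 256));
    }
}
```

This assumes rows are roughly uniform in size; a format-aware approach (for example, aligning to ORC stripes) could balance splits better.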

liunaijie · 2025-01-14T01:15:16Z

.../main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java

+     * @return FileSourceSplit set
+     */
+    @Override
+    public Set<FileSourceSplit> getFileSourceSplits(String path) {


Hi, thanks for this great work. As you wrote, enabling this can make the data come out of order, so can we enable this feature only when the user configures the file_size_per_split/row_count_per_split parameters? If they are not set, we still use the original method. We can describe this feature in the documentation.

Also, to use this feature the file format must support quickly seeking to a specified offset, like the rows.seekToRow method.
We now support other file formats as well, such as parquet and avro. Can you help update the other file formats too? Thanks.
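
The seek requirement mentioned above can be sketched as follows. This is an illustration only: `SeekableRowReader` and `InMemoryReader` are hypothetical stand-ins for a format reader that can jump straight to a row (ORC's `RecordReader` offers `seekToRow` for this), and `readRange` is not a method from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: read only the rows of one split by seeking to its start.
class RangeReadSketch {

    // Stand-in for a reader that can jump to an arbitrary row without scanning.
    interface SeekableRowReader {
        void seekToRow(long row);
        boolean hasNext();
        String next();
    }

    // In-memory implementation so the sketch runs without any file format library.
    static final class InMemoryReader implements SeekableRowReader {
        private final List<String> rows;
        private int position = 0;

        InMemoryReader(List<String> rows) {
            this.rows = rows;
        }

        @Override
        public void seekToRow(long row) {
            // Clamp the target into [0, rows.size()].
            position = (int) Math.min(Math.max(row, 0), rows.size());
        }

        @Override
        public boolean hasNext() {
            return position < rows.size();
        }

        @Override
        public String next() {
            return rows.get(position++);
        }
    }

    // Reads rows in [startRow, endRow), stopping early if the file ends first.
    static List<String> readRange(SeekableRowReader reader, long startRow, long endRow) {
        List<String> out = new ArrayList<>();
        reader.seekToRow(startRow);
        for (long row = startRow; row < endRow && reader.hasNext(); row++) {
            out.add(reader.next());
        }
        return out;
    }

    public static void main(String[] args) {
        SeekableRowReader reader =
                new InMemoryReader(Arrays.asList("r0", "r1", "r2", "r3", "r4"));
        // Prints [r2, r3]
        System.out.println(readRange(reader, 2, 4));
    }
}
```

A format without a cheap seek would have to scan and discard every row before `startRow`, which would erase most of the benefit of splitting.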

OK.

I will add an option to control whether the feature is enabled.

I think text-format files are usually not that large. I will look into how to add support for them later.

Hisoka-X

Thanks @sohurdc !

...main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseSourceConfigOptions.java

.../main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java

Hisoka-X · 2025-01-15T03:55:28Z

.../main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java

+    }
+
+    @Override
+    public void read(FileSourceSplit split, Collector<SeaTunnelRow> output)


It would be best to do some refactoring; I found that the new read method has a lot in common with the old read method.

Hisoka-X · 2025-01-15T03:55:39Z

...n/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/ParquetReadStrategy.java

+     * In theory, batch column reading can improve reading performance.
+     */
+    @Override
+    public void read(FileSourceSplit split, Collector<SeaTunnelRow> output)


github-actions bot added connectors-v2 file labels Jan 13, 2025

corgy-w reviewed Jan 13, 2025

liunaijie reviewed Jan 14, 2025

Hisoka-X reviewed Jan 15, 2025

wangbin and others added 10 commits January 23, 2025 16:02

[Improve][Connector-file-base] split a orc file into multi splits for…

d574f94

… big file read scene.

[Improve][Connector-file-base] split a parquet file into multi splits…

369b8fb

… for big file read scene.

[Improve][Connector-file-base] In large file scenarios, split the sin…

644ae66

…gle file into multiple shards

[Improve][Connector-file-base] In large file scenarios, split the sin…

6b7a8e3

…gle file into multiple shards

[Improve][Connector-file-base] In large file scenarios, split the sin…

6e2c58c

…gle file into multiple shards

Update seatunnel-connectors-v2/connector-file/connector-file-base/src…

2b5152f

…/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/BaseSourceConfigOptions.java Co-authored-by: Jia Fan <[email protected]>

Update seatunnel-connectors-v2/connector-file/connector-file-base/src…

d63abd4

…/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java Co-authored-by: Jia Fan <[email protected]>

Update seatunnel-connectors-v2/connector-file/connector-file-base/src…

6002d9f

…/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java Co-authored-by: Jia Fan <[email protected]>

[Improve][Connector-file-base] In large file scenarios, split the sin…

4dd7269

…gle file into multiple shards

[Improve][Connector-file-base] In large file scenarios, split the sin…

71e9c36

…gle file into multiple shards

github-actions bot added document core SeaTunnel core module Spark Zeta transform-v2 e2e format api and removed file labels Jan 23, 2025

sohurdc force-pushed the orc_bigfile_split_read branch from 9d61d4c to 71e9c36 Compare January 23, 2025 08:10

github-actions bot removed document core SeaTunnel core module Zeta transform-v2 e2e labels Jan 23, 2025

github-actions bot added file and removed format api labels Jan 23, 2025

binwang219962 added 4 commits January 23, 2025 16:58

[Improve][Connector-file-base] In large file scenarios, split the sin…

e32d5cb

…gle file into multiple shards

[Improve][Connector-file-base] In large file scenarios, split the sin…

c7f3d59

…gle file into multiple shards

[Improve][Connector-file-base] In large file scenarios, split the sin…

4f462b6

…gle file into multiple shards

[Improve][Connector-file-base] In large file scenarios, split the sin…

67a3458

…gle file into multiple shards

