Search before asking

I had searched in the issues and found no similar issues.
Description
If the amount of data in a partition exceeds INSERT_BARCH_SIZE, each task submits multiple StreamLoad jobs. When a task fails and is retried, all of the partition's data is re-submitted via StreamLoad, including the data that was already written successfully, which results in duplicated data.
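To make the failure mode concrete, here is a minimal Scala sketch of the write path as I understand it; `writePartition`, `streamLoad`, and `batchSize` are illustrative names, not the connector's real identifiers:

```scala
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer

// Illustrative only: a task flushes a partition in batches. If a later
// batch fails, Spark retries the whole task and replays the iterator from
// the first row, so batches that already succeeded are loaded again.
def writePartition(rows: Iterator[Row], batchSize: Int,
                   streamLoad: Seq[Row] => Unit): Unit = {
  val buffer = ArrayBuffer.empty[Row]
  for (row <- rows) {
    buffer += row
    if (buffer.size >= batchSize) {   // batch reaches INSERT_BARCH_SIZE
      streamLoad(buffer.toSeq)        // this batch is now committed in Doris
      buffer.clear()
    }
  }
  if (buffer.nonEmpty) streamLoad(buffer.toSeq) // trailing partial batch
}
```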
Solution
My suggestion is to add a parameter that, when enabled, forces each partition to submit only a single StreamLoad, guaranteeing that data is never committed more than once.
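A rough sketch of what the switch could look like, assuming a boolean option (here called `oneLoadPerPartition`, a name invented for illustration):

```scala
import org.apache.spark.sql.Row

// Sketch of the proposed option; oneLoadPerPartition is a hypothetical name.
def writePartition(rows: Iterator[Row], batchSize: Int,
                   oneLoadPerPartition: Boolean,
                   streamLoad: Seq[Row] => Unit): Unit = {
  if (oneLoadPerPartition) {
    // Buffer the entire partition and submit exactly one StreamLoad.
    // A retried task then repeats a load that either fully succeeded or
    // fully failed, so no batch can be committed twice.
    streamLoad(rows.toSeq)
  } else {
    // Current chunked behavior: one StreamLoad per batchSize rows.
    rows.grouped(batchSize).foreach(batch => streamLoad(batch))
  }
}
```

The trade-off is memory: the executor has to hold a whole partition's rows before flushing, so partitions would need to be sized accordingly.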
Are you willing to submit PR?
Code of Conduct