This is something I noticed recently but worked around by removing bad rows (those where the column count does not match the header count) during pre-processing. However, the underlying problem can still surface with other types of bad rows.
Summary
When input splitting is enabled on the mlcp command line, mlcp can lose data, depending on the split size.
This was observed while ingesting several large files (~1M records each) that contained a small percentage of bad records, using various split sizes. The number of unaccounted-for records varied as the split size changed, depending on whether a split boundary happened to fall on a bad record.
Repro
Generate a large CSV file that includes randomly broken rows, like this:
H1,H2
a,b
c,d
d,e,f  # column count does not match the header
g,h,
etc.
Note: longer bad rows are better for reproducing the issue.
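For convenience, a small script along the following lines can generate such a file. This is a sketch, not the exact data used in testing: the row count, field lengths, and bad-row fraction are illustrative assumptions.

# Hypothetical generator for a large two-column CSV with occasional bad rows
# whose column count does not match the header.
import csv
import random
import string
import sys

def random_field(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))

def generate(path, total_rows=1_000_000, bad_fraction=0.001):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["H1", "H2"])  # header: 2 columns
        for _ in range(total_rows):
            if random.random() < bad_fraction:
                # Bad row: too many columns, and intentionally long so it is
                # more likely to span a split boundary.
                writer.writerow([random_field(200) for _ in range(5)])
            else:
                writer.writerow([random_field(), random_field()])

if __name__ == "__main__":
    generate(sys.argv[1] if len(sys.argv) > 1 else "test.csv")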
If a split boundary falls on a broken row, that row is lost without being reported.
Changing the split size changes the number of rows that are lost without being reported.
Removing the split option still skips the bad rows, but they are reported and everything is accounted for.
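For reference, splitting was enabled with a command along these lines (host, credentials, paths, and the split size are placeholders; the exact values varied between tests):

mlcp.sh import -host localhost -port 8000 -username admin -password password \
  -input_file_path /data/test.csv -input_file_type delimited_text \
  -split_input true -max_split_size 10485760

Dropping -split_input true (or setting it to false) corresponds to the "no split" case above, where bad rows are skipped and reported.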
The result is that, when checking the mlcp log, the reported totals plus skipped records do not add up to the actual number of records in the file. It can appear that everything was ingested successfully because the lost rows are silently dropped.
This has been tested with several recent versions of mlcp.