This is something I noticed recently but worked around by removing bad rows (those where the column count does not match the header count) during pre-processing. However, the underlying problem can still surface with other types of bad rows.
Summary
When input splitting is enabled on the mlcp command line, mlcp can lose data, depending on the split size.
This was observed while ingesting several large files (~1M records each) that contained a small percentage of bad records, using various split sizes. The number of unaccounted-for records varied as the split size changed, depending on whether a split boundary happened to fall on a bad record.
Repro
Generate a large CSV file that includes randomly broken rows, like this:
H1,H2
a,b
c,d
d,e,f  # column count does not match the header
g,h,
etc.
Note: longer bad rows are better for reproducing the issue.
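For convenience, a small script along the following lines can generate such a file. This is a sketch, not the exact data used in testing: the row count, field lengths, and bad-row fraction are illustrative assumptions.

# Hypothetical generator for a large two-column CSV with occasional bad rows
# whose column count does not match the header.
import csv
import random
import string
import sys

def random_field(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))

def generate(path, total_rows=1_000_000, bad_fraction=0.001):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["H1", "H2"])  # header: 2 columns
        for _ in range(total_rows):
            if random.random() < bad_fraction:
                # Bad row: too many columns, and intentionally long so it is
                # more likely to span a split boundary.
                writer.writerow([random_field(200) for _ in range(5)])
            else:
                writer.writerow([random_field(), random_field()])

if __name__ == "__main__":
    generate(sys.argv[1] if len(sys.argv) > 1 else "test.csv")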
If a split boundary falls on a broken row, that row is lost without being reported.
Changing the split size changes the number of rows that are lost without being reported.
Removing the split option still skips the bad rows, but they are reported and everything is accounted for.
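For reference, splitting was enabled with a command along these lines (host, credentials, paths, and the split size are placeholders; the exact values varied between tests):

mlcp.sh import -host localhost -port 8000 -username admin -password password \
  -input_file_path /data/test.csv -input_file_type delimited_text \
  -split_input true -max_split_size 10485760

Dropping -split_input true (or setting it to false) corresponds to the "no split" case above, where bad rows are skipped and reported.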
The result is that, when checking the mlcp log, the reported totals plus skipped records do not add up to the actual number of records in the file. It can appear that everything was ingested successfully because the lost rows are silently dropped.
This has been tested with several recent versions of mlcp.