Fix to read terabyte dataset in .gzip format #119

Open · wants to merge 3 commits into main

Conversation

msharmavikram

The current data_utils has a bug that prevents reading the terabyte dataset in its compressed form. It throws an error:
`UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte`

This fix addresses the issue by allowing gzip files to be opened correctly.
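For context, `0x8b` is the second byte of the gzip magic number (`1f 8b`), which is why decoding the compressed file as UTF-8 text fails immediately. A minimal sketch of the kind of change this implies, using a hypothetical helper name rather than the exact patch:

```python
import gzip

def open_day_file(path):
    # Hypothetical helper: open .gz files through gzip in text mode so they are
    # decompressed transparently, instead of being decoded as raw UTF-8 bytes.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```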
@facebook-github-bot

Hi @msharmavikram!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@msharmavikram
Author

msharmavikram commented Aug 10, 2020

Addressed the Facebook CLA Check issue.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Aug 10, 2020
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@mnaumovfb
Contributor

Thank you for looking into the data loader and proposing an update to it. I don't believe this is a bug. The code expects that the downloaded zipped files are first unzipped and then operates on the resulting text files (this is mentioned in the README).

However, an important consideration is the speed of reading the data. Do you know whether reading zipped files is faster than reading unzipped text files, and by how much?

@msharmavikram
Author

@mnaumovfb thanks for looking at my changes. After taking a look at the code a bit more, I would like to withdraw the PR I created. Of course, reading .gzip files is faster than reading plain text, especially when the dataset is stored on slow media like an HDD, but the benefit is only incremental and not a big deal.

Having said that, I believe what is worth changing is how the code does data prep. Here is the issue: the high-level pseudocode of data_utils.py is something like this -

for i in days:
    read(day_$i.gz) to find the total line count and per-file line counts                        -- limited by read bandwidth
for i in days:
    read(day_$i) to write(day_$i.npz), with minor data cleaning and filling of missing fields    -- limited by read/write bandwidth
for i in days:
    read(day_$i.npz) to write(day_$i_processed.npz), computing the pre-computed dictionaries     -- limited by read/write bandwidth

This pseudocode is fine if you are working with a few GB of data, but with a TB of data, as in the case of the Criteo dataset, it is fundamentally limited by IO bandwidth, and performing repeated reads over such a large dataset during the data prep stage takes days to complete. I am considering addressing this issue now, even though this is a one-time process to create the .npz files.

I have a rough idea of how to merge the second and third for-loops into one loop, effectively improving overall performance by at least 33%. I want to merge the first loop too, but haven't figured out a simple way to do it.
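A minimal sketch of what the merged loop could look like, assuming hypothetical helpers in place of the real data_utils routines (read_raw_day, clean_day, and update_dicts are placeholders, not functions that exist in the repo):

```python
import numpy as np

def preprocess_days_merged(num_days, read_raw_day, clean_day, update_dicts):
    """Fold the second and third passes into one: each day's raw text is read
    once, cleaned, folded into the categorical dictionaries, and written out,
    instead of re-reading an intermediate .npz in a separate pass."""
    dicts = {}
    for i in range(num_days):
        raw = read_raw_day(i)                  # single read of day_$i from disk
        cleaned = clean_day(raw)               # pass-2 work: clean, fill missing fields (dict of arrays)
        dicts = update_dicts(dicts, cleaned)   # pass-3 work, done without a re-read
        np.savez(f"day_{i}_processed.npz", **cleaned)  # still one output file per day
    return dicts
```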
I believe we can also do parallel reads (until we saturate disk IO bandwidth) to gain further performance beyond 33%. Something to think about as well while making changes to the code.
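And a minimal sketch of the parallel-read idea, again with a hypothetical per-day worker: the per-day read/clean/write work is independent, so a process pool can handle several days at once until disk IO bandwidth is saturated (building the shared dictionaries is the part that does not parallelize as easily):

```python
from concurrent.futures import ProcessPoolExecutor

def preprocess_one_day(i):
    # Hypothetical per-day worker: read, clean, and write day i independently.
    # The shared dictionary-building step would still need a merge afterwards.
    ...

def preprocess_all_days(num_days, workers=4):
    # Process several days concurrently until disk IO bandwidth is saturated.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(preprocess_one_day, range(num_days)))
```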

@mnaumovfb
Contributor

mnaumovfb commented Aug 14, 2020

The pre-processing is indeed an important part of the code.

There is a reason why the pre-processing is broken into three stages: it can be restarted on stage boundaries, which is helpful if it is interrupted. However, if you are able to completely merge the last two stages it might be worthwhile (note that you also need to support the --memory-map flag to reduce memory consumption, and you also have the option of using the binary loader, --mlperf-bin-loader).

Finally, parallel pre-processing of multiple days has also been discussed in #58, and PR #117 has recently been proposed for it. This parallelization is not as trivial as it looks (#117 (comment)), but I'm reviewing the proposal in #117, which looks promising.
