Fix to read terabyte dataset in .gzip format #119

Open · wants to merge 3 commits into main

Conversation

msharmavikram

The current data_utils has a bug that prevents reading the terabyte dataset in its compressed form. It throws an error:
`UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte`

This fix addresses the issue by allowing gzip files to be opened correctly.
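For context, `0x8b` is the second byte of the gzip magic number (`1f 8b`), which is why decoding the compressed file as UTF-8 text fails immediately. A minimal sketch of the kind of change this implies, using a hypothetical helper name rather than the exact patch:

```python
import gzip

def open_day_file(path):
    # Hypothetical helper: open .gz files through gzip in text mode so they are
    # decompressed transparently, instead of being decoded as raw UTF-8 bytes.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```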
@facebook-github-bot

Hi @msharmavikram!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@msharmavikram
Author

msharmavikram commented Aug 10, 2020

Addressed the Facebook CLA Check issue.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Aug 10, 2020
@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@mnaumovfb
Contributor

Thank you for looking into the data loader and proposing an update to it. I don't believe this is a bug. The code expects that the downloaded zipped files are first unzipped and then operates on the resulting text files (this is mentioned in the README).

However, an important consideration is the speed of reading the data. Do you know whether reading zipped files is faster than reading unzipped text files, and by how much?

@msharmavikram
Author

@mnaumovfb thanks for looking at my changes. After taking a look at the code a bit more, I would like to withdraw the PR I created. Of course, reading .gzip files is faster than reading plain text, especially when the dataset is stored on slow media like an HDD, but the benefit is only incremental and not a big deal.

Having said that, I believe what is worth changing is how the code does data prep. Here is the issue: the high-level pseudocode of data_utils.py is something like this -

for i in days:
    read(day_$i.gz) to find the total line count and per-file line counts                        -- limited by read bandwidth
for i in days:
    read(day_$i) to write(day_$i.npz), with minor data cleaning and filling of missing fields    -- limited by read/write bandwidth
for i in days:
    read(day_$i.npz) to write(day_$i_processed.npz), computing the pre-computed dictionaries     -- limited by read/write bandwidth

This pseudocode is fine if you are working with a few GB of data, but with a TB of data, as in the case of the Criteo dataset, it is fundamentally limited by IO bandwidth, and performing repeated reads over such a large dataset during the data prep stage takes days to complete. I am considering addressing this issue now, even though this is a one-time process to create the .npz files.

I have a rough idea of how to merge the second and third for-loops into one loop, effectively improving overall performance by at least 33%. I want to merge the first loop too, but haven't figured out a simple way to do it.
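A minimal sketch of what the merged loop could look like, assuming hypothetical helpers in place of the real data_utils routines (read_raw_day, clean_day, and update_dicts are placeholders, not functions that exist in the repo):

```python
import numpy as np

def preprocess_days_merged(num_days, read_raw_day, clean_day, update_dicts):
    """Fold the second and third passes into one: each day's raw text is read
    once, cleaned, folded into the categorical dictionaries, and written out,
    instead of re-reading an intermediate .npz in a separate pass."""
    dicts = {}
    for i in range(num_days):
        raw = read_raw_day(i)                  # single read of day_$i from disk
        cleaned = clean_day(raw)               # pass-2 work: clean, fill missing fields (dict of arrays)
        dicts = update_dicts(dicts, cleaned)   # pass-3 work, done without a re-read
        np.savez(f"day_{i}_processed.npz", **cleaned)  # still one output file per day
    return dicts
```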
I believe we can also do parallel reads (until we saturate disk IO bandwidth) to gain further performance beyond 33%. Something to think about as well while making changes to the code.
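And a minimal sketch of the parallel-read idea, again with a hypothetical per-day worker: the per-day read/clean/write work is independent, so a process pool can handle several days at once until disk IO bandwidth is saturated (building the shared dictionaries is the part that does not parallelize as easily):

```python
from concurrent.futures import ProcessPoolExecutor

def preprocess_one_day(i):
    # Hypothetical per-day worker: read, clean, and write day i independently.
    # The shared dictionary-building step would still need a merge afterwards.
    ...

def preprocess_all_days(num_days, workers=4):
    # Process several days concurrently until disk IO bandwidth is saturated.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(preprocess_one_day, range(num_days)))
```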

@mnaumovfb
Contributor

mnaumovfb commented Aug 14, 2020

The pre-processing is indeed an important part of the code.

There is a reason why the pre-processing is broken into three stages: it can be restarted on stage boundaries, which is helpful if it is interrupted. However, if you are able to completely merge the last two stages it might be worthwhile (note that you also need to support the --memory-map flag to reduce memory consumption, and you also have the option of using the binary loader, --mlperf-bin-loader).

Finally, parallel pre-processing of multiple days has also been discussed in #58, and PR #117 has recently been proposed for it. This parallelization is not as trivial as it looks (#117 (comment)), but I'm reviewing the proposal in #117, which looks promising.
