Fix to read terabyte dataset in .gzip format #119
Conversation
The current data_utils has a bug that prevents reading the terabyte dataset in .gz format. It throws an error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte`. This fix addresses the issue by allowing gzip files to be opened correctly.
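For illustration, a minimal sketch of the kind of change being described; the helper name and the extension check are assumptions, not the actual patch:

```python
import gzip

def open_day_file(path):
    # Hypothetical helper: pick the right opener based on the file extension.
    # gzip files start with the magic bytes 0x1f 0x8b, which is why a plain
    # text open() fails with "invalid start byte" on byte 0x8b.
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # read as text, decompressing on the fly
    return open(path, "r")

# Usage: iterate over the raw Criteo lines regardless of compression.
# with open_day_file("day_0.gz") as f:
#     for line in f:
#         fields = line.rstrip("\n").split("\t")
```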
Fixed typo.
Hi @msharmavikram! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Addressed the Facebook CLA Check issue.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Thank you for looking into the data loader and proposing an update to it. I don't believe this is a bug. The code expects that the downloaded zipped files are first unzipped and then operates on the resulting text files (this is mentioned in the README). However, an important consideration is the speed of reading the data. Do you know whether reading zipped files is faster than reading unzipped text files, and by how much?
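For reference, one way to answer the speed question is a single sequential pass over the same day file in both forms; a minimal sketch, with placeholder file names:

```python
import gzip
import time

def time_line_read(path, opener):
    # Count lines and measure wall-clock time for one sequential pass.
    start = time.time()
    n = 0
    with opener(path) as f:
        for _ in f:
            n += 1
    return n, time.time() - start

# Placeholder paths; compare a compressed day against its unzipped counterpart.
# lines_gz, secs_gz = time_line_read("day_0.gz", lambda p: gzip.open(p, "rt"))
# lines_txt, secs_txt = time_line_read("day_0", lambda p: open(p, "r"))
# print(f"gzip: {secs_gz:.1f}s  text: {secs_txt:.1f}s")
```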
@mnaumovfb thanks for looking at my changes. After taking another look at the code, I would like to withdraw the PR I created. Of course, reading .gz files is faster than reading plain text, especially when the dataset is stored on slow media like an HDD, but the benefit is only incremental and not a big deal. Having said that, I believe what is worth changing is how the code does data prep. Here is the issue: the high-level pseudo-code of data_utils.py is something of this sort -
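A hedged reconstruction of the three-pass structure being described; the function names are illustrative stubs, not taken from data_utils.py:

```python
def process_raw_day(path):
    """Pass 1 work: read and split/transform one raw day file."""

def build_categorical_dicts(path):
    """Pass 2 work: scan one day again to collect unique categorical values."""

def convert_and_save_npz(path):
    """Pass 3 work: scan once more, remap categoricals, write the per-day .npz."""

def preprocess(day_files):
    for path in day_files:      # first full read of the terabyte dataset
        process_raw_day(path)
    for path in day_files:      # second full read
        build_categorical_dicts(path)
    for path in day_files:      # third full read
        convert_and_save_npz(path)
```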
This pseudo-code is fine if you are working with a few GBs of data, but with a TB of data, as in the case of the Criteo dataset, it is fundamentally limited by IO bandwidth, and performing repeated reads over such a large dataset during the data-prep stage takes days to complete. I am considering addressing this issue now, even though it is a one-time process to create the .npz files. I have a rough idea of how to merge the second and third for-loops into a single loop (sketched below), effectively improving the overall performance by at least 33%. I want to merge the first loop too, but haven't figured out a simple way to do it.
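For illustration, one way the second and third loops could be fused: assign each categorical value an index the first time it is seen, so remapped rows can be written out during the same scan that builds the dictionaries. The function name, column layout, and output handling below are assumptions, not the actual proposal:

```python
def merged_pass(day_files, n_cat=26):
    # One dictionary per categorical column, mapping raw value -> index.
    cat_dicts = [dict() for _ in range(n_cat)]
    for path in day_files:              # single full read instead of two
        with open(path, "r") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                cats = fields[-n_cat:]  # assume the last 26 columns are categorical
                # setdefault assigns the next free index on first encounter,
                # so remapping can happen in the same pass as dictionary building.
                remapped = [
                    cat_dicts[j].setdefault(v, len(cat_dicts[j]))
                    for j, v in enumerate(cats)
                ]
                # ... write the label, dense features, and `remapped` values
                # to the per-day .npz buffer here ...
    return cat_dicts
```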
The pre-processing is indeed an important part of the code. There is a reason why the pre-processing is broken into three stages: we can restart it on stage boundaries, which is helpful if the pre-processing is interrupted. However, if you are able to completely merge the last two stages it might be helpful (notice that you also need to support

Finally, the parallel pre-processing of multiple days has also been discussed in #58, and PR #117 has recently been proposed for it. This parallelization is not as trivial as it looks (#117 (comment)), but I'm reviewing the proposal in #117, which looks promising.