When trying to warc2text this massive (29 GB) file on cirrus, /beegfs/paracrawl/data/ia/wide00015-warcs/WIDE-20170107025349-crawl808/WIDE-20170107025349-02068.warc.gz, the process is killed by the OOM killer.
I know this is a design choice, and it's not really high priority, but it would be nice if there were a way to deal with it without skipping the whole WARC file. For example:
- partially read a record so we can parse the header and estimate whether it is of interest, and if not, skip over the rest of the record instead of trying to store it in memory (see the sketch after this list);
- or, more easily, have WARCReader::getRecord skip a record if, while reading and deflating, it starts to become too large.
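A rough sketch of the first option, assuming a hypothetical decompressed-stream interface; the helper names (readHeaderBlock, contentLength, shouldSkip) and header-parsing details are illustrative assumptions, not warc2text's actual API. The idea is to read only up to the end of the WARC header block, check Content-Length, and skip the body without ever buffering it:

```cpp
#include <cstdlib>
#include <istream>
#include <string>

// Hypothetical helper: consume decompressed bytes until the blank line that
// terminates the WARC header block, returning just the header text.
std::string readHeaderBlock(std::istream& in) {
    std::string header, line;
    while (std::getline(in, line) && line != "\r" && !line.empty())
        header += line + "\n";
    return header;
}

// Extract the declared body size from the header, or -1 if missing.
long contentLength(const std::string& header) {
    auto pos = header.find("Content-Length:");
    if (pos == std::string::npos) return -1;
    return std::strtol(header.c_str() + pos + 15, nullptr, 10);
}

// Decide whether to skip the body instead of reading it into memory.
bool shouldSkip(const std::string& header, long maxRecordSize) {
    long len = contentLength(header);
    return len < 0 || len > maxRecordSize;
}
```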
I have implemented the easy fix for now, but it is definitely a good idea to refactor in the future and parse the WARC header before reading the entire body.
I'm not sure what the maximum size of a record should be (I set it to 20 MB), so feel free to change that value.
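For reference, a minimal sketch of what such a size cap during decompression could look like; this is not the actual warc2text code, and the function name, buffer handling, and default 20 MB limit are assumptions for illustration (only the zlib calls are real). Decompression of the gzip member is aborted as soon as the output grows past the cap, so an oversized record is dropped instead of being held in memory:

```cpp
#include <zlib.h>
#include <string>
#include <vector>

// Inflate a gzip-compressed record, but give up once the decompressed output
// exceeds maxRecordSize. Returns false if the record is oversized or corrupt.
bool inflateCapped(const std::vector<unsigned char>& in, std::string& out,
                   size_t maxRecordSize = 20 * 1024 * 1024) {
    z_stream zs{};
    if (inflateInit2(&zs, 16 + MAX_WBITS) != Z_OK) return false;  // gzip wrapper
    zs.next_in = const_cast<unsigned char*>(in.data());
    zs.avail_in = static_cast<uInt>(in.size());

    std::vector<unsigned char> buf(64 * 1024);
    int ret = Z_OK;
    while (ret != Z_STREAM_END) {
        zs.next_out = buf.data();
        zs.avail_out = static_cast<uInt>(buf.size());
        ret = inflate(&zs, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) { inflateEnd(&zs); return false; }
        out.append(reinterpret_cast<char*>(buf.data()), buf.size() - zs.avail_out);
        if (out.size() > maxRecordSize) {  // record too large: stop early
            inflateEnd(&zs);
            return false;
        }
    }
    inflateEnd(&zs);
    return true;
}
```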