.gz WARC files not properly read #21

MrMagoffin · 2015-06-15T13:12:39Z

When reading WARC files compressed with gzip, many of the entries are skipped or misread. To reproduce, take Common Crawl data in .gz format, count the number of entries found by the WARC library, and then count the number of occurrences of WARC/1.0 in the file. The two counts differ by a very large margin.
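The second half of that comparison (counting WARC/1.0 markers) can be sketched with the standard library alone. This is a minimal sketch, not code from the report: the function name count_warc_markers is hypothetical, it is written for Python 3 (the thread itself uses Python 2.7), and it may overcount if a record payload happens to contain a line beginning with WARC/1.0.

```python
import gzip


def count_warc_markers(path, marker=b"WARC/1.0"):
    """Count lines in a gzipped file that start with the WARC version marker.

    Hypothetical helper for illustration; it is not part of the warc library.
    """
    count = 0
    # gzip.open reads through all concatenated gzip members transparently,
    # so every record in a per-record-compressed file is visited.
    with gzip.open(path, "rb") as stream:
        for line in stream:
            if line.startswith(marker):
                count += 1
    return count
```

Comparing this number with the number of records yielded by iterating the file with the warc library gives the discrepancy described above.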

laura-dietz · 2015-06-25T13:12:31Z

Are your entries skipped, or does it stop after the first n entries?

I am fighting a similar issue, using a WARC file created by Python's warc library. When I read it (again with Python's warc), it only reads the first 249 entries and stops.

brenreyes · 2015-07-09T16:08:47Z

Hello,

I am having the same issue. In my case it does not seem to skip any records; it just reads up to a certain number and then stops. This makes it difficult for me to work with larger WARC files.

MrMagoffin · 2015-07-10T02:17:31Z

Yep, that's the problem. Stops after the first 200 or so entries.

brenreyes · 2015-07-10T17:00:55Z

I am posting a sample file that illustrates what I am talking about:
http://webarchive.library.unt.edu/thumbs/UNT-sample.wat.warc.gz

This is in WAT format, but it will work with warc.py because a WAT file is essentially a WARC file.
I am trying to open it by using something like:

import warc

warc_file = warc.open("UNT-sample.wat.warc.gz", "rb", "warc")
for record in warc_file:
    print record["WARC-Record-ID"]

If you look at the output, you'll see that the last element it reads has a WARC-Record-ID of "urn:uuid:b12431f1-b946-417f-bee0-babdc123f265", which is located approximately 12% into the file. So the code ignores the rest of the records.

Hope this helps.

everilae · 2016-03-23T12:40:16Z

It seems that WARCReader stops short when it reads None instead of new headers. I'm quick to blame the custom GzipFile handling, because

import gzip
import warc

with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()

works just fine.
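One way to see why a reader can stop early on such files: a gzipped WARC file is typically a concatenation of many gzip members (Common Crawl compresses each record separately), and a decompressor that handles only one member stops at the first member boundary. The sketch below uses only the standard library and is not the warc library's code; the function names are hypothetical, and it targets Python 3.

```python
import gzip
import io
import zlib


def read_first_member_only(data):
    """Decompress with one decompressor object; it stops after the first member."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    out = decomp.decompress(data)
    # Bytes belonging to later members are left untouched in unused_data.
    return out, decomp.unused_data


def read_all_members(data):
    """gzip.GzipFile continues across member boundaries until the input ends."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
        return gz.read()
```

Given two concatenated members, read_first_member_only returns only the first member's content plus the unread remainder, while read_all_members returns the content of both. That matches the observation that passing a gzip.open stream to warc.WARCFile reads the whole file.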

sebastian-nagel · 2016-10-19T08:38:07Z

Work-around to read a gzipped WARC file completely:

import os
import warc

f = warc.open(warcfile)
fsize = os.path.getsize(warcfile)

while fsize > f.tell():
    for record in f:
        ...

RahulGuptaIIITA · 2017-09-21T19:08:54Z

@sebastian-nagel @everilae
I'm trying to parse a WARC file and I'm getting this error:

    self.finish_reading_current_record()
  File "lib/python2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "lib/python2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'WARC/1.0\r\n'

The code snippet is pasted below.

import gzip
import warc

file_name = "XYZ-WARCS.warc.gz"
with gzip.open(file_name, mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()

meshiguge · 2017-12-18T00:45:49Z

@sebastian-nagel could I randomly access a WARC file or a .gz WARC file with seek-like functions in this package?

kartheek7895 · 2018-02-15T08:35:33Z

f.seek(offset) does not seem to work on a WARC file. Are any workarounds possible?

sebastian-nagel · 2018-02-15T17:15:09Z

This could be done using warcio instead; see extract_record.
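Independently of warcio (whose extract_record is not reproduced here), random access to a single record can be sketched with the standard library, assuming each record is stored as its own gzip member and its byte offset in the compressed file is already known, for example from an external index. The function name read_member_at and the chunk_size parameter are hypothetical; the code targets Python 3.

```python
import zlib


def read_member_at(path, offset, chunk_size=64 * 1024):
    """Return the decompressed bytes of the one gzip member starting at `offset`.

    Assumes `offset` points at the first byte of a gzip member in the
    compressed file; this is an illustration, not the warc or warcio API.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)  # seek in the compressed file, not the decompressed stream
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # file ended before the member did (truncated input)
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

The returned bytes are the raw record (headers and payload) and would still need to be parsed by a WARC reader.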

Afe95 · 2019-01-31T21:30:49Z

I have the same problem with the file crawl-data/CC-MAIN-2017-17/segments/1492917118310.2/warc/CC-MAIN-20170423031158-00113-ip-10-145-167-34.ec2.internal.warc.gz.

Traceback (most recent call last):
  File "mapper_new.py", line 54, in <module>
    for record in f:
  File ".venv2/local/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File ".venv2/local/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File ".venv2/local/lib/python2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File ".venv2/local/lib/python2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found ''

I am using the utility GzipStreamFile: should I avoid using it?

My version of the warc library is 0.2.1 and I am using Python 2.7.

sebastian-nagel mentioned this issue Oct 19, 2016

News WARC files processing issue. commoncrawl/news-crawl#11

Closed

ivanistheone mentioned this issue Feb 12, 2021

Unsupported WARC version: 1.1 #34

Open

sebastian-nagel added a commit to commoncrawl/warc that referenced this issue Aug 23, 2021

Continue to read gzipped WARC file in case end of gzip member is reached after WARC record is entirely consumed, fixes internetarchive#21

9d1ef7d

sebastian-nagel mentioned this issue Aug 23, 2021

Entirely read gzipped WARC files #35

Closed

sebastian-nagel added a commit to commoncrawl/warc that referenced this issue Aug 23, 2021

Continue to read gzipped WARC file in case end of gzip member is reached after WARC record is entirely consumed, fixes internetarchive#21

713e75c
