.gz WARC files not properly read #21

Open
MrMagoffin opened this issue Jun 15, 2015 · 11 comments

Comments

@MrMagoffin

When reading WARC files compressed with gzip, many of the entries they contain are skipped or misread. To reproduce, take Common Crawl data in .gz format, count the number of entries the WARC library finds, and then count the number of occurrences of WARC/1.0 in the file. The two counts differ by a very large margin.
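The reference count described above can be obtained without the WARC library at all. The following is a minimal sketch in Python 3 using only the standard library; the function name `count_record_markers` is made up for illustration. It counts only lines that consist solely of the version marker, so a payload that merely mentions WARC/1.0 inside a longer line is not counted (a payload line that is exactly the marker would still be counted, so treat the result as an estimate).

```python
import gzip


def count_record_markers(path, marker=b"WARC/1.0"):
    """Count lines in a gzipped file that consist only of the WARC version marker.

    `count_record_markers` is a hypothetical helper, not part of any library.
    """
    count = 0
    # gzip.open keeps reading across concatenated gzip members,
    # so every record in the file is visited, not just the first member.
    with gzip.open(path, "rb") as stream:
        for line in stream:
            if line.rstrip(b"\r\n") == marker:
                count += 1
    return count
```

Comparing this number with the number of records yielded by the library's iterator gives a rough measure of how many records were missed.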

@laura-dietz

Are your entries skipped, or does it stop after the first n entries?

I am fighting a similar issue with a WARC file created by Python's warc: when I read it back (again with Python's warc), it reads only the first 249 entries and then stops.

@brenreyes

Hello,

I am having the same issue. In my case it does not seem to skip any records; it just reads up to a certain number and then stops. This makes it difficult for me to work with larger WARC files.

@MrMagoffin
Author

Yep, that's the problem. Stops after the first 200 or so entries.

@brenreyes

I am posting a sample file that illustrates what I am talking about:
http://webarchive.library.unt.edu/thumbs/UNT-sample.wat.warc.gz

This is in WAT format, but it will work with warc.py because a WAT file is essentially a WARC file.
I am trying to open it by using something like:

import warc

warc_file = warc.open("UNT-sample.wat.warc.gz", "rb", "warc")
for record in warc_file:
    print record["WARC-Record-ID"]

If you look at the output, you'll see that the last record it reads has a WARC-Record-ID of "urn:uuid:b12431f1-b946-417f-bee0-babdc123f265", which is located approximately 12% into the file. The code ignores the rest of the records.

Hope this helps.

@everilae

It seems that WARCReader stops short when it reads None instead of new headers. I'm quick to blame the custom GzipFile handling, because

import gzip
import warc

with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()

works just fine.
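One property of gzipped WARC files that may be relevant here: they are commonly written as a series of concatenated gzip members, often one per record, and a decompressor that stops at the end of the first member sees only part of the file. Whether that is the actual cause of this bug is an assumption, not something confirmed in this thread. The Python 3 sketch below uses only the standard library, with made-up record contents and hypothetical helper names, to show the difference between stopping after one member and continuing through all of them.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tells zlib to expect a gzip header and trailer


def first_member_only(data):
    """Decompress only the first gzip member; return (output, leftover bytes)."""
    d = zlib.decompressobj(GZIP_WBITS)
    output = d.decompress(data)
    # Everything after the first member's trailer is left untouched here.
    return output, d.unused_data


def all_members(data):
    """Decompress every concatenated gzip member, one after another."""
    parts = []
    while data:
        d = zlib.decompressobj(GZIP_WBITS)
        parts.append(d.decompress(data))
        data = d.unused_data  # continue with the next member, if any
    return parts


# Two made-up records, each compressed as its own gzip member.
records = [b"WARC/1.0\r\nrecord one\r\n\r\n", b"WARC/1.0\r\nrecord two\r\n\r\n"]
blob = b"".join(gzip.compress(r) for r in records)

output, leftover = first_member_only(blob)
```

Here `output` holds only the first record and `leftover` still holds the second, compressed; `all_members(blob)` returns both. The standard `gzip.open` behaves like `all_members`, which would be consistent with the snippet above reading the whole file.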

@sebastian-nagel

Work-around to read a gzipped WARC file completely:

import os
import warc

f = warc.open(warcfile)
fsize = os.path.getsize(warcfile)

while fsize > f.tell():
    for record in f:
        ...

@RahulGuptaIIITA

RahulGuptaIIITA commented Sep 21, 2017

@sebastian-nagel @everilae
I'm trying to parse a WARC file and I'm getting this error:
    self.finish_reading_current_record()
  File "lib/python2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "lib/python2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'WARC/1.0\r\n'

The code snippet is below:

import gzip
import warc

file_name = "XYZ-WARCS.warc.gz"
with gzip.open(file_name, mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()

@meshiguge

@sebastian-nagel Can I randomly access a WARC file or a .gz WARC file with seek-like functions in this package?

@kartheek7895

f.seek(offset) does not seem to work on a WARC file. Are any workarounds possible?

@sebastian-nagel

sebastian-nagel commented Feb 15, 2018

This could be done using warcio instead; see extract_record.
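To illustrate the general idea of random access without any extra dependency, here is a minimal standard-library sketch in Python 3. It is not warcio's implementation. It assumes that each record is stored as its own gzip member and that the byte offset of that member in the compressed file is already known, for example from an index; the function name `read_member_at` is hypothetical.

```python
import zlib


def read_member_at(path, offset, block_size=64 * 1024):
    """Return the decompressed bytes of the gzip member that starts at `offset`.

    `offset` is a position in the compressed file, not in the decompressed
    stream. This helper is a hypothetical illustration.
    """
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip framing
    chunks = []
    with open(path, "rb") as raw:
        raw.seek(offset)
        # d.eof becomes True once the member's trailer has been consumed.
        while not d.eof:
            block = raw.read(block_size)
            if not block:
                break  # file ended early: return what was decoded so far
            chunks.append(d.decompress(block))
    return b"".join(chunks)
```

The returned bytes are the raw record (headers plus payload) and would still need to be parsed. If the offset does not point at the start of a gzip member, `zlib.error` is raised.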

@Afe95

Afe95 commented Jan 31, 2019

I have the same problem with the file crawl-data/CC-MAIN-2017-17/segments/1492917118310.2/warc/CC-MAIN-20170423031158-00113-ip-10-145-167-34.ec2.internal.warc.gz.

Traceback (most recent call last):
  File "mapper_new.py", line 54, in <module>
    for record in f:
  File ".venv2/local/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File ".venv2/local/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File ".venv2/local/lib/python2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File ".venv2/local/lib/python2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found ''

I am using the utility GzipStreamFile: should I avoid using it?

My version of warc library is 0.2.1 and I am using Python 2.7

sebastian-nagel added a commit to commoncrawl/warc that referenced this issue Aug 23, 2021
is reached after WARC record is entirely consumed, fixes internetarchive#21