Apparent issue using wget-created .warc.gz files #20

ZoeB · 2015-05-15T07:29:38Z

Hi!

First of all, thank you for writing this, it's very useful!

It looks like it has an issue parsing the wget-created .warc.gz files I give it, though:

Traceback (most recent call last):
  File "./find-broken-links.py", line 16, in <module>
    for record in file:
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'software: Wget/1.14 (linux-gnueabihf)\r\n'
-1

Alas, I suspect fixing this elegantly is probably out of my depth. Is this something you can do?

Thank you,
Zoë.

john-hewitt · 2016-06-06T13:14:21Z

Bump on this. WET and WAT files are handled fine, but for raw WARCs the library fails to read through the entire record in read_record, so the expect calls in finish_reading_current_record fail.

Attached is a WARC with which I have seen this behavior.

10001.warc.gz
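
To illustrate the failure mode described above, here is a minimal sketch of a length-framed record reader. It is not the warc library's actual code: read_record and make_record below are hypothetical helpers written for this example, and the record layout is simplified. It only shows that if the reader consumes fewer bytes than the payload really contains (for instance because the declared Content-Length is wrong), the check for the trailing "\r\n" lands in the middle of the payload and fails in the same way as the traceback.

```python
import io


def read_record(stream):
    """Read one simplified record: version line, headers, blank line,
    Content-Length bytes of payload, then two CRLF terminators."""
    version = stream.readline()
    if not version:
        return None  # end of stream

    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b""):
            break  # blank line ends the header block
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip().lower()] = value.strip()

    # Trust the declared length when reading the payload.
    length = int(headers["content-length"])
    payload = stream.read(length)

    # The record must end with CRLF CRLF right after the payload.
    for _ in range(2):
        line = stream.readline()
        if line != b"\r\n":
            raise IOError("Expected %r, found %r" % (b"\r\n", line))
    return headers, payload


def make_record(payload, declared_length):
    """Build a record whose Content-Length may disagree with the payload."""
    return (
        b"WARC/1.0\r\n"
        b"WARC-Type: resource\r\n"
        b"Content-Length: " + str(declared_length).encode("ascii") + b"\r\n"
        b"\r\n" + payload + b"\r\n\r\n"
    )


if __name__ == "__main__":
    # Correct length: the record parses cleanly.
    print(read_record(io.BytesIO(make_record(b"hello world", 11))))

    # Declared length too short: the terminator check hits payload bytes.
    try:
        read_record(io.BytesIO(make_record(b"hello world", 3)))
    except IOError as exc:
        print("failed:", exc)
```

The second call raises because the reader stops three bytes into the payload and then finds the rest of it where the terminator should be.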

EDIT: One potential cause is how the WARCs were written. I think my error might come from how I handle Unicode when downloading and writing the WARCs: decode errors handled by replacing the offending character can change the byte count of the content.
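
As a small illustration of the byte-count point, assuming the content is decoded as UTF-8 with errors="replace" and then re-encoded before being written (the helper name roundtrip_length is made up for this sketch):

```python
def roundtrip_length(raw):
    """Return (original byte count, byte count after a lossy decode/encode)."""
    # Invalid bytes become U+FFFD, which takes three bytes in UTF-8.
    text = raw.decode("utf-8", errors="replace")
    rewritten = text.encode("utf-8")
    return len(raw), len(rewritten)


if __name__ == "__main__":
    # b"caf\xe9" is "café" in Latin-1, which is not valid UTF-8.
    print(roundtrip_length(b"caf\xe9"))  # (4, 6)
    # Pure ASCII survives the round trip unchanged.
    print(roundtrip_length(b"cafe"))     # (4, 4)
```

If the Content-Length header was computed from the original bytes but the rewritten bytes are what ends up in the file (or the other way round), the declared length no longer matches the payload, and a reader that trusts the header will look for the record terminator in the wrong place.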
