Apparent issue using wget-created .warc.gz files #20

Open
ZoeB opened this issue May 15, 2015 · 1 comment

Comments


ZoeB commented May 15, 2015

Hi!

First of all, thank you for writing this, it's very useful!

It looks like it has an issue parsing the wget-created .warc.gz files I give it, though:

Traceback (most recent call last):
  File "./find-broken-links.py", line 16, in <module>
    for record in file:
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'software: Wget/1.14 (linux-gnueabihf)\r\n'
-1

Alas, I suspect fixing this elegantly is probably out of my depth. Is this something you can do?

Thank you,
Zoë.


john-hewitt commented Jun 6, 2016

Bump on this. WET and WAT files are handled fine, but for raw WARCs the library fails to read through the entire record in read_record, so the expect calls in finish_reading_current_record fail.
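For anyone following along, here is a minimal sketch of the failure mode described above. It is not the warc library's actual code: `read_record` and `build_record` are hypothetical helpers, and the record layout is simplified to a version line, a `Content-Length` header, a blank line, the payload, and two trailing CRLFs. It only illustrates why a reader that has not consumed the whole payload finds payload bytes where it expects `'\r\n'`.

```python
import io


def read_record(stream):
    """Read one simplified WARC-like record (hypothetical helper, not the warc API)."""
    stream.readline()  # version line, e.g. b"WARC/1.0\r\n"
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b""):  # blank line ends the header block
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The reader trusts the declared length to find the end of the payload.
    payload = stream.read(int(headers["Content-Length"]))

    # A record is terminated by two CRLFs; anything else means we are
    # not actually at the end of the payload.
    for _ in range(2):
        found = stream.readline()
        if found != b"\r\n":
            raise IOError("Expected %r, found %r" % (b"\r\n", found))
    return headers, payload


def build_record(payload, declared_length):
    """Build a record whose Content-Length may deliberately disagree with the payload."""
    head = b"WARC/1.0\r\nContent-Length: " + str(declared_length).encode("ascii") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


# Length matches: the record reads cleanly.
headers, payload = read_record(io.BytesIO(build_record(b"hello", 5)))

# Only part of the payload is consumed: the terminator check hits payload bytes.
try:
    read_record(io.BytesIO(build_record(b"abc\r\nsoftware: Wget\r\n", 5)))
except IOError as exc:
    print(exc)
```

The second call fails in the same shape as the traceback in this issue: the reader stops too early and the next line it sees is leftover payload rather than the record terminator.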

Attached is a WARC with which I have seen this behavior.

10001.warc.gz

EDIT: One potential cause is how the WARCs are written. I think my error might be due to how I handle Unicode when downloading and writing the WARCs: decode errors handled by replacing the character can change the byte count of the content, so it no longer matches the length declared for the record.
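A small illustration of that byte-count drift, assuming the content is decoded as UTF-8 with `errors="replace"` and then re-encoded before being written. The sample bytes and the `roundtrip` helper are made up for the example:

```python
def roundtrip(raw):
    """Decode with replacement and re-encode, as a lossy writer might (hypothetical helper)."""
    text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")


# b"\xe9" is "é" in Latin-1 but is not valid UTF-8 on its own.
raw = b"caf\xe9 au lait"
rewritten = roundtrip(raw)

# The invalid byte becomes U+FFFD, which takes three bytes in UTF-8,
# so the content grows even though the character count is unchanged.
print(len(raw), len(rewritten))  # 12 14
```

If the length was taken from the original bytes (or from the server's response) while the rewritten bytes are what ends up in the file, the declared length and the actual payload disagree. Computing the length from the exact bytes written, or storing the raw bytes without decoding, should avoid the mismatch.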
