First of all, thank you for writing this, it's very useful!
It looks like it has an issue parsing the wget-created .warc.gz files I give it, though:
Traceback (most recent call last):
  File "./find-broken-links.py", line 16, in <module>
    for record in file:
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "/Library/Python/2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'software: Wget/1.14 (linux-gnueabihf)\r\n'
-1
Alas, I suspect fixing this elegantly is probably out of my depth. Is this something you can do?
Thank you,
Zoë.
Bump on this. WET and WAT files are handled fine, but for raw WARCs the library does not read through the entire record in read_record, so the expect calls in finish_reading_current_record fail.
Attached is a WARC with which I have seen this behavior.
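To illustrate the general failure mode, here is a minimal sketch of a record reader. It is not the warc library's actual code: `read_record` and `make_record` below are hypothetical helpers, and the record layout is simplified to a version line, headers, a blank line, the payload, and a trailing CRLF CRLF. It only shows how a reader that consumes fewer bytes than the real payload ends up finding payload data where it expects the record terminator.

```python
import io


def read_record(stream):
    """Read one simplified WARC-style record from a binary stream."""
    stream.readline()  # version line, e.g. b"WARC/1.0\r\n"
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b""):  # blank line ends the header block
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # Consume exactly as many payload bytes as the header declares.
    payload = stream.read(int(headers["Content-Length"]))

    # A record should end with CRLF CRLF after the payload.
    trailer = stream.read(4)
    if trailer != b"\r\n\r\n":
        raise IOError("Expected %r, found %r" % (b"\r\n\r\n", trailer))
    return headers, payload


def make_record(payload, declared_length):
    """Build a record whose Content-Length may differ from the real payload size."""
    return (
        b"WARC/1.0\r\n"
        b"WARC-Type: warcinfo\r\n"
        b"Content-Length: " + str(declared_length).encode("ascii") + b"\r\n"
        b"\r\n" + payload + b"\r\n\r\n"
    )


payload = b"software: Wget/1.14 (linux-gnueabihf)\r\n"
good = make_record(payload, len(payload))
bad = make_record(payload, len(payload) - 5)  # reader stops 5 bytes early

headers, body = read_record(io.BytesIO(good))

try:
    read_record(io.BytesIO(bad))
    failed = False
except IOError:
    failed = True
```

With the correct length the record parses cleanly; with the short length the reader hits leftover payload bytes instead of the terminator and raises, which is the same shape of error as in the traceback above.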
EDIT: One potential cause is the way the WARCs are written. I think my error might be due to how I handle Unicode when downloading and writing the WARCs, since decode errors handled by replacing the character may change the byte count of the content.
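A small sketch of that byte-count effect, assuming the content is decoded as UTF-8 with errors="replace" and then re-encoded before writing (the sample bytes are made up for illustration):

```python
# Latin-1 encoded "café": the final byte 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9"
declared_length = len(raw)  # what a Content-Length based on the original bytes would say

# Decoding with errors="replace" swaps the bad byte for U+FFFD.
text = raw.decode("utf-8", errors="replace")

# U+FFFD takes three bytes in UTF-8, so the re-encoded content is longer.
rewritten = text.encode("utf-8")
written_length = len(rewritten)

print(declared_length, written_length)
```

If the Content-Length header is computed from the original bytes but the replaced text is what gets written, the two lengths disagree (4 versus 6 here), and a reader that trusts the header will stop in the wrong place.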