.gz WARC files not properly read #21
Comments
Are your entries skipped, or does reading stop after the first n entries? I am fighting a similar issue with a WARC file created by Python's warc library: when I read it back (again with Python's warc), it reads only the first 249 entries and then stops.
Hello, I am having the same issue. In my case it does not seem to skip any records; it just reads up to a certain number and then stops. This makes it difficult for me to work with larger WARC files.
Yep, that's the problem. Stops after the first 200 or so entries.
I am posting a sample file that illustrates what I am talking about: This is a WAT format, but it will work with warc.py because a WAT is essentially a WARC. warc_file = open (UNT-sample.wat.warc.gz, "rb", "warc") If you look at the output, you'll see that the last element it reads has a WARC-Record-ID of "urn:uuid:b12431f1-b946-417f-bee0-babdc123f265", which is located approximately 12% into the file. So the code ignores the rest of the records. Hope this helps.
It seems that
works just fine.
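Work-around to read a gzipped WARC file completely:
    import os
    import warc
    f = warc.open(warcfile)
    fsize = os.path.getsize(warcfile)
    # The record iterator may stop before the end of the file, so re-enter
    # the loop until the read position reaches the file size.
    while fsize > f.tell():
        for record in f:
            ...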
@sebastian-nagel @everilae and I'm getting this error: File "lib/python2.7/site-packages/warc/warc.py", line 359, in finish_reading_current_record. Pasting the code snippet below.
@sebastian-nagel Can I randomly access a WARC file or a .gz WARC file with seek-like functions in this package?
f.seek(offset) does not seem to work on a WARC file. Any workarounds possible?
This could be done using warcio instead; see extract_record.
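For anyone who only needs the idea behind extracting a record at a known offset, here is a minimal stdlib-only sketch. It is not warcio's implementation and not part of this package. It assumes the .warc.gz was written with one gzip member per record and that `offset` is the byte position of a member start in the compressed file (for example, taken from an index). The function name `read_gzip_member_at` and the `chunk_size` parameter are made up for illustration.

```python
import zlib


def read_gzip_member_at(path, offset, chunk_size=64 * 1024):
    """Decompress the single gzip member that starts at byte `offset`.

    Hypothetical helper: assumes `offset` points at the start of a gzip
    member in the compressed file. zlib raises zlib.error otherwise.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    parts = []
    with open(path, "rb") as fh:
        fh.seek(offset)  # offset into the compressed file, not the raw data
        while not decomp.eof:
            chunk = fh.read(chunk_size)
            if not chunk:
                break  # truncated member, or offset at/after end of file
            # Decompression stops at the end of this member; bytes belonging
            # to later members are left in decomp.unused_data and ignored.
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

The returned bytes are the raw record (WARC headers plus payload), which still has to be parsed separately.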
I have the same problem with the file
I am using the utility My version of warc library is 0.2.1 and I am using Python 2.7
is reached after WARC record is entirely consumed, fixes internetarchive#21
When reading WARC files compressed with gzip, many of the contained entries are skipped or misread. To reproduce, use Common Crawl data in .gz format, count the number of entries found by the WARC library, and then count the number of occurrences of WARC/1.0 in the file. The two counts differ by a very large margin.
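The reference count for the reproduction can be obtained without the WARC library. A minimal sketch, using only Python's standard gzip module, which reads through all concatenated gzip members: the function name `count_warc_headers` is made up for illustration, and the count is only an approximation, since a payload line consisting of exactly WARC/1.0 would also be counted.

```python
import gzip


def count_warc_headers(path_or_fileobj):
    """Count lines that consist of exactly "WARC/1.0" in a gzipped file.

    Hypothetical helper: accepts a filename or a binary file object.
    """
    count = 0
    # gzip.open continues across gzip member boundaries, so every record in
    # a per-record-compressed .warc.gz file is visited.
    with gzip.open(path_or_fileobj, "rb") as stream:
        for line in stream:
            if line.rstrip(b"\r\n") == b"WARC/1.0":
                count += 1
    return count
```

Comparing this number with the number of records the library iterates over shows how many entries were missed.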