Currently, the fetcher writes only WARC and CDX files, while WAT and WET files are generated from the WARC files using CC's fork of the webarchive-commons library. In-lining the WAT/WET generation would make it possible to add the WAT/WET record offsets to the CDX index. This is a frequent request from Common Crawl users (e.g. 1, 2, 3).
Running the WAT/WET generation in the fetcher requires that it is sufficiently fast and absolutely robust; otherwise crawled data is lost.
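To illustrate the robustness requirement, here is a minimal sketch of how an extraction step could be isolated so that a failure in it never prevents the WARC record from being written. The class `SafeExtraction` and its method `tryExtract` are hypothetical names for illustration only; they are not part of the fetcher or of webarchive-commons.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper: runs an extraction step without letting its failures propagate. */
final class SafeExtraction {

    private SafeExtraction() {
    }

    /**
     * Applies the extractor to the input. Returns an empty Optional if the
     * extractor throws or returns null, so the caller can still write the
     * WARC record and simply skip the WAT/WET record.
     */
    static <I, O> Optional<O> tryExtract(Function<I, O> extractor, I input) {
        try {
            return Optional.ofNullable(extractor.apply(input));
        } catch (RuntimeException | StackOverflowError e) {
            // A real fetcher would log the failure here; the sketch just drops it.
            return Optional.empty();
        }
    }
}
```

This only covers failures that surface as exceptions (or a stack overflow in a deeply recursive parser). Guarding against extraction that is too slow would additionally need a time budget, which the sketch does not attempt.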
- profile the WAT/WET extractor and improve its performance, see "WAT/WET generator performance improvements" (ia-web-commons#15). Note: if ready-made data structures are used instead of re-reading WARC records (see next point), the WAT/WET extraction should become faster without any further changes.
- make the WAT/WET extraction (WEATGenerator, ResourceFactory implementations, see mapper) callable without the need to pass the WARC record as an argument:
  - avoid decompressing and parsing the WARC record
  - use ready-made objects instead: payload `byte[]`, HTTP headers
- detect the charset once and use it for both language detection and WAT/WET extraction
- make use of objects not present in WARC response records (e.g. store the detected language in WET files)
- (in the long term) add non-HTML documents (PDF, office) to WET (WAT?)
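The points above could look roughly like the following sketch: the extractor receives objects the fetcher already holds (payload `byte[]`, HTTP headers, the charset detected once, the detected language) instead of a WARC record. All type and method names here (`FetchedDocument`, `InlineTextExtractor`, `DecodeOnlyExtractor`, `WetRecordSketch`) are hypothetical and do not exist in the fetcher or in webarchive-commons. The header name `WARC-Identified-Content-Language` is likewise only an assumption about how the detected language might be stored.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical container for data the fetcher already has in memory. */
final class FetchedDocument {
    final String url;
    final Map<String, String> httpHeaders;
    final byte[] payload;    // HTTP response body, already decoded from the transfer
    final Charset charset;   // detected once, shared by all consumers
    final String language;   // detected language, may be null

    FetchedDocument(String url, Map<String, String> httpHeaders, byte[] payload,
                    Charset charset, String language) {
        this.url = url;
        this.httpHeaders = httpHeaders;
        this.payload = payload;
        this.charset = charset;
        this.language = language;
    }
}

/** Hypothetical extractor interface that takes no WARC record argument. */
interface InlineTextExtractor {
    String extractText(FetchedDocument doc);
}

/** Placeholder implementation: only decodes the payload, no HTML parsing. */
final class DecodeOnlyExtractor implements InlineTextExtractor {
    @Override
    public String extractText(FetchedDocument doc) {
        // Reuses the charset detected earlier instead of detecting it again.
        return new String(doc.payload, doc.charset);
    }
}

/** Builds illustrative WET-style record headers from the in-memory objects. */
final class WetRecordSketch {
    static Map<String, String> headersFor(FetchedDocument doc, String text) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("WARC-Type", "conversion");
        headers.put("WARC-Target-URI", doc.url);
        if (doc.language != null) {
            // Assumed header name; the language is not available from the WARC response record.
            headers.put("WARC-Identified-Content-Language", doc.language);
        }
        headers.put("Content-Type", "text/plain");
        int length = text.getBytes(StandardCharsets.UTF_8).length;
        headers.put("Content-Length", String.valueOf(length));
        return headers;
    }
}
```

With such an interface, neither decompression nor WARC parsing is needed on the extraction path, and the record offsets would be known at write time, so they could be passed on to the CDX writer.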