WarcRecordWriter to write and index WAT/WET files #9

sebastian-nagel · 2019-07-04T13:42:18Z

Currently, the fetcher writes only WARC and CDX files, while WAT and WET files are generated from the WARC files using CC's fork of the webarchive-commons library. In-lining the WAT/WET generation would make it possible to add the WAT/WET record offsets to the CDX index. This is a frequent wish from Common Crawl users (e.g. 1, 2, 3).
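To illustrate why in-lining makes the offsets available: if the writer compresses every WAT/WET record as its own gzip member and counts the bytes it has written, the offset and length of each record are known at write time and could be passed on to the CDX writer. The following is only a minimal sketch under that assumption; `OffsetTrackingWriter` and `Position` are hypothetical names and not part of the existing WarcRecordWriter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: writes records as separate gzip members and reports their position. */
class OffsetTrackingWriter {

    /** Offset and compressed length of one record within the output file. */
    static final class Position {
        final long offset;
        final long length;

        Position(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final OutputStream out;
    private long bytesWritten = 0;

    OffsetTrackingWriter(OutputStream out) {
        this.out = out;
    }

    /** Compresses one record as its own gzip member and returns where it landed. */
    Position write(byte[] record) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(record);
        }
        byte[] member = buffer.toByteArray();
        long offset = bytesWritten;
        out.write(member);
        bytesWritten += member.length;
        return new Position(offset, member.length);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wet = new ByteArrayOutputStream();
        OffsetTrackingWriter writer = new OffsetTrackingWriter(wet);
        Position first = writer.write("first record".getBytes(StandardCharsets.UTF_8));
        Position second = writer.write("second record".getBytes(StandardCharsets.UTF_8));
        System.out.println("first:  offset=" + first.offset + " length=" + first.length);
        System.out.println("second: offset=" + second.offset + " length=" + second.length);

        // A consumer can fetch a single record using only offset and length.
        byte[] file = wet.toByteArray();
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(file, (int) second.offset, (int) second.length))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

The returned `Position` values are what would have to reach the CDX writer alongside the WARC offset and length of the same capture.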

Running the WAT/WET generation in the fetcher requires that it is sufficiently fast and absolutely robust; otherwise crawled data is lost.

- profile the Fetcher reducer and WARC writer and improve performance, see WarcRecordWriter performance improvements #8
- profile the WAT/WET extractor and improve performance, see WAT/WET generator performance improvements ia-web-commons#15. Note: if ready data structures are used instead of re-reading WARC records (see next point), the WAT/WET extraction should be faster without any changes.
- make WAT/WET extraction (WEATGenerator, ResourceFactory implementations, see mapper) callable without the need to pass the WARC record as an argument:
  - avoid decompressing and parsing the WARC record
  - use ready objects instead: payload byte[], HTTP headers
  - detect the charset once, use it for language detection and WAT/WET extraction
  - make use of objects not present in WARC response records (e.g. store the detected language in WET files)
  - (in the long term) add non-HTML documents (PDF, office) to WET (WAT?)
- push improvements upstream from ia-web-commons to webarchive-commons
- add WAT/WET record offsets and lengths to CDX
  - WAT files also contain records for WARC request and metadata records - skip these?
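The "ready objects" idea from the list above could look roughly like the sketch below: the payload, HTTP headers, charset and detected language are handed over as in-memory objects, the payload is decoded exactly once, and a WET-style `conversion` record is assembled without touching a WARC record. All class and method names are hypothetical, the text extractor is a placeholder for the real one, the `WARC-Identified-Content-Language` header name is an assumption, and mandatory fields such as `WARC-Record-ID` and `WARC-Date` are left out for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: builds a WET-style record from objects the fetcher already holds. */
class InlineWetBuilder {

    /** Hypothetical container for the in-memory fetch result. */
    static final class FetchedDocument {
        final String url;
        final byte[] payload;                  // HTTP response body
        final Map<String, String> httpHeaders; // parsed HTTP headers (would feed WAT extraction)
        final Charset charset;                 // detected once, reused everywhere
        final String language;                 // detected language, may be null

        FetchedDocument(String url, byte[] payload, Map<String, String> httpHeaders,
                        Charset charset, String language) {
            this.url = url;
            this.payload = payload;
            this.httpHeaders = httpHeaders;
            this.charset = charset;
            this.language = language;
        }
    }

    // Placeholder for the real HTML-to-text extraction.
    private final Function<String, String> textExtractor;

    InlineWetBuilder(Function<String, String> textExtractor) {
        this.textExtractor = textExtractor;
    }

    /** Decodes the payload once and returns the bytes of a simplified conversion record. */
    byte[] buildRecord(FetchedDocument doc) {
        String html = new String(doc.payload, doc.charset);
        byte[] body = textExtractor.apply(html).getBytes(StandardCharsets.UTF_8);

        StringBuilder header = new StringBuilder();
        header.append("WARC/1.0\r\n");
        header.append("WARC-Type: conversion\r\n");
        header.append("WARC-Target-URI: ").append(doc.url).append("\r\n");
        if (doc.language != null) {
            // Header name is an assumption made for this sketch.
            header.append("WARC-Identified-Content-Language: ").append(doc.language).append("\r\n");
        }
        header.append("Content-Type: text/plain\r\n");
        header.append("Content-Length: ").append(body.length).append("\r\n\r\n");

        ByteArrayOutputStream record = new ByteArrayOutputStream();
        record.writeBytes(header.toString().getBytes(StandardCharsets.UTF_8));
        record.writeBytes(body);
        record.writeBytes("\r\n\r\n".getBytes(StandardCharsets.UTF_8));
        return record.toByteArray();
    }

    public static void main(String[] args) {
        // Naive tag stripping stands in for the real extractor.
        InlineWetBuilder builder = new InlineWetBuilder(html -> html.replaceAll("<[^>]+>", ""));
        FetchedDocument doc = new FetchedDocument(
                "https://example.com/",
                "<p>hello</p>".getBytes(StandardCharsets.UTF_8),
                Map.of("Content-Type", "text/html"),
                StandardCharsets.UTF_8,
                "eng");
        System.out.print(new String(builder.buildRecord(doc), StandardCharsets.UTF_8));
    }
}
```

Because the extractor only sees decoded text and plain objects, the same call could be made from the fetcher and from a job that re-reads WARC files, as long as the latter fills in a `FetchedDocument` first.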

sebastian-nagel added enhancement help wanted labels Jul 4, 2019
