WarcRecordWriter to write and index WAT/WET files #9

Open
5 tasks
sebastian-nagel opened this issue Jul 4, 2019 · 0 comments

Currently, the fetcher writes only WARC and CDX files, while WAT and WET files are generated afterwards from the WARC files using CC's fork of the webarchive-commons library. In-lining the WAT/WET generation would make it possible to add the WAT/WET record offsets to the CDX index. This is a frequent wish from Common Crawl users (e.g. 1, 2, 3).

Running the WAT/WET generation in the fetcher requires that it be sufficiently fast and absolutely robust; otherwise crawled data is lost.

  • profile Fetcher reducer and WARC writer, improve performance, see WarcRecordWriter performance improvements #8
  • profile WAT/WET extractor and improve performance, see WAT/WET generator performance improvements ia-web-commons#15. Note: if objects already held in memory are used instead of re-reading WARC records (see next point), the WAT/WET extraction should be faster without any further changes.
  • make WAT/WET extraction (WEATGenerator, ResourceFactory implementations, see mapper) callable without the need to pass the WARC record as an argument:
    • avoid decompressing and parsing the WARC record
    • use objects already in memory instead: payload byte[], HTTP headers
    • detect charset once, use it for language detection and WAT/WET extraction
    • make use of objects not present in WARC response records (e.g. store the detected language in WET files)
    • (in the long term) add non-HTML documents (PDF, office) to WET (WAT?)
  • push improvements upstream from ia-web-commons to webarchive-commons
  • add WAT/WET record offsets and lengths to CDX
    • WAT files also contain records for WARC request and metadata records - skip these?
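
To make the third and fifth tasks more concrete, here is a minimal sketch of how a text record could be written from objects the fetcher already holds (payload `byte[]`, HTTP headers, a charset detected once) while the record's offset and length are captured for the CDX index. All class and method names (`InlineWetSketch`, `FetchedDocument`, `CountingOutputStream`, `RecordLocation`, `writeTextRecord`) are hypothetical and not part of ia-web-commons or the fetcher. The record layout is simplified, is not valid WARC/WET, and per-record gzip compression is omitted, so the offsets here are uncompressed byte positions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class InlineWetSketch {

    /** Hypothetical holder for data the fetcher already has in memory. */
    static final class FetchedDocument {
        final String url;
        final Map<String, String> httpHeaders;
        final byte[] payload;
        final Charset charset; // detected once, reused by all extractors

        FetchedDocument(String url, Map<String, String> httpHeaders,
                        byte[] payload, Charset charset) {
            this.url = url;
            this.httpHeaders = httpHeaders;
            this.payload = payload;
            this.charset = charset;
        }
    }

    /** Counts the bytes written so far, so record offsets can be derived. */
    static final class CountingOutputStream extends OutputStream {
        private final OutputStream out;
        private long count;

        CountingOutputStream(OutputStream out) { this.out = out; }

        @Override public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len);
            count += len;
        }

        long getCount() { return count; }
    }

    /** Offset and length of one written record, as needed for a CDX entry. */
    static final class RecordLocation {
        final long offset;
        final long length;

        RecordLocation(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Writes a simplified text record without re-reading any WARC record. */
    static RecordLocation writeTextRecord(FetchedDocument doc, CountingOutputStream out)
            throws IOException {
        long start = out.getCount();
        // Decode with the charset detected earlier; real text extraction is omitted.
        byte[] body = new String(doc.payload, doc.charset).getBytes(StandardCharsets.UTF_8);
        String header = "Target-URI: " + doc.url + "\r\n"
                + "Content-Length: " + body.length + "\r\n\r\n";
        out.write(header.getBytes(StandardCharsets.UTF_8));
        out.write(body);
        out.write("\r\n\r\n".getBytes(StandardCharsets.UTF_8));
        return new RecordLocation(start, out.getCount() - start);
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "text/html; charset=ISO-8859-1");
        FetchedDocument doc = new FetchedDocument("http://example.com/", headers,
                "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);

        CountingOutputStream out = new CountingOutputStream(new ByteArrayOutputStream());
        RecordLocation first = writeTextRecord(doc, out);
        RecordLocation second = writeTextRecord(doc, out);
        System.out.println("first:  offset=" + first.offset + " length=" + first.length);
        System.out.println("second: offset=" + second.offset + " length=" + second.length);
    }
}
```

In a real writer the counting stream would have to sit below the gzip layer, so that the recorded offsets point to the start of each compressed record member in the WAT/WET file.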