Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 0.4.0

21 Jul 06:28

This release includes many new features, including:

  • YAML-based config, specifiable via the --config option or via stdin (with --config stdin)
  • Support for different scope types ('page', 'prefix', 'host', 'any', 'none'), plus crawl depth set at the crawl level
  • Per-seed scoping: scope type, depth, and include/exclude rules configurable per seed in the 'seeds' list of the YAML config
  • Support for 'blockRules' to block certain URLs from being stored in WARCs, with conditional blocking of iframes based on their contents or iframe URLs (see README for more details)
  • Interactive profile creation: create profiles by interacting with an embedded browser (see README for more details)
  • Screencasting: stream the output of each browser window over a websocket, configurable with the --screencastPort option
  • New window-based parallelization: each worker opens in a new window within the same browser session
  • Simplified custom driver config; the default driver calls 'loadPage'
  • Refactored argument parsing and other auxiliary functions into separate utils files
  • Image customization: support for customizing the browser image, e.g. building with Chromium instead of Chrome, and support for ARM architecture builds (see README for more details)
  • Updated to latest pywb (2.5.0b4), browsertrix-behaviors (0.2.3), and py-wacz (0.3.1)
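
A crawl config combining these features might look like the following sketch. The key names mirror the CLI options named above, but the exact schema (and the blockRules fields) is documented in the README, so treat the details here as illustrative assumptions rather than a verbatim sample:

```yaml
# crawl-config.yaml -- illustrative sketch; see the README for the exact schema
seeds:
  - url: https://example.com/
    scopeType: prefix        # one of: page, prefix, host, any, none
    depth: 2
  - url: https://example.org/blog/
    scopeType: host
    include:                 # per-seed include/exclude rules
      - /blog/
    exclude:
      - /blog/private/

# crawl-level default applied to seeds that don't override it
scopeType: page

# block matching URLs from being stored in WARCs
blockRules:
  - url: googleanalytics\.com
```

A file like this can be passed with --config crawl-config.yaml, or piped in with --config stdin.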

Browsertrix Crawler 0.4.0 Beta 2

28 Jun 22:07
ef7d5e5
Pre-release

Support for per-seed scoping (#63)

YAML config:

  • Fixes for behavior and other options to work with YAML config
  • Support for passing YAML config via stdin

New Docker image, with support for customizing the browser image (including multi-arch builds)
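
Passing the config on stdin might look like the following sketch (the config file name is an assumption; -i keeps stdin open for the container):

```
cat crawl-config.yaml | docker run -i webrecorder/browsertrix-crawler crawl --config stdin
```

This avoids having to mount the config file into the container.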

Browsertrix Crawler 0.4.0 Beta 1

24 Jun 22:23
Pre-release

Support for screencasting mode for debugging, via the --screencastPort option.
Support for YAML-based config of all options, including specifying multiple seeds via --seeds or the seeds key.

Browsertrix Crawler 0.3.2

13 May 15:13
63376ab

Changes for this version:

  • Added a --urlFile option: allows users to specify a text file containing a list of exact URLs to crawl (one URL per line)
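
For example, a URL list file for --urlFile is just one URL per line (the file name and URLs here are illustrative):

```
https://example.com/page-1
https://example.com/page-2
https://example.org/about
```

It can then be passed as, e.g., --urlFile urls.txt, and each listed URL is crawled.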

Released image published to DockerHub at webrecorder/browsertrix-crawler:0.3.2

Browsertrix Crawler 0.3.1

04 May 20:43

Features include:

  • Improved shutdown wait: Instead of waiting for 5 secs, wait until all pending requests are written to WARCs (#47, #44)
  • Link extraction includes links in all frames (#48, #45)
  • Bug fix: use async APIs for combining WARCs to avoid spurious issues with multiple crawls (#49, #50)
  • Behaviors: updated Browsertrix Behaviors to 0.2.1, with support for Facebook pages (#46)

Released image published to DockerHub at webrecorder/browsertrix-crawler:0.3.1

Browsertrix Crawler 0.3.0

14 Apr 22:58

New features include:

  • --combineWARC and --rolloverSize options for generating a single combined WARC
  • Support for creating and running crawl with a login profile tarball (see README for more info)
  • Support for using Browsertrix Behaviors v0.1.1 for in-page behaviors
  • Customizable logging options via --logging, including behavior log, behavior debug log, pywb log, and crawl stats (default)
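
A crawl invocation combining these options might look like the following sketch (the seed URL, rollover size, and logging values are illustrative assumptions; see the README for the supported values):

```
docker run webrecorder/browsertrix-crawler:0.3.0 crawl \
  --url https://example.com/ \
  --combineWARC \
  --rolloverSize 1000000000 \
  --logging stats,behaviors
```

Here --rolloverSize caps the size (in bytes) at which the combined WARC rolls over to a new file.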

Published to DockerHub at webrecorder/browsertrix-crawler:0.3.0