Releases · webrecorder/browsertrix-crawler
Browsertrix Crawler 0.4.0
This release includes many new features, including:
- YAML-based config, specifiable via the --config option or via stdin (with --config stdin)
- Support for different scope types ('page', 'prefix', 'host', 'any', 'none') + crawl depth at the crawl level
- Per-Seed scoping, including different scope types, or depth and include/exclude rules configurable per seed in 'seeds' list via YAML config
- Support for 'blockRules' for blocking certain URLs from being stored in WARCs, including conditional blocking of iframes based on their contents or iframe URLs (see README for more details)
- Interactive profile creation: create profiles by interacting with an embedded browser (see README for more details)
- Screencasting: stream the output of each browser window over a websocket, configurable with the --screencastPort option
- New 'window'-based parallelization: open each worker in a new window in the same browser session
- Simplified custom driver config; the default driver calls 'loadPage'
- Refactored arg parsing and other auxiliary functions into separate utils files
- Image customization: support for customizing the browser image, e.g. building with Chromium instead of Chrome, and support for ARM architecture builds (see README for more details)
- Update to latest pywb (2.5.0b4), browsertrix-behaviors (0.2.3), py-wacz (0.3.1)
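As a rough illustration of the YAML config and per-seed scoping described above, a crawl config might look like the following sketch (top-level option names mirror the CLI flags; the exact per-seed field names and accepted values are documented in the README, and the URLs here are placeholders):

```yaml
# Sketch of a crawl config; option names mirror CLI flags (see README)
seeds:
  # a simple seed, inheriting the crawl-level defaults below
  - https://example.com/

  # a seed with per-seed scope type, depth, and exclusion rules
  - url: https://example.org/blog/
    scopeType: prefix
    depth: 2
    exclude:
      - ".*\\?print=1"

# crawl-level defaults applied to seeds without their own settings
scopeType: host
depth: 3
```

The config can be passed with --config path/to/config.yaml, or piped to the crawler with --config stdin.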
Browsertrix Crawler 0.4.0 Beta 2
Support for per-seed scoping (#63)
YAML Config:
- Fixes for behavior and other options to work with YAML config
- Support passing YAML config via stdin
New Docker Image, support for customizing browser image (support for multi-arch builds)
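Passing the YAML config via stdin might look like the following sketch (the config filename and mounted output directory are illustrative):

```shell
# Pipe a YAML config to the crawler via stdin; crawl-config.yaml is
# an illustrative filename, and ./crawls is the mounted output dir
cat crawl-config.yaml | docker run -i -v $PWD/crawls:/crawls/ \
  webrecorder/browsertrix-crawler crawl --config stdin
```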
Browsertrix Crawler 0.4.0 Beta 1
Support for screencasting mode for debugging with the --screencastPort option.
Support for YAML-based config of all options, including specifying multiple seeds via the --seeds option or the seeds key.
Browsertrix Crawler 0.3.2
Changes for this version:
- Added a --urlFile option: allows users to specify a text file which contains a list of exact URLs to crawl (one URL per line).
Released image published to DockerHub at webrecorder/browsertrix-crawler:0.3.2
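A hypothetical invocation using --urlFile could look like this sketch (urls.txt and the in-container mount path are illustrative assumptions, not taken from the release notes):

```shell
# Mount a local URL list into the container and crawl each listed URL;
# urls.txt holds one exact URL per line (filename and mount path are
# illustrative)
docker run -v $PWD/crawls:/crawls/ -v $PWD/urls.txt:/app/urls.txt \
  webrecorder/browsertrix-crawler:0.3.2 crawl --urlFile /app/urls.txt
```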
Browsertrix Crawler 0.3.1
Features Include:
- Improved shutdown wait: Instead of waiting for 5 secs, wait until all pending requests are written to WARCs (#47, #44)
- Link extraction includes links in all frames (#48, #45)
- Bug fix: use async APIs for combining WARCs to avoid spurious issues with multiple crawls (#49, #50)
- Behaviors: updated to Browsertrix Behaviors 0.2.1, with support for Facebook pages (#46)
Released image published to DockerHub at webrecorder/browsertrix-crawler:0.3.1
Browsertrix Crawler 0.3.0
New features include:
- --combineWARC and --rolloverSize options for generating a combined single WARC
- Support for creating and running a crawl with a login profile tarball (see README for more info)
- Support for using Browsertrix Behaviors v0.1.1 for in-page behaviors
- Customizable logging options via --logging, including behavior log, behavior debug log, pywb log, and crawl stats (default)
Published to DockerHub at webrecorder/browsertrix-crawler:0.3.0
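Combining the options above, a crawl that rolls combined WARCs over at a size threshold and enables extra logging might be invoked like this sketch (the seed URL, rollover size, and the exact set of accepted --logging values are illustrative; consult the README or --help for the definitive list):

```shell
# Sketch: generate a combined WARC with rollover at ~1 GB and enable
# behavior + pywb logging alongside the default crawl stats
# (values are illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:0.3.0 \
  crawl --url https://example.com/ \
  --combineWARC --rolloverSize 1000000000 \
  --logging stats,behaviors,pywb
```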