Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 0.4.0

21 Jul 06:28

This release includes many new features, including:

  • YAML-based config, specifiable via the --config option or via stdin (with --config stdin)
  • Support for different scope types ('page', 'prefix', 'host', 'any', 'none'), plus crawl depth set at the crawl level
  • Per-seed scoping: scope type, depth, and include/exclude rules configurable per seed in the 'seeds' list of the YAML config
  • Support for 'blockRules' to block certain URLs from being stored in WARCs, with conditional blocking of iframes based on their contents or iframe URLs (see README for more details)
  • Interactive profile creation: create profiles by interacting with an embedded browser (see README for more details)
  • Screencasting: stream the output of each browser window over a websocket, configurable with the --screencastPort option
  • New window-based parallelization: each worker opens in a new window within the same browser session
  • Simplified custom driver config; the default driver calls 'loadPage'
  • Refactored argument parsing and other auxiliary functions into separate utils files
  • Image customization: support for customizing the browser image, e.g. building with Chromium instead of Chrome, and support for ARM architecture builds (see README for more details)
  • Updated to latest pywb (2.5.0b4), browsertrix-behaviors (0.2.3), and py-wacz (0.3.1)
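
A crawl config combining these features might look like the following sketch. The key names mirror the CLI options named above, but the exact schema (and the blockRules fields) is documented in the README, so treat the details here as illustrative assumptions rather than a verbatim sample:

```yaml
# crawl-config.yaml -- illustrative sketch; see the README for the exact schema
seeds:
  - url: https://example.com/
    scopeType: prefix        # one of: page, prefix, host, any, none
    depth: 2
  - url: https://example.org/blog/
    scopeType: host
    include:                 # per-seed include/exclude rules
      - /blog/
    exclude:
      - /blog/private/

# crawl-level default applied to seeds that don't override it
scopeType: page

# block matching URLs from being stored in WARCs
blockRules:
  - url: googleanalytics\.com
```

A file like this can be passed with --config crawl-config.yaml, or piped in with --config stdin.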

Browsertrix Crawler 0.4.0 Beta 2

28 Jun 22:07
ef7d5e5
Pre-release

Support for per-seed scoping (#63)

YAML config:

  • Fixes for behavior and other options to work with YAML config
  • Support for passing YAML config via stdin

New Docker image, with support for customizing the browser image (including multi-arch builds)
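
Passing the config on stdin might look like the following sketch (the config file name is an assumption; -i keeps stdin open for the container):

```
cat crawl-config.yaml | docker run -i webrecorder/browsertrix-crawler crawl --config stdin
```

This avoids having to mount the config file into the container.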

Browsertrix Crawler 0.4.0 Beta 1

24 Jun 22:23
Pre-release

Support for screencasting mode for debugging, via the --screencastPort option.
Support for YAML-based config of all options, including specifying multiple seeds via --seeds or the seeds key.

Browsertrix Crawler 0.3.2

13 May 15:13
63376ab

Changes for this version:

  • Added a --urlFile option: allows users to specify a text file containing a list of exact URLs to crawl (one URL per line)
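
For example, a URL list file for --urlFile is just one URL per line (the file name and URLs here are illustrative):

```
https://example.com/page-1
https://example.com/page-2
https://example.org/about
```

It can then be passed as, e.g., --urlFile urls.txt, and each listed URL is crawled.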

Released image published to DockerHub at webrecorder/browsertrix-crawler:0.3.2

Browsertrix Crawler 0.3.1

04 May 20:43

Features include:

  • Improved shutdown wait: Instead of waiting for 5 secs, wait until all pending requests are written to WARCs (#47, #44)
  • Link extraction includes links in all frames (#48, #45)
  • Bug fix: use async APIs for combining WARCs to avoid spurious issues with multiple crawls (#49, #50)
  • Behaviors: updated Browsertrix Behaviors to 0.2.1, with support for Facebook pages (#46)

Released image published to DockerHub at webrecorder/browsertrix-crawler:0.3.1

Browsertrix Crawler 0.3.0

14 Apr 22:58

New features include:

  • --combineWARC and --rolloverSize options for generating a single combined WARC
  • Support for creating and running crawl with a login profile tarball (see README for more info)
  • Support for using Browsertrix Behaviors v0.1.1 for in-page behaviors
  • Customizable logging options via --logging, including behavior log, behavior debug log, pywb log, and crawl stats (default)
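
A crawl invocation combining these options might look like the following sketch (the seed URL, rollover size, and logging values are illustrative assumptions; see the README for the supported values):

```
docker run webrecorder/browsertrix-crawler:0.3.0 crawl \
  --url https://example.com/ \
  --combineWARC \
  --rolloverSize 1000000000 \
  --logging stats,behaviors
```

Here --rolloverSize caps the size (in bytes) at which the combined WARC rolls over to a new file.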

Published to DockerHub at webrecorder/browsertrix-crawler:0.3.0