Per-Seed Scoping Rules + Crawl Depth (#63)
* scoped seeds:
- support per-seed scoping (include + exclude), allowHash, depth, and sitemap options
- support maxDepth per seed #16
- combine --url, --seed and --urlFile/--seedFile urls into a unified seed list

arg parsing:
- simplify seed file options into --seedFile/--urlFile, move option in help display
- rename --maxDepth -> --depth, supported globally and per seed
- ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation)
- update to latest js-yaml
- rename --yamlConfig -> --config
- config: support reading config from stdin if --config set to 'stdin'

* scope: fix typo in 'prefix' scope

* update browsertrix-behaviors to 0.2.2

* tests: add test for passing config via stdin, also adding --excludes via cmdline

* update README:
- latest cli, add docs on config via stdin
- rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position
- info on scoped seeds
- list current scope types
ikreymer authored Jun 26, 2021
1 parent f57818f commit ef7d5e5
Showing 10 changed files with 398 additions and 247 deletions.
85 changes: 65 additions & 20 deletions README.md
```
crawler [options]
Options:
--help Show help [boolean]
--version Show version number [boolean]
--seeds, --url The URL to start crawling from
[array] [default: []]
--seedFile, --urlFile If set, read a list of seed urls,
one per line, from the specified
[string]
-w, --workers The number of workers to run in
parallel [number] [default: 1]
--newContext The context for each new capture,
can be a new: page, window, session
or browser.
[string] [default: "page"]
--waitUntil Puppeteer page.goto() condition to
wait for before continuing, can be
multiple separate by ','
[default: "load,networkidle0"]
--depth The depth of the crawl for all seeds
[number] [default: -1]
--limit Limit crawl to this number of pages
[number] [default: 0]
--timeout Timeout for each page to load (in
seconds) [number] [default: 90]
--scope Regex of page URLs that should be
included in the crawl (defaults to
the immediate directory of URL)
--scopeType Simplified scope for which URLs to
crawl, can be: prefix, page, host,
any [string] [default: "prefix"]
--exclude Regex of page URLs that should be
excluded from the crawl.
--allowHashUrls Allow Hashtag URLs, useful for
single-page-application crawling or
when different hashtags load dynamic
content
-c, --collection Collection name to crawl to (replay
will be accessible under this name
in pywb preview)
[string] [default: "capture-YYYY-MM-DDTHH-MM-SS"]
[string] [default: "capture-2021-06-26T19-38-10"]
--headless Run in headless mode, otherwise
start xvfb[boolean] [default: false]
--driver JS driver for the crawler
--profile Path to tar.gz file which will be
extracted and used as the browser
profile [string]
--screencastPort If set to a non-zero value, starts
an HTTP server with screencast
accessible on this port
[number] [default: 0]
--config Path to YAML config file
```
</details>

… while `--waitUntil networkidle0` may make sense for dynamic sites.
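For example, a crawl of a single host, limited to two hops from the seed and 100 pages, might look like the following; the URL, exclusion regex, and limits here are illustrative values, not defaults:

```
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl \
  --url https://example.com/ --scopeType host --depth 2 --limit 100 \
  --exclude "logout" --waitUntil networkidle0
```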

### YAML Crawl Config

Browsertrix Crawler supports the use of a YAML file to set parameters for a crawl. This can be used by passing a valid YAML file to the `--config` option.

The YAML file can contain the same parameters as the command-line arguments. If a parameter is set on the command-line and in the YAML file, the value from the command-line will be used. For example, the following should start a crawl with the config in `crawl-config.yaml` (where [VERSION] represents the version of the browsertrix-crawler image you're working with).

The config file must be passed as a volume. You can run a command similar to this.

```
docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --config /app/crawl-config.yaml
```

The config can also be passed via stdin, which can simplify the command. Note that this requires running `docker run` with the `-i` flag. To read the config from stdin, pass `--config stdin`:

```
cat ./crawl-config.yaml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --config stdin
```


An example config file might contain:

```
seeds:
combineWARCs: true
```
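Since the config accepts the same parameters as the command-line arguments, a fuller illustrative config might look like the following; the seed URLs and values are placeholders, not recommendations:

```
seeds:
  - https://example.com/
  - https://example.org/docs/

workers: 2
depth: 2
exclude: "logout"
combineWARCs: true
```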

The list of seeds can be loaded via an external file by specifying the filename via the `seedFile` config or command-line option.
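For instance, a hypothetical `seeds.txt` mounted into the container could hold the URLs, one per line:

```
https://example.com/
https://example.org/
```

and the config would then reference it instead of an inline `seeds` list:

```
seedFile: /app/seeds.txt
```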

#### Per Seed Settings

Certain settings such as scope type, scope includes and excludes, and depth can be configured per seed, for example:

```
seeds:
  - url: https://webrecorder.net/
    depth: 1
    type: "prefix"
```
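Per-seed include/exclude rules, hash handling, and sitemap support are also added in this change; a sketch of how such a seed entry might look, assuming the YAML keys mirror the option names in the commit message (the URL and regexes are illustrative):

```
seeds:
  - url: https://example.com/blog/
    type: "host"
    include: 'example\.com/(blog|tags)/'
    exclude: '\?share='
    allowHash: true
    sitemap: true
    depth: 2
```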

### Scope Types

The crawl scope can be configured globally for all seeds, or customized per seed, by specifying the `--scopeType` command-line option or setting the `type` property for each seed.

The scope controls which linked pages are also included in the crawl.

The available types are:

- `page` - crawl only this page, but load any links that include different hashtags. Useful for single-page apps that may load different content based on hashtag.

- `prefix` - crawl any pages in the same directory, e.g. starting from `https://example.com/path/page.html`, crawl anything under `https://example.com/path/` (default)

- `host` - crawl pages that share the same host.

- `any` - crawl any and all pages.

- `none` - don't crawl any additional pages besides the seed.


The `depth` setting limits how many link hops from the seed will be crawled for that seed, while the `limit` option sets the total
number of pages crawled across all seeds.
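Putting these together, a hypothetical config might set a global scope type and page limit, then override the scope and depth for one seed:

```
scopeType: host
limit: 500

seeds:
  - url: https://example.com/
  - url: https://example.org/docs/
    type: "prefix"
    depth: 3
```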


### Behaviors

Browsertrix Crawler also supports automatically running customized in-browser behaviors. The behaviors auto-play videos (when possible),