URL spider which crawls a page and all its subpages
Make sure you have Composer installed. Then execute:
composer require baqend/spider
This package requires at least PHP 5.5.9 and has no package dependencies!
The entry point is the Spider class. For it to work, it requires the following services:
- Queue: Collects URLs to be processed. This package comes with a breadth-first and a depth-first implementation.
- URL Handler: Checks whether a URL should be processed. If no URL handler is provided, every URL is processed. More about URL handlers below.
- Downloader: Takes URLs and downloads them. To keep the package free of a dependency on an HTTP client library like Guzzle, you have to implement this interface yourself (a minimal sketch follows this list).
- Processor: Retrieves downloaded assets and performs operations on them. More about processors below.
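As a rough sketch, a downloader can be implemented with plain file_get_contents so that no HTTP client library is needed. Note that the interface namespace, the download() method signature, and the Asset class used below are assumptions based on this README; check DownloaderInterface in the package for the exact contract.
<?php
use Baqend\Component\Spider\Asset;
use Baqend\Component\Spider\Downloader\DownloaderInterface;
// Hypothetical implementation: the method name, return value and the Asset
// constructor are assumptions – verify them against DownloaderInterface.
class MyDownloader implements DownloaderInterface
{
    public function download($url)
    {
        // Fetch the raw response body with plain PHP, no Guzzle needed
        $contents = file_get_contents($url);
        if ($contents === false) {
            throw new \RuntimeException('Could not download '.$url);
        }
        // Wrap URL and contents into an asset for the processors
        return new Asset($url, $contents);
    }
}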
You initialize the spider in the following way:
<?php
use Baqend\Component\Spider\Processor;
use Baqend\Component\Spider\Queue\BreadthQueue;
use Baqend\Component\Spider\Spider;
use Baqend\Component\Spider\UrlHandler\BlacklistUrlHandler;
// Use the breadth-first queue
$queue = new BreadthQueue();
// Implement the DownloaderInterface
$downloader = new MyDownloader(); // your DownloaderInterface implementation, e.g. the sketch above
// Create a URL handler, e.g. the provided blacklist URL handler
$urlHandler = new BlacklistUrlHandler(['**.php']);
// Create some processors which will be executed one after another
// More details on the processors below!
$processor = new Processor\Processor();
$processor->addProcessor(new Processor\UrlRewriteProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor($cssProcessor = new Processor\CssProcessor());
$processor->addProcessor(new Processor\HtmlProcessor($cssProcessor));
$processor->addProcessor(new Processor\ReplaceProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor(new Processor\StoreProcessor('https://example.com/archive', '/tmp/output'));
// Create the spider instance
$spider = new Spider($queue, $downloader, $urlHandler, $processor);
// Enqueue some URLs
$spider->queue('https://example.org/index.html');
$spider->queue('https://example.org/news/other-landingpage.html');
// Execute the crawling
$spider->crawl();
This package comes with the following built-in processors.
The Processor is an aggregate processor: it allows adding and removing other processors, which it executes one after the other.
<?php
use Baqend\Component\Spider\Processor\Processor;
$processor = new Processor();
$processor->addProcessor($firstProcessor);
$processor->addProcessor($secondProcessor);
$processor->addProcessor($thirdProcessor);
// This will call `process` on $firstProcessor, $secondProcessor, and finally on $thirdProcessor:
$processor->process($asset, $queue);
The HtmlProcessor processes HTML assets and enqueues the URLs they contain.
It also rewrites all relative URLs to absolute ones.
If you additionally provide a CssProcessor, style attributes are detected and the URLs within their CSS are resolved as well.
The CssProcessor processes CSS assets and enqueues the URLs they contain, taken from @import and url(...) statements.
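To show how the two work together, here is a minimal wiring using the same constructor arguments as in the example at the top: the CssProcessor handles @import and url(...) references, and handing it to the HtmlProcessor lets style attributes be processed as well.
<?php
use Baqend\Component\Spider\Processor\CssProcessor;
use Baqend\Component\Spider\Processor\HtmlProcessor;
use Baqend\Component\Spider\Processor\Processor;
// Resolves and enqueues URLs found in @import and url(...) statements
$cssProcessor = new CssProcessor();
// Parses HTML, makes relative URLs absolute, enqueues found URLs and
// delegates style attributes to the CssProcessor
$htmlProcessor = new HtmlProcessor($cssProcessor);
// Register both on the aggregate processor
$processor = new Processor();
$processor->addProcessor($cssProcessor);
$processor->addProcessor($htmlProcessor);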
The ReplaceProcessor performs simple str_replace operations on asset contents:
<?php
use Baqend\Component\Spider\Processor\ReplaceProcessor;
$processor = new ReplaceProcessor('Hello World', 'Hallo Welt');
// This will replace all occurrences of
// "Hello World" in the asset with "Hallo Welt":
$processor->process($asset, $queue);
The ReplaceProcessor does not enqueue other URLs.
The StoreProcessor takes a URL prefix and a directory and stores every asset whose URL lies under that prefix in the corresponding file structure inside the directory.
The StoreProcessor does not enqueue other URLs.
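As an illustration, using the constructor arguments from the example at the top; the resulting file path is inferred from the description above, not taken from the package documentation.
<?php
use Baqend\Component\Spider\Processor\StoreProcessor;
// Store everything below https://example.com/archive into /tmp/output
$processor = new StoreProcessor('https://example.com/archive', '/tmp/output');
// An asset with the URL https://example.com/archive/css/app.css
// would be written to /tmp/output/css/app.css
$processor->process($asset, $queue);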
The UrlRewriteProcessor changes the URL of an asset to another prefix. Use it to let the HtmlProcessor and CssProcessor resolve relative URLs against a different origin.
The UrlRewriteProcessor does not enqueue other URLs.
It also does not modify the asset's content, only its URL.
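A short sketch with the prefixes used in the example at the top; the resulting URL is inferred from the description above.
<?php
use Baqend\Component\Spider\Processor\UrlRewriteProcessor;
// Rewrite asset URLs from https://example.org to https://example.com/archive
$processor = new UrlRewriteProcessor('https://example.org', 'https://example.com/archive');
// An asset downloaded from https://example.org/news/index.html is afterwards
// treated as https://example.com/archive/news/index.html; its content stays untouched
$processor->process($asset, $queue);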
URL handlers tell the spider whether to download and process a URL. There are the following built-in URL handlers:
Handles only URLs coming from a given origin, e.g. "https://example.org".
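Assuming the class is named OriginUrlHandler, lives in the same UrlHandler namespace as the BlacklistUrlHandler, and takes the origin as its constructor argument, usage would look like this; check the package's UrlHandler namespace for the exact name.
<?php
use Baqend\Component\Spider\UrlHandler\OriginUrlHandler;
// Only URLs on https://example.org will be downloaded and processed
$urlHandler = new OriginUrlHandler('https://example.org');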
The BlacklistUrlHandler does not handle URLs that match a given blacklist. You can provide the blacklist as glob patterns:
<?php
use Baqend\Component\Spider\UrlHandler\BlacklistUrlHandler;
$blacklist = [
'https://other.org/**', // Don't handle anything from other.org over HTTPS
'http{,s}://other.org/**', // Don't handle anything from other.org over HTTP or HTTPS
'**.{png,gif,jpg,jpeg}', // Don't handle any image files
];
$urlHandler = new BlacklistUrlHandler($blacklist);
If this project does not match your needs, check the following other projects:
- spatie/crawler (Requires PHP 7)
- vdb/php-spider