dachcom-digital/pimcore-dynamic-search-data-provider-crawler

Dynamic Search | Data Provider: Web Crawler

A spider crawler extension for Pimcore Dynamic Search.

Release Plan

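| Release | Supported Pimcore Versions | Supported Symfony Versions | Release Date | Maintained     | Branch |
|---------|----------------------------|----------------------------|--------------|----------------|--------|
| 3.x     | 11.0                       | ^6.2                       | 28.09.2023   | Feature Branch | master |
| 2.x     | 10.0 - 10.6                | ^5.4                       | 19.12.2021   | No             | 2.x    |
| 1.x     | 6.6 - 6.9                  | ^4.4                       | 18.04.2021   | No             | 1.x    |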

Installation

"require" : {
    "dachcom-digital/dynamic-search" : "~3.0.0",
    "dachcom-digital/dynamic-search-data-provider-crawler" : "~3.0.0"
}

Dynamic Search Bundle

You need to install and enable the Dynamic Search Bundle first. Read more about it here. After that, proceed as follows:

Add Bundle to bundles.php:

<?php

return [
    \DsWebCrawlerBundle\DsWebCrawlerBundle::class => ['all' => true],
];

Basic Setup

dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    always:
                        own_host_only: true
                    full_dispatch:
                        seed: 'http://your-domain.test'
                        valid_links:
                            - '@^http://your-domain.test.*@i'
                        user_invalid_links:
                            - '@^http://your-domain.test\/members.*@i'
                    single_dispatch:
                        host: 'http://your-domain.test'
                normalizer:
                    service: 'web_crawler_localized_resource_normalizer'

Provider Options

always

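| Name               | Default Value                    | Description |
|--------------------|----------------------------------|-------------|
| own_host_only      | false                            |             |
| allow_subdomains   | false                            |             |
| allow_query_in_url | false                            |             |
| allow_hash_in_url  | false                            |             |
| allowed_mime_types | ['text/html', 'application/pdf'] |             |
| allowed_schemes    | ['http']                         |             |
| content_max_size   | 0                                |             |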
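
As a rough sketch of how the `always` options listed above might be combined, the fragment below overrides a few defaults. The option names come from the table. The values (in particular `'https'` as an additional scheme) are illustrative assumptions, and the nesting simply mirrors the Basic Setup example.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    always:
                        # illustrative values only, not recommendations
                        own_host_only: true
                        allow_subdomains: true
                        allow_query_in_url: false
                        allow_hash_in_url: false
                        allowed_mime_types: ['text/html', 'application/pdf']
                        # assumption: 'https' is accepted alongside the default 'http'
                        allowed_schemes: ['http', 'https']
```

Options that are left out should fall back to the defaults shown in the table.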

full_dispatch

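| Name               | Default Value | Description |
|--------------------|---------------|-------------|
| seed               | null          |             |
| valid_links        | []            |             |
| user_invalid_links | []            |             |
| max_link_depth     | 15            |             |
| max_crawl_limit    | 0             |             |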
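
The following fragment is a minimal sketch of a `full_dispatch` block that sets every option from the table above. The domain and patterns are the placeholders from the Basic Setup example. The numeric limits are assumed values chosen purely for illustration, since the exact semantics of `max_link_depth` and `max_crawl_limit` (for instance, what `0` means) are not described here.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    full_dispatch:
                        # starting URL of the crawl
                        seed: 'http://your-domain.test'
                        # regular expressions a link has to match
                        valid_links:
                            - '@^http://your-domain.test.*@i'
                        # regular expressions for links to skip
                        user_invalid_links:
                            - '@^http://your-domain.test\/members.*@i'
                        # illustrative limits (assumptions, not recommendations)
                        max_link_depth: 5
                        max_crawl_limit: 500
```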

single_dispatch

| Name | Default Value | Description |
|------|---------------|-------------|
| host | null          |             |

Resource Normalizer

DefaultResourceNormalizer

Identifier: web_crawler_default_resource_normalizer
Normalize simple documents.
Options: none

LocalizedResourceNormalizer

Identifier: web_crawler_localized_resource_normalizer
Scaffold localized documents.

Options:

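| Name                         | Default Value                 | Allowed Type | Description                                                                        |
|------------------------------|-------------------------------|--------------|------------------------------------------------------------------------------------|
| locales                      | all pimcore enabled languages | array        |                                                                                    |
| skip_not_localized_documents | true                          | bool         | If false, an exception is thrown when a document/object has no valid locale        |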
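
A possible way to pass these options is sketched below. The `normalizer.service` key is taken from the Basic Setup example. The `options` key underneath it is an assumption modelled on the `data_provider` block, and the locale codes are placeholders.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                normalizer:
                    service: 'web_crawler_localized_resource_normalizer'
                    # assumption: normalizer options are nested under "options"
                    options:
                        # placeholder locales; defaults to all enabled Pimcore languages
                        locales: ['en', 'de']
                        # false: raise an exception for resources without a valid locale
                        skip_not_localized_documents: false
```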

Transformer

Scaffolder

HttpResponseHtmlDataScaffolder

Identifier: http_response_html_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type text/html.

HttpResponsePdfDataScaffolder

Identifier: http_response_pdf_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type application/pdf.

PimcoreElementScaffolder

Identifier: pimcore_element_scaffolder
Simple object scaffolder.
Supported types: Asset, Document, DataObject\Concrete.

Field Transformer

UriExtractor

Identifier: resource_uri_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none

LanguageExtractor

Identifier: resource_language_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none

MetaExtractor

Identifier: resource_meta_extractor
Supported Scaffolder: http_response_html_scaffolder

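Return Type: string|null
Options:

| Name | Default Value | Allowed Type | Description                                      |
|------|---------------|--------------|--------------------------------------------------|
| name | null          | string       | The name of the meta tag to fetch the value from |

HtmlTagExtractor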

Identifier: resource_html_tag_content_extractor
Supported Scaffolder: http_response_html_scaffolder

Return Type: string|null
Options: none

TextExtractor

Identifier: resource_text_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options:

| Name                            | Default Value          | Allowed Type | Description                                                  |
|---------------------------------|------------------------|--------------|--------------------------------------------------------------|
| content_start_indicator         | <!-- main-content -->  | string       | Marks the beginning of the indexable page content            |
| content_end_indicator           | <!-- /main-content --> | string       | Marks the end of the indexable page content                  |
| content_exclude_start_indicator | null                   | null\|string | Marks the beginning of the text to be excluded from indexing |
| content_exclude_end_indicator   | null                   | null\|string | Marks the end of the text to be excluded from indexing       |

TitleExtractor

Identifier: resource_title_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none


Copyright and License

Copyright: DACHCOM.DIGITAL
For licensing details please visit LICENSE.md

Upgrade Info

Before updating, please check our upgrade notes!