A spider crawler extension for Pimcore Dynamic Search.
Release | Supported Pimcore Versions | Supported Symfony Versions | Release Date | Maintained | Branch |
---|---|---|---|---|---|
3.x | 11.0 | ^6.2 | 28.09.2023 | Feature Branch | master |
2.x | 10.0 - 10.6 | ^5.4 | 19.12.2021 | No | 2.x |
1.x | 6.6 - 6.9 | ^4.4 | 18.04.2021 | No | 1.x |
"require" : {
    "dachcom-digital/dynamic-search" : "~3.0.0",
    "dachcom-digital/dynamic-search-data-provider-crawler" : "~3.0.0"
}
You need to install and enable the Dynamic Search Bundle first. Read more about it here. After that, proceed as follows:
Add Bundle to bundles.php:
<?php
return [
    \DsWebCrawlerBundle\DsWebCrawlerBundle::class => ['all' => true],
];
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    always:
                        own_host_only: true
                    full_dispatch:
                        seed: 'http://your-domain.test'
                        valid_links:
                            - '@^http://your-domain.test.*@i'
                        user_invalid_links:
                            - '@^http://your-domain.test\/members.*@i'
                    single_dispatch:
                        host: 'http://your-domain.test'
                normalizer:
                    service: 'web_crawler_localized_resource_normalizer'
Options for always:
Name | Default Value | Description |
---|---|---|
own_host_only | false | |
allow_subdomains | false | |
allow_query_in_url | false | |
allow_hash_in_url | false | |
allowed_mime_types | ['text/html', 'application/pdf'] | |
allowed_schemes | ['http'] | |
content_max_size | 0 | |
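
As a rough sketch, the options listed above sit under the `always` key of the data provider options, next to `own_host_only` in the configuration example. The values below are placeholders chosen for illustration only, not recommended settings:

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    always:
                        own_host_only: true
                        allow_subdomains: false
                        # the default is ['http']; 'https' is added here as an example value
                        allowed_schemes: ['http', 'https']
                        # example value: restrict the crawler to HTML resources
                        allowed_mime_types: ['text/html']
```

Options that are left out keep the defaults listed in the table.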
Options for full_dispatch:
Name | Default Value | Description |
---|---|---|
seed | null | |
valid_links | [] | |
user_invalid_links | [] | |
max_link_depth | 15 | |
max_crawl_limit | 0 | |
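
The options above belong to the `full_dispatch` key, as `seed` and `valid_links` do in the configuration example. A minimal sketch of how they could be combined follows; the domain and the numeric limits are placeholder values for illustration, not recommendations:

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    full_dispatch:
                        seed: 'http://your-domain.test'
                        valid_links:
                            - '@^http://your-domain.test.*@i'
                        user_invalid_links:
                            - '@^http://your-domain.test\/members.*@i'
                        # example value; the default is 15
                        max_link_depth: 5
                        # example value; the default is 0
                        max_crawl_limit: 500
```

Note that the link patterns are written as delimited regular expressions (here with `@` as the delimiter and the `i` flag), as in the configuration example.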
Options for single_dispatch:
Name | Default Value | Description |
---|---|---|
host | null | |
Identifier: web_crawler_default_resource_normalizer
Normalize simple documents
Options: none
Identifier: web_crawler_localized_resource_normalizer
Scaffold localized documents
Options:
Name | Default Value | Allowed Type | Description |
---|---|---|---|
locales | all languages enabled in Pimcore | array | |
skip_not_localized_documents | true | bool | If false, an exception is thrown if a document/object has no valid locale |
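
A minimal sketch of passing these options to the localized normalizer. It assumes that normalizer options are set under an `options` key next to `service`; this placement is an assumption and is not confirmed by this README, so please check it against the Dynamic Search documentation. The locale values are placeholders:

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                normalizer:
                    service: 'web_crawler_localized_resource_normalizer'
                    # assumption: normalizer options live under "options"
                    options:
                        # example values; the default is all languages enabled in Pimcore
                        locales: ['en', 'de']
                        skip_not_localized_documents: true
```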
Identifier: http_response_html_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type text/html.
Identifier: http_response_pdf_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type application/pdf.
Identifier: pimcore_element_scaffolder
Simple object scaffolder.
Supported types: Asset, Document, DataObject\Concrete.
Identifier: resource_uri_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options: none
Identifier: resource_language_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options: none
Identifier: resource_meta_extractor
Supported Scaffolder: http_response_html_scaffolder
Return Type: string|null
Options:
Name | Default Value | Allowed Type | Description |
---|---|---|---|
name | null | string | The name of the meta tag to fetch the value from |
Identifier: resource_html_tag_content_extractor
Supported Scaffolder: http_response_html_scaffolder
Return Type: string|null
Options: none
Identifier: resource_text_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options:
Name | Default Value | Allowed Type | Description |
---|---|---|---|
content_start_indicator | <!-- main-content --> | string | Marks the beginning of the indexable page content |
content_end_indicator | <!-- /main-content --> | string | Marks the end of the indexable page content |
content_exclude_start_indicator | null | null|string | Marks the beginning of the text to be excluded from indexing |
content_exclude_end_indicator | null | null|string | Marks the end of the text to be excluded from indexing |
Identifier: resource_title_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options: none
Copyright: DACHCOM.DIGITAL
For licensing details please visit LICENSE.md
Before updating, please check our upgrade notes!