[Documentation] WebDriverHttpFetcher capabilities examples #1017
Hello @jetnet, WebDrivers are maintained by their publishers, independently of our web crawler, and their capabilities vary. You'll need to refer to your specific web driver implementation for configuration details. Luckily, you can configure your WebDriverHttpFetcher to use any "capabilities" offered by your driver via:
<capabilities>
<capability name="(capability name)">(capability value)</capability>
<!-- multiple "capability" tags allowed -->
</capabilities>
On the other hand, whether your web driver supports everything you are after remains to be seen. I suggest you check with the community behind your web driver. For instance, Mozilla seems to support "trust all certificates" as described here. It would translate to this in your config:
<capability name="acceptInsecureCerts">true</capability>
For proxy settings, you can refer to #799. Depending on your WebDriver, something like this may do it:
<capability name="proxy.proxyType">MANUAL</capability>
<capability name="proxy.httpProxy">proxy_address:proxy_port</capability>
... and so on ...
For setting the user agent and custom headers, I am not sure what your webdriver can do, but in any case, you can configure the "httpSniffer" within your fetcher to set those:
<httpSniffer>
<userAgent>(optionally overwrite browser user agent)</userAgent>
<headers>
<!-- You can repeat this header tag as needed. -->
<header name="(header name)">(header value)</header>
</headers>
</httpSniffer>
More elaborate options are also possible, like using Java to create your own modified version of the WebDriverHttpFetcher, or injecting JavaScript into the crawled pages to perform extra customization. Refer to WebDriverHttpFetcher for other configuration options. |
Hello Pascal, thank you very much for the quick reply. I was not able to set up any capability using the latest Selenium docker images (firefox or chrome). |
Finally, I made some progress on that, using a Docker image:
docker run -d \
--network=host \
--add-host host:127.0.0.1 \
-e SE_START_XVFB=false \
-e SE_SESSION_REQUEST_TIMEOUT=60 \
-e SE_NODE_MAX_SESSIONS=5 \
--shm-size="2g" --name firefox selenium/standalone-firefox
Notes:
The issues follow, one issue per comment. Please let me know if I should open dedicated tickets for them. |
Collector opens two sessions to the webdriver but closes only one; the remaining session is blocked, waiting to time out:
Expected behavior: the first (main?) thread should shut down the web driver as well. |
HttpSniffer configures its own proxy for the webdriver. How to use a "real" proxy then?
This would probably be an enhancement request: HttpProxy should support an external (customer's) proxy server. Crawled URL: https://ifconfig.io/all
I tried to set the standard proxy env vars, hoping that the HttpSniffer's libs would use them, but it did not help:
export http_proxy=http://local-proxy:8118
export https_proxy=http://local-proxy:8118
The content field shows the Remote address, which is NOT from the configured proxy. |
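One thing that may be worth trying here: the JVM's standard networking stack is configured through system properties such as `http.proxyHost` and `https.proxyPort`, not through `http_proxy`-style environment variables. The sketch below only converts a proxy URL into the matching `-D` flags. The helper name `jvm_proxy_flags` is made up for illustration, and whether the HttpSniffer's internal proxy honors these properties is an assumption this thread does not confirm.

```python
from urllib.parse import urlsplit


def jvm_proxy_flags(proxy_url, schemes=("http", "https")):
    """Turn a proxy URL into -D flags for the JVM's standard proxy properties.

    Hypothetical helper: it only formats flags, it does not check that the
    crawler or the HttpSniffer actually reads them.
    """
    parts = urlsplit(proxy_url)
    if not parts.hostname or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {proxy_url!r}")
    flags = []
    for scheme in schemes:
        flags.append(f"-D{scheme}.proxyHost={parts.hostname}")
        flags.append(f"-D{scheme}.proxyPort={parts.port}")
    return flags


# Same proxy address as in the export lines above.
print(" ".join(jvm_proxy_flags("http://local-proxy:8118")))
```

The printed flags would go on the `java` command line next to the other `-D` options.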
HttpSniffer does not use the provided Trusted cert store
Norconex start params:
java -Dlog4j.configurationFile="file:${EXT_CONFIG_DIR}/test/log4j2.xml" \
-Xms2G -Xmx10G \
-Dnashorn.args=--no-deprecation-warning \
-Djavax.net.ssl.trustStore=${TRUST_STORE} \
-Dfile.encoding=UTF8 -Duser.country=US -Duser.language=en \
-cp "${HTTP_DIR}/lib/*:${HTTP_DIR}/classes:${ES_DIR}/lib/*:${EXT_LIB}/*" \
com.norconex.collector.http.HttpCollector "$@"
I thought it used the default (standard) trust store path:
and copied my custom
|
HttpSniffer ignore
|
Firefox capabilities ignored
Configuration:
<httpFetchers>
<fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
<browser>firefox</browser> <!-- NOTE: there must be more than one node session available !!! -->
<remoteURL>http://localhost:4444</remoteURL>
<capabilities>
<capability name="TESTCAP">TESTVAL</capability>
<capability name="general.useragent.override">HTTP-Collector</capability>
<capability name="network.proxy.type">1</capability>
<capability name="network.proxy.http">192.168.178.23</capability>
<capability name="network.proxy.http_port">8118</capability>
</capabilities>
</fetcher>
</httpFetchers>
Web driver session info: http://localhost:4444/status
"session": {
"capabilities": {
"acceptInsecureCerts": true,
"browserName": "firefox",
"browserVersion": "126.0",
"moz:accessibilityChecks": false,
"moz:buildID": "20240509170740",
"moz:debuggerAddress": "127.0.0.1:27816",
"moz:firefoxOptions": {
"args": [
"-headless"
],
"profile": "...base64 encoded zipped user.js ..."
},
Encoded and unzipped |
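For anyone who wants to repeat this check, here is a minimal sketch of reading `user.js` out of the `profile` value, assuming (as described above) that it is a base64-encoded zip archive containing a `user.js` entry. The function name `read_profile_file` and the demo profile are made up for illustration. The demo archive is built in memory so the snippet runs without a live session.

```python
import base64
import io
import zipfile


def read_profile_file(encoded_profile, member="user.js"):
    """Decode a base64 string, open it as a zip, and return one entry as text."""
    raw = base64.b64decode(encoded_profile)
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return archive.read(member).decode("utf-8")


def build_demo_profile():
    """Build a stand-in profile value (illustration only, not real session data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("user.js", 'user_pref("network.proxy.type", 1);\n')
    return base64.b64encode(buffer.getvalue()).decode("ascii")


print(read_profile_file(build_demo_profile()))
```

With a real session, the argument would be the `profile` string copied from the status page. If the preferences set in the collector config do not show up in the decoded `user.js`, they never reached the browser profile.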
The suggested config
<capability name="proxy.proxyType">MANUAL</capability>
<capability name="proxy.httpProxy">proxy_address:proxy_port</capability>
has no effect on the created session (http://localhost:4444/status):
"session": {
"capabilities": {
...
"proxy": {
},
... |
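A possible explanation for the empty `proxy` entry: the W3C WebDriver specification describes `proxy` as a single capability whose value is a JSON object (with members such as `proxyType`, `httpProxy`, `sslProxy` and `noProxy`), not as separate capabilities with dotted names. The sketch below shows what the nested form of the dotted names would look like. The function `nest_capabilities` is hypothetical, and nothing in this thread confirms that WebDriverHttpFetcher performs such an expansion.

```python
import json


def nest_capabilities(flat):
    """Expand dotted capability names into nested dictionaries.

    Illustration only: every dot is treated as a nesting level, so this must
    not be applied to Firefox preference names such as "network.proxy.type".
    """
    nested = {}
    for dotted_name, value in flat.items():
        *parents, leaf = dotted_name.split(".")
        node = nested
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


flat_config = {
    "proxy.proxyType": "MANUAL",
    "proxy.httpProxy": "proxy_address:proxy_port",
    "acceptInsecureCerts": True,
}
print(json.dumps(nest_capabilities(flat_config), indent=2))
```

Two details from the specification are worth keeping in mind when comparing with the session output: it lists the proxy type values in lowercase (for example `manual`), and `noProxy` is a list rather than a comma-separated string.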
Chrome webdriver tests
Start local proxy, e.g. (listen on
docker run --name='tor-privoxy' -d \
--network=host \
dockage/tor-privoxy:latest
Start Chrome webdriver (listen on
docker run -d \
--network=host \
--add-host host:127.0.0.1 \
-e SE_START_XVFB=false \
-e SE_SESSION_REQUEST_TIMEOUT=60 \
-e SE_NODE_MAX_SESSIONS=5 \
--shm-size="2g" --name chrome selenium/standalone-chrome |
Chrome webdriver - no custom capabilities can be set at all
Tried:
<browser>chrome</browser>
<remoteURL>http://localhost:4444</remoteURL>
<capabilities>
<capability name="TESTCAP">TESTVAL</capability>
<capability name="proxy.http">http://localhost:8118</capability>
<capability name="proxy.https">http://localhost:8118</capability>
<capability name="proxy.no_proxy">localhost,127.0.0.1</capability>
</capabilities>
and
<browser>chrome</browser>
<remoteURL>http://localhost:4444</remoteURL>
<capabilities>
<capability name="TESTCAP">TESTVAL</capability>
<capability name="proxy.httpProxy">localhost:8118</capability>
<capability name="proxy.sslProxy">localhost:8118</capability>
<capability name="proxy.noProxy">localhost,127.0.0.1</capability>
<capability name="proxy.proxyType">MANUAL</capability>
</capabilities>
In both cases, the webdriver status page (http://localhost:4444) does not show any of the capabilities above. NOTE: |
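To make this comparison repeatable, the check against the status page can be scripted. The sketch below walks a status document, collects every `capabilities` object found under a `session` key, and reports which requested names are absent. The function names and the sample document are assumptions for illustration. The real `/status` payload may be shaped differently, which is why the search is recursive instead of relying on a fixed path.

```python
import json


def session_capabilities(node):
    """Yield each capabilities dict found under a "session" key, at any depth."""
    if isinstance(node, dict):
        session = node.get("session")
        if isinstance(session, dict) and isinstance(session.get("capabilities"), dict):
            yield session["capabilities"]
        for value in node.values():
            yield from session_capabilities(value)
    elif isinstance(node, list):
        for item in node:
            yield from session_capabilities(item)


def missing_capabilities(status, requested):
    """For each session found, list the requested names that are not present."""
    return [
        [name for name in requested if name not in capabilities]
        for capabilities in session_capabilities(status)
    ]


# Made-up sample loosely modeled on the excerpts in this thread.
SAMPLE_STATUS = json.loads("""
{"value": {"nodes": [{"slots": [{"session": {"capabilities": {
    "acceptInsecureCerts": true, "browserName": "chrome", "proxy": {}
}}}]}]}}
""")

print(missing_capabilities(SAMPLE_STATUS, ["TESTCAP", "proxy", "acceptInsecureCerts"]))
```

Against a running grid, the input could be loaded with `json.load(urllib.request.urlopen("http://localhost:4444/status"))`. Note that the check is by name only, so an empty `proxy` object still counts as present.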
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
It would be great if the documentation had examples of how to set various web-driver capabilities, in particular:
true
,false
If someone has working examples, could you please share them here?
Thank you!