feat!: Enable additional status codes arguments to PlaywrightCrawler #959

Open
wants to merge 6 commits into
base: master

Conversation

Pijukatel
Contributor

@Pijukatel Pijukatel commented Feb 5, 2025

Description

Add the additional_http_error_status_codes and ignore_http_error_status_codes arguments to PlaywrightCrawler.
Since these arguments now exist on all crawlers, move them to the BasicCrawler level.
Stop reading the status-code settings from _http_client attributes.

Breaking: Remove HttpCrawlerOptions, since it no longer has any options beyond those in BasicCrawlerOptions.
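
For readers unfamiliar with the two arguments, here is a minimal sketch of how they might combine into a single decision. It is not the library's implementation: the function name is_error_status is hypothetical, and both the default rule (only 5xx counts as an error) and the precedence (ignore wins over additional) are assumptions made for illustration.

```python
from __future__ import annotations

from collections.abc import Iterable


def is_error_status(
    status_code: int,
    additional: Iterable[int] | None = None,
    ignore: Iterable[int] | None = None,
) -> bool:
    """Decide whether a response status code should be treated as an error.

    Hypothetical helper. Assumptions: 5xx is an error by default, and a code
    listed in `ignore` is never an error, even if it is also in `additional`.
    """
    additional_codes = set(additional) if additional else set()
    ignored_codes = set(ignore) if ignore else set()

    if status_code in ignored_codes:
        return False
    if status_code in additional_codes:
        return True
    # Assumed default rule: server errors only.
    return 500 <= status_code <= 599
```

Under these assumptions, is_error_status(403, additional=[403]) is true, while is_error_status(500, ignore=[500]) is false.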

Issues

Since they exist now on all crawlers, move them to basic crawler level.
@github-actions github-actions bot added this to the 107th sprint - Tooling team milestone Feb 5, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Feb 5, 2025
@Pijukatel Pijukatel added the enhancement New feature or request. label Feb 6, 2025
@Pijukatel Pijukatel marked this pull request as ready for review February 6, 2025 12:15
@Pijukatel Pijukatel requested review from vdusek, Mantisus and janbuchar and removed request for vdusek and Mantisus February 6, 2025 12:15

if self._http_client.additional_blocked_status_codes != self._additional_http_error_status_codes:
    raise ValueError(
        'Used `additional_blocked_status_codes` argument does not match with with '
Collaborator

double with

Sorry, can't commit with the quick fix due to limited permissions.

Comment on lines +292 to +313
self._additional_http_error_status_codes = (
    set(additional_http_error_status_codes) if additional_http_error_status_codes else set()
)
self._ignore_http_error_status_codes = (
    set(ignore_http_error_status_codes) if ignore_http_error_status_codes else set()
)

self._http_client = http_client or HttpxHttpClient(
    additional_http_error_status_codes=self._additional_http_error_status_codes,
    ignore_http_error_status_codes=self._ignore_http_error_status_codes,
)

if self._http_client.additional_blocked_status_codes != self._additional_http_error_status_codes:
    raise ValueError(
        'Used `additional_blocked_status_codes` argument does not match with '
        f'{self._http_client.additional_blocked_status_codes=}. They have to be the same.'
    )
if self._http_client.ignore_http_error_status_codes != self._ignore_http_error_status_codes:
    raise ValueError(
        'Used `ignore_http_error_status_codes` argument does not match with '
        f'{self._http_client.ignore_http_error_status_codes=}. They have to be the same.'
    )
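
The hunk above follows a normalize-then-verify pattern: convert the arguments to sets, build a default client from them, then reject a user-supplied client whose settings disagree. A self-contained sketch of that pattern follows. StubHttpClient and StubCrawler are hypothetical stand-ins rather than the library's classes, and the stub's attribute names are chosen for illustration only.

```python
from __future__ import annotations

from collections.abc import Iterable
from dataclasses import dataclass, field


@dataclass
class StubHttpClient:
    """Hypothetical stand-in for an HTTP client (not the library's class)."""

    additional_http_error_status_codes: set[int] = field(default_factory=set)
    ignore_http_error_status_codes: set[int] = field(default_factory=set)


class StubCrawler:
    """Hypothetical crawler mirroring the normalize-then-verify pattern."""

    def __init__(
        self,
        *,
        http_client: StubHttpClient | None = None,
        additional_http_error_status_codes: Iterable[int] | None = None,
        ignore_http_error_status_codes: Iterable[int] | None = None,
    ) -> None:
        # Normalize both arguments to sets (None becomes an empty set).
        self.additional = set(additional_http_error_status_codes or ())
        self.ignore = set(ignore_http_error_status_codes or ())

        # Build a default client from the same values if none was supplied.
        self.http_client = http_client or StubHttpClient(
            additional_http_error_status_codes=set(self.additional),
            ignore_http_error_status_codes=set(self.ignore),
        )

        # Reject a supplied client whose settings disagree with the crawler's.
        if self.http_client.additional_http_error_status_codes != self.additional:
            raise ValueError('additional_http_error_status_codes differs between crawler and client.')
        if self.http_client.ignore_http_error_status_codes != self.ignore:
            raise ValueError('ignore_http_error_status_codes differs between crawler and client.')
```

In this sketch, passing a pre-built client together with non-matching crawler arguments fails fast at construction time instead of producing inconsistent behavior later.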
Collaborator

Couldn't we just keep them only in the http_client instance? (PW Crawler has HTTP client as well)

Contributor Author

@Pijukatel Pijukatel Feb 11, 2025

I considered that option, but it felt like misuse to me, especially for PlaywrightCrawler. PlaywrightCrawler does not use the HTTP client for page.navigate, so it would be really strange for it to rely on an attribute of this unrelated component to decide whether the response status code of page.navigate is OK.
(Mentioned in #953 (comment))

But I see that it looks like unnecessary code duplication, so I am not 100% happy with this either.
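
To illustrate the argument in this reply, here is a rough sketch of a browser-style crawler that decides on a navigation status code using only its own attributes, without consulting any HTTP client. BrowserStyleCrawler, NavigationStatusError, and check_navigation_status are hypothetical names, and the default 5xx rule is an assumption, not the library's documented behavior.

```python
from __future__ import annotations

from collections.abc import Iterable


class NavigationStatusError(Exception):
    """Hypothetical error raised for a navigation response treated as failed."""


class BrowserStyleCrawler:
    """Hypothetical crawler that keeps the status-code settings on itself."""

    def __init__(
        self,
        *,
        additional_http_error_status_codes: Iterable[int] | None = None,
        ignore_http_error_status_codes: Iterable[int] | None = None,
    ) -> None:
        self._additional = set(additional_http_error_status_codes or ())
        self._ignore = set(ignore_http_error_status_codes or ())

    def check_navigation_status(self, status_code: int) -> None:
        """Raise if the navigation response status counts as an error."""
        if status_code in self._ignore:
            return
        # Assumed default: 5xx is an error unless explicitly ignored.
        if status_code in self._additional or 500 <= status_code <= 599:
            raise NavigationStatusError(f'Navigation failed with status {status_code}.')
```

The trade-off raised in the thread still applies: the crawler and any HTTP client it owns would each hold a copy of the same two sets.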

Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ignore_http_error_status_codes and additional_http_error_status_codes arguments to PlaywrightCrawler
3 participants