Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize cookie handling #933

Open
Mantisus opened this issue Jan 24, 2025 · 5 comments · May be fixed by #984
Open

Standardize cookie handling #933

Mantisus opened this issue Jan 24, 2025 · 5 comments · May be fixed by #984
Assignees
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@Mantisus
Copy link
Collaborator

Currently we have 3 main cookie handling mechanisms depending on the HTTP client or browser, and none work correctly.

  1. HttpxHttpClient.
    This solution is closest to expected. Cookies are stored in Session. However, we use dict which loses the cookie-domain relationship. This can cause issues during cross-domain crawling.

  2. CurlImpersonateHttpClient.
    Session knows nothing about cookies and all cookies are stored at the AsyncSession level. As a result, if we don't use proxies, all sessions have identical cookies. If we work with proxies, cookies become tied to the proxy.

  3. Playwright.
    Session knows nothing about cookies and all cookies are stored at the PlaywrightContext level, meaning all sessions working from one context will operate with the same cookies.

@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 24, 2025
@Mantisus Mantisus changed the title Standardize Cookie Handling Standardize cookie handling Jan 24, 2025
@Mantisus Mantisus added the bug Something isn't working. label Jan 24, 2025
@B4nan
Copy link
Member

B4nan commented Jan 27, 2025

@Mantisus thanks for bringing this up. Let's split it into 3 separate PRs please.

@janbuchar
Copy link
Collaborator

Regarding 1, this will probably involve changing the Session so that it uses a more sophisticated cookie jar implementation. I'm not sure if this can be made backwards compatible...

@B4nan
Copy link
Member

B4nan commented Jan 27, 2025

I would focus on 3 first, that feels like the biggest issue to me. Scraping multiple domains in a single crawler is not a very common use case.

@janbuchar
Copy link
Collaborator

I would focus on 3 first, that feels like the biggest issue to me. Scraping multiple domains in a single crawler is not a very common use case.

Agreed, even though 2. is similar in terms of severity (but yeah, playwright is a bit more popular)

@Mantisus
Copy link
Collaborator Author

Mantisus commented Jan 30, 2025

2 of 3. But for multi-domain cookie support, we'd really have to go to something like http.cookiejar.CookieJar instead of dict.

vdusek pushed a commit that referenced this issue Feb 5, 2025
…in `PlaywrightCrawler` (#941)

### Description

- Improve cookie handling for `PlaywrightCrawler`. Cookies are now
stored in the `Session` and set in Playwright Context from the
`Session`.
- Add `use_incognito_pages` option for `browser_launch_options` allowing
each new page to be launched in a separate context.

### Issues

- #722 
- #933
vdusek pushed a commit that referenced this issue Feb 5, 2025
### Description

- fix cookie handling. Behavior alignment with `HttpxHttpClient`.

### Issues

- #933
@vdusek vdusek added this to the 107th sprint - Tooling team milestone Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants