Standardize cookie handling #933

Mantisus · 2025-01-24T13:11:40Z

Currently we have 3 main cookie handling mechanisms depending on the HTTP client or browser, and none work correctly.

1. HttpxHttpClient.
   This solution is the closest to the expected behavior. Cookies are stored in the Session. However, we use a dict, which loses the cookie-domain relationship. This can cause issues during cross-domain crawling.
2. CurlImpersonateHttpClient.
   The Session knows nothing about cookies, and all cookies are stored at the AsyncSession level. As a result, if we don't use proxies, all sessions have identical cookies. If we work with proxies, cookies become tied to the proxy.
3. Playwright.
   The Session knows nothing about cookies, and all cookies are stored at the PlaywrightContext level, meaning all sessions working from one context will operate with the same cookies.
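To make the first point concrete, here is a minimal sketch of how a name-to-value dict drops the domain. The `flat_cookies` dict and the `store_flat` helper are hypothetical stand-ins for illustration only, not the actual Session code.

```python
# Hypothetical stand-in for name -> value cookie storage.
# This is NOT the real Session implementation, only an illustration.
flat_cookies: dict[str, str] = {}


def store_flat(domain: str, name: str, value: str) -> None:
    """Store a cookie by name only; the dict has nowhere to keep the domain."""
    flat_cookies[name] = value


# Two different sites set a cookie with the same name.
store_flat("a.example", "session_id", "token-for-a")
store_flat("b.example", "session_id", "token-for-b")

# Only one entry survives, and nothing records which domain it belongs to.
print(flat_cookies)
```

During cross-domain crawling, the surviving value would then be sent to both sites, which is the issue described in point 1.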


B4nan · 2025-01-27T10:59:17Z

@Mantisus thanks for bringing this up. Let's split it into 3 separate PRs please.

janbuchar · 2025-01-27T12:34:37Z

Regarding 1, this will probably involve changing the Session so that it uses a more sophisticated cookie jar implementation. I'm not sure if this can be made backwards compatible...

B4nan · 2025-01-27T12:47:40Z

I would focus on 3 first, that feels like the biggest issue to me. Scraping multiple domains in a single crawler is not a very common use case.

janbuchar · 2025-01-27T13:55:43Z

> I would focus on 3 first, that feels like the biggest issue to me. Scraping multiple domains in a single crawler is not a very common use case.

Agreed, even though 2 is similar in terms of severity (but yeah, Playwright is a bit more popular).

Mantisus · 2025-01-30T14:21:32Z

2 of the 3 are done. But for multi-domain cookie support, we'd really have to move to something like http.cookiejar.CookieJar instead of a dict.
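As a rough sketch of what the standard library's `http.cookiejar.CookieJar` gives over a dict: each cookie carries its own domain, so two sites can use the same cookie name without colliding. The `make_cookie` helper and the `a.example` / `b.example` domains are assumptions made up for this illustration; this is not the implementation proposed in any PR.

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def make_cookie(name: str, value: str, domain: str) -> Cookie:
    """Hypothetical helper: build a simple session cookie bound to a domain."""
    return Cookie(
        version=0,
        name=name,
        value=value,
        port=None,
        port_specified=False,
        domain=domain,
        domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/",
        path_specified=True,
        secure=False,
        expires=None,
        discard=True,
        comment=None,
        comment_url=None,
        rest={},
    )


jar = CookieJar()
jar.set_cookie(make_cookie("session_id", "token-for-a", "a.example"))
jar.set_cookie(make_cookie("session_id", "token-for-b", "b.example"))

# Both cookies are kept, each still attached to its own domain.
by_domain = {cookie.domain: cookie.value for cookie in jar}
print(by_domain)

# The jar only attaches cookies that match the request's host.
request = Request("http://a.example/")
jar.add_cookie_header(request)
print(request.get_header("Cookie"))
```

The trade-off is that callers can no longer treat cookies as a plain mapping, which is why backwards compatibility came up earlier in the thread.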

…in `PlaywrightCrawler` (#941)

### Description

- Improve cookie handling for `PlaywrightCrawler`. Cookies are now stored in the `Session` and set in Playwright Context from the `Session`.
- Add `use_incognito_pages` option for `browser_launch_options` allowing each new page to be launched in a separate context.

### Issues

- #722
- #933


github-actions bot added the t-tooling label (issues with this label are in the ownership of the tooling team) Jan 24, 2025

Mantisus changed the title ~~Standardize Cookie Handling~~ Standardize cookie handling Jan 24, 2025

Mantisus added the bug label (something isn't working) Jan 24, 2025

B4nan assigned Mantisus Jan 27, 2025

This was referenced Jan 29, 2025

feat: add support use_incognito_pages for browser_launch_options in PlaywrightCrawler #941

Merged

fix: fix CurlImpersonateHttpClient cookies handler #946

Merged

Mantisus mentioned this issue Feb 5, 2025

Breaking changes for v0.6 #906

Open

vdusek pushed a commit that referenced this issue Feb 5, 2025

fix: fix CurlImpersonateHttpClient cookies handler (#946)

ed415c4

### Description

- fix cookie handling. Behavior alignment with `HttpxHttpClient`.

### Issues

- #933

vdusek added this to the 107th sprint - Tooling team milestone Feb 5, 2025

Mantisus linked a pull request Feb 13, 2025 that will close this issue

refactor!: change Session cookies from dict to SessionCookies with CookieJar #984

Open
