Privacy 2024 queries #3653

Open
wants to merge 114 commits into
base: main
251654f
readme
max-ostapenko May 3, 2024
537aa53
copied 2022 SQLs over to update/review
max-ostapenko May 3, 2024
237a280
fixed link
max-ostapenko May 3, 2024
66cccb2
origin trials
max-ostapenko Jun 9, 2024
736a7ab
Bump puppeteer from 22.7.1 to 22.8.0 in /src (#3655)
dependabot[bot] May 7, 2024
86506a3
notebook + readme (#3652)
max-ostapenko May 7, 2024
99aee63
Bump pytest from 8.1.1 to 8.2.0 in /src (#3651)
dependabot[bot] May 7, 2024
f76f679
Translation of privacy chapter to Japanese (#3654)
ksakae1216 May 7, 2024
c29735a
Update Timestamps (#3657)
github-actions[bot] May 7, 2024
47811ca
2023 Performance (#3525)
rviscomi May 15, 2024
46dabdc
Bump puppeteer from 22.8.0 to 22.9.0 in /src (#3662)
dependabot[bot] May 19, 2024
743f560
Upgrade to web-vitals v4 (#3661)
rviscomi May 20, 2024
0983d6c
Bump pytest from 8.2.0 to 8.2.1 in /src (#3664)
dependabot[bot] May 21, 2024
42e4599
--- (#3665)
dependabot[bot] May 22, 2024
22ec347
Bump puppeteer from 22.9.0 to 22.10.0 in /src (#3668)
dependabot[bot] May 24, 2024
1eb5873
Bump jsdom from 24.0.0 to 24.1.0 in /src (#3669)
dependabot[bot] May 27, 2024
c566929
Typofix (#3670)
borisschapira May 28, 2024
2d4bceb
SQL and MD folders the 2024 Web Almanac (#3666)
ChrisBeeti May 29, 2024
2c87cf2
Bump prettier from 3.2.5 to 3.3.0 in /src (#3672)
dependabot[bot] Jun 4, 2024
d60de8c
Bump pytest from 8.2.1 to 8.2.2 in /src (#3673)
dependabot[bot] Jun 5, 2024
4f925b8
Bump prettier from 3.3.0 to 3.3.1 in /src (#3674)
dependabot[bot] Jun 5, 2024
846c710
Fix loaf monitoring bug (#3675)
tunetheweb Jun 5, 2024
524c51c
Update Timestamps (#3677)
github-actions[bot] Jun 5, 2024
6f441ff
Bump web-vitals from 4.0.1 to 4.1.0 in /src (#3678)
dependabot[bot] Jun 7, 2024
689d2ef
fixed link
max-ostapenko May 3, 2024
dd21cf0
remove unreviewed sql
max-ostapenko Jun 9, 2024
be50c36
Merge branch 'main' into privacy-sql-2024
max-ostapenko Jun 9, 2024
320ebbe
lint test
max-ostapenko Jun 9, 2024
21f612e
lint
max-ostapenko Jun 9, 2024
d0c2c35
ads supply graph
max-ostapenko Jul 14, 2024
27511d0
lint
max-ostapenko Jul 14, 2024
8cd1e83
close file
max-ostapenko Jul 14, 2024
da015db
lint
max-ostapenko Jul 14, 2024
b7179e4
top_direct_sellers
max-ostapenko Jul 20, 2024
4de4c61
ads_txt_lines_histogram
max-ostapenko Jul 20, 2024
de249eb
ads_txt_seller_accounts_by_type
max-ostapenko Jul 20, 2024
ddf2ba8
top_ads_variables
max-ostapenko Jul 20, 2024
4e24a59
format
max-ostapenko Jul 20, 2024
3552776
tcf2
max-ostapenko Jul 21, 2024
5cc3695
rename
max-ostapenko Jul 21, 2024
4796653
lint
max-ostapenko Jul 21, 2024
cd6cac0
using custom_metrics
max-ostapenko Jul 25, 2024
9d11fcc
most_common_cname_domains
max-ostapenko Jul 25, 2024
ab54d6a
adguard list
max-ostapenko Aug 4, 2024
d9242dd
gpc
max-ostapenko Aug 4, 2024
17c4455
referrer policy
max-ostapenko Aug 4, 2024
52d57b5
usp
max-ostapenko Aug 4, 2024
234ef27
iab frameworks
max-ostapenko Aug 5, 2024
0bce587
lint
max-ostapenko Aug 5, 2024
8c60240
bounce trackers
max-ostapenko Aug 5, 2024
b1b47bc
Added privacy sandbox related queries
Yash-Vekaria Aug 13, 2024
14136ae
lint
Yash-Vekaria Aug 13, 2024
d6b1db4
missed lint
Yash-Vekaria Aug 13, 2024
a83f88d
dnt
max-ostapenko Aug 14, 2024
cf99788
client hints
max-ostapenko Aug 14, 2024
7fc52f4
whotracksme update
max-ostapenko Aug 14, 2024
95dd276
lint
max-ostapenko Aug 14, 2024
23ce85b
referrer policy
max-ostapenko Aug 14, 2024
27e4d43
rank filter removed
max-ostapenko Aug 14, 2024
109e807
trackers
max-ostapenko Aug 15, 2024
d41de3d
util deps
max-ostapenko Aug 15, 2024
266fa78
limits
max-ostapenko Aug 15, 2024
b90332f
Privacy 2024 queries - CCPA, fingerprinting, cookies (#3720)
bstandaert-wustl Aug 15, 2024
29cccaf
bq to sheets updates
max-ostapenko Aug 15, 2024
b35e6de
query optimisation
max-ostapenko Aug 15, 2024
22a21ef
downgrade for python 3.8
max-ostapenko Aug 15, 2024
7ea017b
more categories
max-ostapenko Aug 15, 2024
ff429ff
more categories and columns reordered
max-ostapenko Aug 15, 2024
5afff7c
forms and formatted logs
max-ostapenko Aug 15, 2024
37c42d3
Refactoring queries to produce output for queries only
Yash-Vekaria Aug 15, 2024
0d39f6b
lint
max-ostapenko Aug 16, 2024
1c4e468
Merge branch 'main' into privacy-sql-2024
max-ostapenko Aug 16, 2024
a239c25
lint
max-ostapenko Aug 16, 2024
baf490d
Privacy Sql Tracking Detection Using Easylist Adservers (#3730)
hadiamjad Aug 16, 2024
4ab293f
log query errors
max-ostapenko Aug 17, 2024
3fb692e
Fixed privacy sandbox attestation query bug
Yash-Vekaria Aug 17, 2024
58dac23
maximum_bytes_billed parameter
max-ostapenko Aug 17, 2024
6f99ae6
moved to chapter root
max-ostapenko Aug 17, 2024
0b4898d
postpone dryrun check
max-ostapenko Aug 17, 2024
5445b92
fingerprinting_most_common_apis: improve resilience to malformed JSON…
bstandaert-wustl Aug 17, 2024
dac1167
optional maximum_bytes_billed parameter
max-ostapenko Aug 17, 2024
3d8cb6d
formatting
max-ostapenko Aug 18, 2024
e8a032a
queries and notebook updates
max-ostapenko Aug 18, 2024
82c084e
queries to rerun
max-ostapenko Aug 18, 2024
ed8944c
origin trials function fix
max-ostapenko Aug 19, 2024
bc6a045
optimised sellers count
max-ostapenko Aug 19, 2024
a917161
apps included in ads.txt lines
max-ostapenko Aug 19, 2024
c51a3e7
another rerun
max-ostapenko Aug 19, 2024
2792d67
lint
max-ostapenko Aug 19, 2024
b2a7f4f
no origins
max-ostapenko Aug 20, 2024
51a71f0
optimized perf
max-ostapenko Aug 20, 2024
23a72c7
more optimized perf
max-ostapenko Aug 20, 2024
c8450a0
graph optimization and OT expiration
max-ostapenko Aug 21, 2024
17ded3e
Merge remote-tracking branch 'origin/main' into privacy-sql-2024
max-ostapenko Aug 21, 2024
e29a3eb
earlier grouping for performance
max-ostapenko Aug 21, 2024
975f7c8
graph fixes
max-ostapenko Aug 21, 2024
fda33dd
cookies, ccpa, fingerprinting: calculate percent of total pages
bstandaert-wustl Aug 22, 2024
0c30a7a
query for top third-party cookie names
bstandaert-wustl Aug 24, 2024
cfde873
bq writer module
max-ostapenko Sep 18, 2024
ac6e895
add grouping
max-ostapenko Oct 1, 2024
fe31518
domain suffixes and regexes removed
max-ostapenko Oct 1, 2024
741b655
Merge remote-tracking branch 'origin/main' into privacy-sql-2024
max-ostapenko Oct 28, 2024
760ebed
add comments
max-ostapenko Oct 30, 2024
522ab70
review
max-ostapenko Oct 30, 2024
3d5a9cb
add PR link
max-ostapenko Oct 30, 2024
46390a5
lint
max-ostapenko Oct 30, 2024
b98454b
remove mobile filter
max-ostapenko Oct 30, 2024
858324e
lint
max-ostapenko Oct 30, 2024
129e36f
lint
max-ostapenko Oct 30, 2024
9bd5ea4
disable import-error rule
max-ostapenko Oct 30, 2024
dd0357a
adguard not used
max-ostapenko Oct 31, 2024
c374995
linting
max-ostapenko Oct 31, 2024
a81781c
pages_pct in query
max-ostapenko Oct 31, 2024
a646f8e
lint
max-ostapenko Oct 31, 2024
64 changes: 64 additions & 0 deletions sql/2024/privacy/ads_accounts_distribution.sql
@@ -0,0 +1,64 @@
WITH publishers AS (
SELECT
page,
JSON_QUERY(custom_metrics, '$.ads.ads.account_types') AS ads_account_types,
JSON_QUERY(custom_metrics, '$.ads.app_ads.account_types') AS app_ads_account_types
FROM `httparchive.all.pages`
WHERE date = '2024-06-01' AND
is_root_page = TRUE AND
(CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_count') AS INT64) > 0 OR
CAST(JSON_VALUE(custom_metrics, '$.ads.app_ads.account_count') AS INT64) > 0)
), ads_accounts AS (
SELECT
page,
CEIL(CAST(JSON_VALUE(ads_account_types, '$.direct.account_count') AS INT64) / 100) * 100 AS direct_account_count_bucket,
CEIL(CAST(JSON_VALUE(ads_account_types, '$.reseller.account_count') AS INT64) / 100) * 100 AS reseller_account_count_bucket,
COUNT(DISTINCT page) OVER () AS total_pages
FROM publishers
), app_ads_accounts AS (
SELECT
page,
CEIL(CAST(JSON_VALUE(app_ads_account_types, '$.direct.account_count') AS INT64) / 100) * 100 AS direct_account_count_bucket,
CEIL(CAST(JSON_VALUE(app_ads_account_types, '$.reseller.account_count') AS INT64) / 100) * 100 AS reseller_account_count_bucket,
COUNT(DISTINCT page) OVER () AS total_pages
FROM publishers
)

SELECT
'ads' AS source,
'direct' AS account_type,
direct_account_count_bucket AS account_count_bucket,
COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
COUNT(DISTINCT page) AS number_of_pages
FROM ads_accounts
GROUP BY source, direct_account_count_bucket
UNION ALL
SELECT
'ads' AS source,
'reseller' AS account_type,
reseller_account_count_bucket AS account_count_bucket,
COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
COUNT(DISTINCT page) AS number_of_pages
FROM ads_accounts
GROUP BY source, reseller_account_count_bucket
UNION ALL
SELECT
'app_ads' AS source,
'direct' AS account_type,
direct_account_count_bucket AS account_count_bucket,
COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
COUNT(DISTINCT page) AS number_of_pages
FROM app_ads_accounts
GROUP BY source, direct_account_count_bucket
UNION ALL
SELECT
'app_ads' AS source,
'reseller' AS account_type,
reseller_account_count_bucket AS account_count_bucket,
COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
COUNT(DISTINCT page) AS number_of_pages
FROM app_ads_accounts
GROUP BY source, reseller_account_count_bucket

ORDER BY account_count_bucket ASC
LIMIT 1000
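A note on the bucketing used in this file: `CEIL(n / 100) * 100` rounds each account count up to the next multiple of 100, so counts 1 to 100 share bucket 100 and counts 101 to 200 share bucket 200. A minimal sketch of that behaviour, assuming made-up counts (the literal values are illustrative and not taken from HTTP Archive data):

```sql
-- Illustrative input only: these account counts are hypothetical.
SELECT
  account_count,
  -- INT64 / INT64 yields FLOAT64 in BigQuery, so CEIL sees a fractional value.
  CEIL(account_count / 100) * 100 AS account_count_bucket
FROM UNNEST([1, 99, 100, 101, 250]) AS account_count
ORDER BY account_count
```

If the custom metric omits a count for an account type, `JSON_VALUE` returns NULL and the bucket is NULL as well, so such pages are grouped under a NULL `account_count_bucket` in the output.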
114 changes: 114 additions & 0 deletions sql/2024/privacy/ads_and_sellers_graph.sql
@@ -0,0 +1,114 @@
WITH RECURSIVE pages AS (
SELECT
CASE page -- Publisher websites may redirect to an SSP domain, in which case the redirected domain must be used instead of the page domain. This CASE should be replaced with a more robust solution from HTTPArchive/custom-metrics#136.
WHEN 'https://www.chunkbase.com/' THEN 'cafemedia.com'
ELSE NET.REG_DOMAIN(page)
END AS page_domain,
JSON_QUERY(ANY_VALUE(custom_metrics), '$.ads') AS ads_metrics
FROM `httparchive.all.pages`
WHERE date = '2024-06-01' AND
is_root_page = TRUE
GROUP BY page_domain
), ads AS (
SELECT
page_domain,
JSON_QUERY(ads_metrics, '$.ads.account_types') AS ad_accounts
FROM pages
WHERE
CAST(JSON_VALUE(ads_metrics, '$.ads.account_count') AS INT64) > 0
), sellers AS (
SELECT
page_domain,
JSON_QUERY(ads_metrics, '$.sellers.seller_types') AS ad_sellers
FROM pages
WHERE
CAST(JSON_VALUE(ads_metrics, '$.sellers.seller_count') AS INT64) > 0
), relationships_web AS (
SELECT
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS demand,
'Web' AS supply,
'direct' AS relationship,
page_domain AS publisher
FROM ads, UNNEST(JSON_VALUE_ARRAY(ad_accounts, '$.direct.domains')) AS domain
UNION ALL
SELECT
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS demand,
'Web' AS supply,
'indirect' AS relationship,
page_domain AS publisher
FROM ads, UNNEST(JSON_VALUE_ARRAY(ad_accounts, '$.reseller.domains')) AS domain
UNION ALL
SELECT
page_domain AS demand,
'Web' AS supply,
'direct' AS relationship,
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS publisher
FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.publisher.domains')) AS domain
UNION ALL
SELECT
page_domain AS demand,
'Web' AS supply,
'direct' AS relationship,
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS publisher
FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.both.domains')) AS domain
), relationships_adtech AS (
SELECT
page_domain AS demand,
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS supply,
'indirect' AS relationship
FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.intermediary.domains')) AS domain
UNION ALL
SELECT
page_domain AS demand,
NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS supply,
'indirect' AS relationship
FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.both.domains')) AS domain
), nodes AS (
(
SELECT
demand,
supply,
CONCAT(demand, '-', supply) AS path,
relationship,
HLL_COUNT.INIT(publisher) AS supply_sketch
FROM relationships_web
GROUP BY demand, supply, relationship
)
UNION ALL
(
SELECT
relationships_grouped.demand AS demand,
relationships_grouped.supply AS supply,
CONCAT(relationships_grouped.demand, '-', nodes.path) AS path,
relationships_grouped.relationship AS relationship,
nodes.supply_sketch AS supply_sketch
FROM (
SELECT
demand,
supply,
relationship
FROM relationships_adtech
GROUP BY
demand,
supply,
relationship
) AS relationships_grouped
INNER JOIN nodes
ON relationships_grouped.supply = nodes.demand AND
nodes.supply_sketch IS NOT NULL AND
nodes.relationship = 'indirect' AND
relationships_grouped.demand IS NOT NULL AND
STRPOS(nodes.path, relationships_grouped.demand) = 0
)
)

SELECT
supply,
demand,
HLL_COUNT.MERGE(supply_sketch) AS publishers_count,
relationship,
path
FROM nodes
GROUP BY demand, supply, relationship, path
ORDER BY publishers_count DESC
LIMIT 5000
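A note on the counting approach in this file: the `nodes` CTE keeps publishers as HyperLogLog++ sketches (`HLL_COUNT.INIT`) so that the distinct-publisher information can travel with each path, and the sketches are only collapsed to a number in the final `SELECT` (`HLL_COUNT.MERGE`). A minimal sketch of that init-then-merge pattern, assuming hypothetical edges (all domains below are placeholders):

```sql
-- Hypothetical demand/publisher edges, for illustration only.
WITH edges AS (
  SELECT
    e.demand,
    e.publisher
  FROM UNNEST([
    STRUCT('ssp-a.example' AS demand, 'pub1.example' AS publisher),
    STRUCT('ssp-a.example' AS demand, 'pub2.example' AS publisher),
    STRUCT('ssp-b.example' AS demand, 'pub1.example' AS publisher)
  ]) AS e
), sketches AS (
  -- One sketch per demand partner, built from its publishers.
  SELECT
    demand,
    HLL_COUNT.INIT(publisher) AS supply_sketch
  FROM edges
  GROUP BY demand
)

-- Merge the sketches and return the estimated distinct publisher count.
SELECT
  demand,
  HLL_COUNT.MERGE(supply_sketch) AS publishers_count
FROM sketches
GROUP BY demand
ORDER BY demand
```

HLL++ cardinalities are estimates, so `publishers_count` in the real query is best read as an approximate distinct count rather than an exact one.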
45 changes: 45 additions & 0 deletions sql/2024/privacy/ads_lines_distribution.sql
@@ -0,0 +1,45 @@
WITH pages AS (
SELECT
CASE page -- Publisher websites may redirect to an SSP domain, in which case the redirected domain must be used instead of the page domain.
WHEN 'https://www.chunkbase.com/' THEN 'cafemedia.com'
ELSE NET.REG_DOMAIN(page)
END AS page,
CAST(JSON_VALUE(custom_metrics, '$.ads.ads.line_count') AS INT64) AS ads_line_count,
CAST(JSON_VALUE(custom_metrics, '$.ads.app_ads.line_count') AS INT64) AS app_ads_line_count
FROM `httparchive.all.pages`
WHERE date = '2024-06-01' AND
is_root_page = TRUE
), ads AS (
SELECT
page,
CEIL(ads_line_count / 100) * 100 AS line_count_bucket,
COUNT(DISTINCT page) OVER () AS total_pages
FROM pages
WHERE ads_line_count > 0
), app_ads AS (
SELECT
page,
CEIL(app_ads_line_count / 100) * 100 AS line_count_bucket,
COUNT(DISTINCT page) OVER () AS total_pages
FROM pages
WHERE app_ads_line_count > 0
)

SELECT
'ads.txt' AS type,
line_count_bucket,
COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
COUNT(DISTINCT page) AS number_of_pages
FROM ads
GROUP BY line_count_bucket
HAVING line_count_bucket <= 10000
UNION ALL
SELECT
'app-ads.txt' AS type,
line_count_bucket,
COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
COUNT(DISTINCT page) AS number_of_pages
FROM app_ads
GROUP BY line_count_bucket
HAVING line_count_bucket <= 10000
ORDER BY type, line_count_bucket ASC
31 changes: 31 additions & 0 deletions sql/2024/privacy/ccpa_most_common_phrases.sql
@@ -0,0 +1,31 @@
WITH pages_with_phrase AS (
SELECT
client,
rank_grouping,
page,
COUNT(DISTINCT page) OVER (PARTITION BY client, rank_grouping) AS total_pages_with_phrase_in_rank_group,
JSON_QUERY_ARRAY(custom_metrics, '$.privacy.ccpa_link.CCPALinkPhrases') AS ccpa_link_phrases
FROM `httparchive.all.pages`, --TABLESAMPLE SYSTEM (0.01 PERCENT)
UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS rank_grouping
WHERE date = '2024-06-01' AND
is_root_page = TRUE AND
rank <= rank_grouping AND
ARRAY_LENGTH(JSON_QUERY_ARRAY(custom_metrics, '$.privacy.ccpa_link.CCPALinkPhrases')) > 0
)

SELECT
client,
rank_grouping,
link_phrase,
COUNT(DISTINCT page) AS num_pages,
COUNT(DISTINCT page) / ANY_VALUE(total_pages_with_phrase_in_rank_group) AS pct_pages
FROM pages_with_phrase,
UNNEST(ccpa_link_phrases) AS link_phrase
GROUP BY
link_phrase,
rank_grouping,
client
ORDER BY
rank_grouping,
client,
num_pages DESC
27 changes: 27 additions & 0 deletions sql/2024/privacy/ccpa_prevalence.sql
@@ -0,0 +1,27 @@
WITH pages AS (
SELECT
client,
rank_grouping,
page,
JSON_VALUE(custom_metrics, '$.privacy.ccpa_link.hasCCPALink') AS has_ccpa_link
FROM `httparchive.all.pages`, -- TABLESAMPLE SYSTEM (0.0025 PERCENT)
UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS rank_grouping
WHERE date = '2024-06-01' AND
is_root_page = TRUE AND
rank <= rank_grouping
)

SELECT
client,
rank_grouping,
has_ccpa_link,
COUNT(DISTINCT page) AS num_pages
FROM pages
GROUP BY
has_ccpa_link,
rank_grouping,
client
ORDER BY
rank_grouping,
client,
has_ccpa_link
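A note on the rank grouping in the two CCPA queries: cross joining pages with `UNNEST([...]) AS rank_grouping` and filtering on `rank <= rank_grouping` makes the groups cumulative, so a top-1,000 page is also counted in every larger group. A small sketch of that fan-out, assuming hypothetical pages and ranks:

```sql
-- Hypothetical pages and ranks, for illustration only.
SELECT
  p.page,
  p.rank,
  rank_grouping
FROM UNNEST([
  STRUCT('https://a.example/' AS page, 1000 AS rank),
  STRUCT('https://b.example/' AS page, 100000 AS rank)
]) AS p,
  UNNEST([1000, 10000, 100000]) AS rank_grouping
-- Keep a page in every grouping whose threshold is at or above its rank.
WHERE p.rank <= rank_grouping
ORDER BY p.page, rank_grouping
```

Here the rank-1,000 page appears once per grouping (three rows), while the rank-100,000 page appears only in the largest grouping. Per-group counts therefore overlap and should not be summed across groups.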
29 changes: 29 additions & 0 deletions sql/2024/privacy/common_ads_variables.sql
@@ -0,0 +1,29 @@
WITH pages AS (
SELECT
page,
JSON_QUERY(custom_metrics, '$.ads.ads') AS ads_metrics
FROM `httparchive.all.pages`
WHERE
date = '2024-06-01' AND
is_root_page = TRUE AND
CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_count') AS INT64) > 0
), ads AS (
SELECT
page,
variable,
COUNT(DISTINCT page) OVER() AS total_publishers
FROM pages,
UNNEST(JSON_VALUE_ARRAY(ads_metrics, '$.variables')) AS variable
WHERE
CAST(JSON_VALUE(ads_metrics, '$.account_types.reseller.account_count') AS INT64) > 0 OR
CAST(JSON_VALUE(ads_metrics, '$.account_types.direct.account_count') AS INT64) > 0
)

SELECT
variable,
COUNT(DISTINCT page) / ANY_VALUE(total_publishers) AS pct_publishers,
COUNT(DISTINCT page) AS number_of_publishers
FROM ads
GROUP BY variable
ORDER BY pct_publishers DESC
LIMIT 100
35 changes: 35 additions & 0 deletions sql/2024/privacy/cookies_top_first_party_names.sql
@@ -0,0 +1,35 @@
-- Most common cookie names, by number of domains on which they appear. Goal is to identify common trackers that use first-party cookies across sites.

WITH pages AS (
SELECT
client,
root_page,
custom_metrics,
COUNT(DISTINCT NET.HOST(root_page)) OVER(PARTITION BY client) AS total_domains
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
), cookies AS (
SELECT
client,
cookie,
NET.HOST(JSON_VALUE(cookie, '$.domain')) AS cookie_host,
NET.HOST(root_page) AS firstparty_host,
total_domains
FROM pages,
UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.cookies')) AS cookie
)

SELECT
client,
COUNT(DISTINCT firstparty_host) AS domain_count,
COUNT(DISTINCT firstparty_host) / ANY_VALUE(total_domains) AS pct_domains,
JSON_VALUE(cookie, '$.name') AS cookie_name
FROM cookies
WHERE firstparty_host LIKE '%' || cookie_host
GROUP BY
client,
cookie_name
ORDER BY
domain_count DESC,
client DESC
LIMIT 500
35 changes: 35 additions & 0 deletions sql/2024/privacy/cookies_top_third_party_domains.sql
@@ -0,0 +1,35 @@
WITH pages AS (
SELECT
page,
client,
root_page,
custom_metrics,
COUNT(DISTINCT page) OVER (PARTITION BY client) AS total_pages
FROM `httparchive.all.pages`
WHERE date = '2024-06-01'
), cookies AS (
SELECT
client,
page,
cookie,
NET.HOST(JSON_VALUE(cookie, '$.domain')) AS cookie_host,
NET.HOST(root_page) AS firstparty_host,
total_pages
FROM pages,
UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.cookies')) AS cookie
)

SELECT
client,
cookie_host,
COUNT(DISTINCT page) AS page_count,
COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages
FROM cookies
WHERE firstparty_host NOT LIKE '%' || cookie_host
GROUP BY
client,
cookie_host
ORDER BY
page_count DESC,
client
LIMIT 500
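A note on the first-party test shared by the two cookie queries: a cookie is treated as first party when the page host ends with the cookie host (`firstparty_host LIKE '%' || cookie_host`), and as third party otherwise. A minimal sketch of what that suffix test accepts, assuming hypothetical hosts:

```sql
-- Hypothetical host pairs, for illustration only.
SELECT
  c.firstparty_host,
  c.cookie_host,
  -- Same predicate as in the queries: the page host must end with the cookie host.
  c.firstparty_host LIKE '%' || c.cookie_host AS is_first_party
FROM UNNEST([
  STRUCT('www.example.com' AS firstparty_host, 'example.com' AS cookie_host),
  STRUCT('www.example.com' AS firstparty_host, 'tracker.test' AS cookie_host),
  STRUCT('notexample.com' AS firstparty_host, 'example.com' AS cookie_host)
]) AS c
```

The first pair matches and the second does not. The third pair also matches, because the suffix test is not anchored at a label boundary, so a small number of lookalike hosts may be classified as first party. Any `_` or `%` inside `cookie_host` also acts as a `LIKE` wildcard; `ENDS_WITH(firstparty_host, cookie_host)` would be one way to avoid that, if it matters for the results.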