Privacy 2024 Chapter #3817

Draft
wants to merge 119 commits into
base: main
Commits
251654f
readme
max-ostapenko May 3, 2024
537aa53
copied 2022 SQLs over to update/review
max-ostapenko May 3, 2024
237a280
fixed link
max-ostapenko May 3, 2024
66cccb2
origin trials
max-ostapenko Jun 9, 2024
736a7ab
Bump puppeteer from 22.7.1 to 22.8.0 in /src (#3655)
dependabot[bot] May 7, 2024
86506a3
notebook + readme (#3652)
max-ostapenko May 7, 2024
99aee63
Bump pytest from 8.1.1 to 8.2.0 in /src (#3651)
dependabot[bot] May 7, 2024
f76f679
Translation of privacy chapter to Japanese (#3654)
ksakae1216 May 7, 2024
c29735a
Update Timestamps (#3657)
github-actions[bot] May 7, 2024
47811ca
2023 Performance (#3525)
rviscomi May 15, 2024
46dabdc
Bump puppeteer from 22.8.0 to 22.9.0 in /src (#3662)
dependabot[bot] May 19, 2024
743f560
Upgrade to web-vitals v4 (#3661)
rviscomi May 20, 2024
0983d6c
Bump pytest from 8.2.0 to 8.2.1 in /src (#3664)
dependabot[bot] May 21, 2024
42e4599
--- (#3665)
dependabot[bot] May 22, 2024
22ec347
Bump puppeteer from 22.9.0 to 22.10.0 in /src (#3668)
dependabot[bot] May 24, 2024
1eb5873
Bump jsdom from 24.0.0 to 24.1.0 in /src (#3669)
dependabot[bot] May 27, 2024
c566929
Typofix (#3670)
borisschapira May 28, 2024
2d4bceb
SQL and MD folders the 2024 Web Almanac (#3666)
ChrisBeeti May 29, 2024
2c87cf2
Bump prettier from 3.2.5 to 3.3.0 in /src (#3672)
dependabot[bot] Jun 4, 2024
d60de8c
Bump pytest from 8.2.1 to 8.2.2 in /src (#3673)
dependabot[bot] Jun 5, 2024
4f925b8
Bump prettier from 3.3.0 to 3.3.1 in /src (#3674)
dependabot[bot] Jun 5, 2024
846c710
Fix loaf monitoring bug (#3675)
tunetheweb Jun 5, 2024
524c51c
Update Timestamps (#3677)
github-actions[bot] Jun 5, 2024
6f441ff
Bump web-vitals from 4.0.1 to 4.1.0 in /src (#3678)
dependabot[bot] Jun 7, 2024
689d2ef
fixed link
max-ostapenko May 3, 2024
dd21cf0
remove unreviewed sql
max-ostapenko Jun 9, 2024
be50c36
Merge branch 'main' into privacy-sql-2024
max-ostapenko Jun 9, 2024
320ebbe
lint test
max-ostapenko Jun 9, 2024
21f612e
lint
max-ostapenko Jun 9, 2024
d0c2c35
ads supply graph
max-ostapenko Jul 14, 2024
27511d0
lint
max-ostapenko Jul 14, 2024
8cd1e83
close file
max-ostapenko Jul 14, 2024
da015db
lint
max-ostapenko Jul 14, 2024
b7179e4
top_direct_sellers
max-ostapenko Jul 20, 2024
4de4c61
ads_txt_lines_histogram
max-ostapenko Jul 20, 2024
de249eb
ads_txt_seller_accounts_by_type
max-ostapenko Jul 20, 2024
ddf2ba8
top_ads_variables
max-ostapenko Jul 20, 2024
4e24a59
format
max-ostapenko Jul 20, 2024
3552776
tcf2
max-ostapenko Jul 21, 2024
5cc3695
rename
max-ostapenko Jul 21, 2024
4796653
lint
max-ostapenko Jul 21, 2024
cd6cac0
using custom_metrics
max-ostapenko Jul 25, 2024
9d11fcc
most_common_cname_domains
max-ostapenko Jul 25, 2024
ab54d6a
adguard list
max-ostapenko Aug 4, 2024
d9242dd
gpc
max-ostapenko Aug 4, 2024
17c4455
referrer policy
max-ostapenko Aug 4, 2024
52d57b5
usp
max-ostapenko Aug 4, 2024
234ef27
iab frameworks
max-ostapenko Aug 5, 2024
0bce587
lint
max-ostapenko Aug 5, 2024
8c60240
bounce trackers
max-ostapenko Aug 5, 2024
b1b47bc
Added privacy sandbox related queries
Yash-Vekaria Aug 13, 2024
14136ae
lint
Yash-Vekaria Aug 13, 2024
d6b1db4
missed lint
Yash-Vekaria Aug 13, 2024
a83f88d
dnt
max-ostapenko Aug 14, 2024
cf99788
client hints
max-ostapenko Aug 14, 2024
7fc52f4
whotracksme update
max-ostapenko Aug 14, 2024
95dd276
lint
max-ostapenko Aug 14, 2024
23ce85b
referrer policy
max-ostapenko Aug 14, 2024
27e4d43
rank filter removed
max-ostapenko Aug 14, 2024
109e807
trackers
max-ostapenko Aug 15, 2024
d41de3d
util deps
max-ostapenko Aug 15, 2024
266fa78
limits
max-ostapenko Aug 15, 2024
b90332f
Privacy 2024 queries - CCPA, fingerprinting, cookies (#3720)
bstandaert-wustl Aug 15, 2024
29cccaf
bq to sheets updates
max-ostapenko Aug 15, 2024
b35e6de
query optimisation
max-ostapenko Aug 15, 2024
22a21ef
downgrade for python 3.8
max-ostapenko Aug 15, 2024
7ea017b
more categories
max-ostapenko Aug 15, 2024
ff429ff
more categories and columns reordered
max-ostapenko Aug 15, 2024
5afff7c
forms and formatted logs
max-ostapenko Aug 15, 2024
37c42d3
Refactoring queries to produce output for queries only
Yash-Vekaria Aug 15, 2024
0d39f6b
lint
max-ostapenko Aug 16, 2024
1c4e468
Merge branch 'main' into privacy-sql-2024
max-ostapenko Aug 16, 2024
a239c25
lint
max-ostapenko Aug 16, 2024
baf490d
Privacy Sql Tracking Detection Using Easylist Adservers (#3730)
hadiamjad Aug 16, 2024
4ab293f
log query errors
max-ostapenko Aug 17, 2024
3fb692e
Fixed privacy sandbox attestation query bug
Yash-Vekaria Aug 17, 2024
58dac23
maximum_bytes_billed parameter
max-ostapenko Aug 17, 2024
6f99ae6
moved to chapter root
max-ostapenko Aug 17, 2024
0b4898d
postpone dryrun check
max-ostapenko Aug 17, 2024
5445b92
fingerprinting_most_common_apis: improve resilience to malformed JSON…
bstandaert-wustl Aug 17, 2024
dac1167
optional maximum_bytes_billed parameter
max-ostapenko Aug 17, 2024
3d8cb6d
formatting
max-ostapenko Aug 18, 2024
e8a032a
queries and notebook updates
max-ostapenko Aug 18, 2024
82c084e
queries to rerun
max-ostapenko Aug 18, 2024
ed8944c
origin trials function fix
max-ostapenko Aug 19, 2024
bc6a045
optimised sellers count
max-ostapenko Aug 19, 2024
a917161
apps included in ads.txt lines
max-ostapenko Aug 19, 2024
c51a3e7
another rerun
max-ostapenko Aug 19, 2024
2792d67
lint
max-ostapenko Aug 19, 2024
b2a7f4f
no origins
max-ostapenko Aug 20, 2024
51a71f0
optimized perf
max-ostapenko Aug 20, 2024
23a72c7
more optimized perf
max-ostapenko Aug 20, 2024
c8450a0
graph optimization and OT expiration
max-ostapenko Aug 21, 2024
17ded3e
Merge remote-tracking branch 'origin/main' into privacy-sql-2024
max-ostapenko Aug 21, 2024
e29a3eb
earlier grouping for performance
max-ostapenko Aug 21, 2024
975f7c8
graph fixes
max-ostapenko Aug 21, 2024
fda33dd
cookies, ccpa, fingerprinting: calculate percent of total pages
bstandaert-wustl Aug 22, 2024
0c30a7a
query for top third-party cookie names
bstandaert-wustl Aug 24, 2024
cfde873
bq writer module
max-ostapenko Sep 18, 2024
ac6e895
add grouping
max-ostapenko Oct 1, 2024
fe31518
domain suffixes and regexes removed
max-ostapenko Oct 1, 2024
741b655
Merge remote-tracking branch 'origin/main' into privacy-sql-2024
max-ostapenko Oct 28, 2024
860d3d4
Merge remote-tracking branch 'origin/privacy-sql-2024' into privacy-m…
max-ostapenko Oct 28, 2024
c9862cc
chapter draft
max-ostapenko Oct 28, 2024
6375c38
staging config
max-ostapenko Oct 28, 2024
e5dbdd9
render draft
max-ostapenko Oct 28, 2024
3c9ad62
clear chapter
max-ostapenko Oct 28, 2024
9d2f072
Privacy 2024 chapter: stateful and stateless tracking (#3813)
bstandaert-wustl Oct 28, 2024
ac20903
Merge branch 'privacy-markdown-2024' of https://github.com/HTTPArchiv…
max-ostapenko Oct 28, 2024
8550e40
test charts
max-ostapenko Oct 28, 2024
dd81874
Optimised images with calibre/image-actions
github-actions[bot] Oct 28, 2024
f31b08b
Merge branch 'privacy-markdown-2024' of https://github.com/HTTPArchiv…
max-ostapenko Oct 28, 2024
186a2e4
cross-site tracking charts images
max-ostapenko Oct 30, 2024
ae36270
Optimised images with calibre/image-actions
github-actions[bot] Oct 30, 2024
8b76a27
Merge branch 'privacy-markdown-2024' of https://github.com/HTTPArchiv…
max-ostapenko Oct 30, 2024
6cc381e
cname and bounce
max-ostapenko Oct 30, 2024
173b731
rename
max-ostapenko Oct 30, 2024
bd13f3e
fixes
max-ostapenko Oct 30, 2024
781d0af
rollback
max-ostapenko Oct 30, 2024
64 changes: 64 additions & 0 deletions sql/2024/privacy/ads_accounts_distribution.sql
@@ -0,0 +1,64 @@
WITH publishers AS (
  SELECT
    page,
    JSON_QUERY(custom_metrics, '$.ads.ads.account_types') AS ads_account_types,
    JSON_QUERY(custom_metrics, '$.ads.app_ads.account_types') AS app_ads_account_types
  FROM `httparchive.all.pages`
  WHERE date = '2024-06-01' AND
    is_root_page = TRUE AND
    (CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_count') AS INT64) > 0 OR
      CAST(JSON_VALUE(custom_metrics, '$.ads.app_ads.account_count') AS INT64) > 0)
), ads_accounts AS (
  SELECT
    page,
    CEIL(CAST(JSON_VALUE(ads_account_types, '$.direct.account_count') AS INT64) / 100) * 100 AS direct_account_count_bucket,
    CEIL(CAST(JSON_VALUE(ads_account_types, '$.reseller.account_count') AS INT64) / 100) * 100 AS reseller_account_count_bucket,
    COUNT(DISTINCT page) OVER () AS total_pages
  FROM publishers
), app_ads_accounts AS (
  SELECT
    page,
    CEIL(CAST(JSON_VALUE(app_ads_account_types, '$.direct.account_count') AS INT64) / 100) * 100 AS direct_account_count_bucket,
    CEIL(CAST(JSON_VALUE(app_ads_account_types, '$.reseller.account_count') AS INT64) / 100) * 100 AS reseller_account_count_bucket,
    COUNT(DISTINCT page) OVER () AS total_pages
  FROM publishers
)

SELECT
  'ads' AS source,
  'direct' AS account_type,
  direct_account_count_bucket AS account_count_bucket,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
  COUNT(DISTINCT page) AS number_of_pages
FROM ads_accounts
GROUP BY source, direct_account_count_bucket
UNION ALL
SELECT
  'ads' AS source,
  'reseller' AS account_type,
  reseller_account_count_bucket AS account_count_bucket,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
  COUNT(DISTINCT page) AS number_of_pages
FROM ads_accounts
GROUP BY source, reseller_account_count_bucket
UNION ALL
SELECT
  'app_ads' AS source,
  'direct' AS account_type,
  direct_account_count_bucket AS account_count_bucket,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
  COUNT(DISTINCT page) AS number_of_pages
FROM app_ads_accounts
GROUP BY source, direct_account_count_bucket
UNION ALL
SELECT
  'app_ads' AS source,
  'reseller' AS account_type,
  reseller_account_count_bucket AS account_count_bucket,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
  COUNT(DISTINCT page) AS number_of_pages
FROM app_ads_accounts
GROUP BY source, reseller_account_count_bucket

ORDER BY account_count_bucket ASC
LIMIT 1000
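
The bucketing expression `CEIL(x / 100) * 100` used in `ads_accounts_distribution.sql` rounds each count up to the next multiple of 100. A minimal sketch on hypothetical literal counts (not taken from the crawl data) may make the bucket boundaries easier to review:

```sql
-- Hypothetical sample counts, for illustration only.
-- INT64 / INT64 yields FLOAT64 in BigQuery, so CEIL sees 0.01, 1.0, 1.01, 2.5.
SELECT
  account_count,
  CEIL(account_count / 100) * 100 AS account_count_bucket
FROM UNNEST([1, 100, 101, 250]) AS account_count
ORDER BY account_count
```

Under this expression, counts from 1 to 100 share the bucket labelled 100, 101 falls into 200, and 250 into 300, so each label is the inclusive upper bound of its bucket.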
114 changes: 114 additions & 0 deletions sql/2024/privacy/ads_and_sellers_graph.sql
@@ -0,0 +1,114 @@
WITH RECURSIVE pages AS (
  SELECT
    CASE page -- Publisher websites may redirect to an SSP domain; in that case, use the redirected domain instead of the page domain
      WHEN 'https://www.chunkbase.com/' THEN 'cafemedia.com'
      ELSE NET.REG_DOMAIN(page)
    END AS page,
    custom_metrics
  FROM `httparchive.all.pages`
  WHERE date = '2024-06-01' AND
    client = 'mobile' AND
    is_root_page = TRUE
), ads AS (
  SELECT
    page,
    JSON_QUERY(custom_metrics, '$.ads.ads.account_types') AS ad_accounts
  FROM pages
  WHERE
    CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_count') AS INT64) > 0
), sellers AS (
  SELECT
    page,
    JSON_QUERY(custom_metrics, '$.ads.sellers.seller_types') AS ad_sellers
  FROM pages
  WHERE
    CAST(JSON_VALUE(custom_metrics, '$.ads.sellers.seller_count') AS INT64) > 0
), relationships_web AS (
  SELECT
    NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS demand,
    'Web' AS supply,
    'direct' AS relationship,
    page AS publisher
  FROM ads, UNNEST(JSON_VALUE_ARRAY(ad_accounts, '$.direct.domains')) AS domain
  UNION ALL
  SELECT
    NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS demand,
    'Web' AS supply,
    'indirect' AS relationship,
    page AS publisher
  FROM ads, UNNEST(JSON_VALUE_ARRAY(ad_accounts, '$.reseller.domains')) AS domain
  UNION ALL
  SELECT
    page AS demand,
    'Web' AS supply,
    'direct' AS relationship,
    NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS publisher
  FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.publisher.domains')) AS domain
  UNION ALL
  SELECT
    page AS demand,
    'Web' AS supply,
    'direct' AS relationship,
    NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS publisher
  FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.both.domains')) AS domain
), relationships_adtech AS (
  SELECT
    page AS demand,
    NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS supply,
    'indirect' AS relationship
  FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.intermediary.domains')) AS domain
  UNION ALL
  SELECT
    page AS demand,
    NET.REG_DOMAIN(REGEXP_EXTRACT(NORMALIZE_AND_CASEFOLD(domain), r'\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')) AS supply,
    'indirect' AS relationship
  FROM sellers, UNNEST(JSON_VALUE_ARRAY(ad_sellers, '$.both.domains')) AS domain
), nodes AS (
  (
    SELECT
      demand,
      supply,
      CONCAT(demand, '-', supply) AS path,
      relationship,
      HLL_COUNT.INIT(publisher) AS supply_sketch
    FROM relationships_web
    GROUP BY demand, supply, relationship
  )
  UNION ALL
  (
    SELECT
      relationships_grouped.demand AS demand,
      relationships_grouped.supply AS supply,
      CONCAT(relationships_grouped.demand, '-', nodes.path) AS path,
      relationships_grouped.relationship AS relationship,
      nodes.supply_sketch AS supply_sketch
    FROM (
      SELECT
        demand,
        supply,
        relationship
      FROM relationships_adtech
      GROUP BY
        demand,
        supply,
        relationship
    ) AS relationships_grouped
    INNER JOIN nodes
    ON relationships_grouped.supply = nodes.demand AND
      nodes.supply_sketch IS NOT NULL AND
      nodes.relationship = 'indirect' AND
      relationships_grouped.demand IS NOT NULL AND
      STRPOS(nodes.path, relationships_grouped.demand) = 0
  )
)

SELECT
  supply,
  demand,
  HLL_COUNT.MERGE(supply_sketch) AS publishers_count,
  relationship,
  path
FROM nodes
GROUP BY demand, supply, relationship, path
ORDER BY publishers_count DESC
LIMIT 5000
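
`ads_and_sellers_graph.sql` carries publisher sets through the recursion as HyperLogLog++ sketches (`HLL_COUNT.INIT`) and only turns them into counts at the end (`HLL_COUNT.MERGE`). The sketch below shows that two-step pattern in isolation; the domains are hypothetical placeholders, and the result is an approximate distinct count rather than an exact one:

```sql
-- Hypothetical demand/publisher pairs, for illustration only.
WITH relationships AS (
  SELECT 'ssp.example' AS demand, 'pub-a.example' AS publisher UNION ALL
  SELECT 'ssp.example', 'pub-b.example' UNION ALL
  SELECT 'exchange.example', 'pub-b.example' UNION ALL
  SELECT 'exchange.example', 'pub-c.example'
), sketches AS (
  -- One sketch per demand partner, summarising its set of publishers.
  SELECT
    demand,
    HLL_COUNT.INIT(publisher) AS supply_sketch
  FROM relationships
  GROUP BY demand
)

-- Merging the sketches estimates the number of distinct publishers across all partners.
SELECT
  HLL_COUNT.MERGE(supply_sketch) AS publishers_count
FROM sketches
```

Because `pub-b.example` appears under both partners, the merged estimate should be close to three rather than four, which is the reason for merging sketches instead of summing per-partner counts.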
45 changes: 45 additions & 0 deletions sql/2024/privacy/ads_lines_distribution.sql
@@ -0,0 +1,45 @@
WITH pages AS (
  SELECT
    CASE page -- Publisher websites may redirect to an SSP domain; in that case, use the redirected domain instead of the page domain
      WHEN 'https://www.chunkbase.com/' THEN 'cafemedia.com'
      ELSE NET.REG_DOMAIN(page)
    END AS page,
    CAST(JSON_VALUE(custom_metrics, '$.ads.ads.line_count') AS INT64) AS ads_line_count,
    CAST(JSON_VALUE(custom_metrics, '$.ads.app_ads.line_count') AS INT64) AS app_ads_line_count
  FROM `httparchive.all.pages`
  WHERE date = '2024-06-01' AND
    is_root_page = TRUE
), ads AS (
  SELECT
    page,
    CEIL(ads_line_count / 100) * 100 AS line_count_bucket,
    COUNT(DISTINCT page) OVER () AS total_pages
  FROM pages
  WHERE ads_line_count > 0
), app_ads AS (
  SELECT
    page,
    CEIL(app_ads_line_count / 100) * 100 AS line_count_bucket,
    COUNT(DISTINCT page) OVER () AS total_pages
  FROM pages
  WHERE app_ads_line_count > 0
)

SELECT
  'ads.txt' AS type,
  line_count_bucket,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
  COUNT(DISTINCT page) AS number_of_pages
FROM ads
GROUP BY line_count_bucket
HAVING line_count_bucket <= 10000
UNION ALL
SELECT
  'app-ads.txt' AS type,
  line_count_bucket,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages,
  COUNT(DISTINCT page) AS number_of_pages
FROM app_ads
GROUP BY line_count_bucket
HAVING line_count_bucket <= 10000
ORDER BY type, line_count_bucket ASC
5 changes: 5 additions & 0 deletions sql/2024/privacy/ccpa_most_common_phrases.sql
@@ -0,0 +1,5 @@
WITH pages_with_phrase AS (
  SELECT
    client,
    rank_grouping,
    page,
    COUNT(DISTINCT page) OVER (PARTITION BY client, rank_grouping) AS total_pages_with_phrase_in_rank_group,
    JSON_QUERY_ARRAY(custom_metrics, '$.privacy.ccpa_link.CCPALinkPhrases') AS ccpa_link_phrases
  FROM `httparchive.all.pages`, -- TABLESAMPLE SYSTEM (0.01 PERCENT)
    UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS rank_grouping
  WHERE date = '2024-06-01' AND
    is_root_page = TRUE AND
    rank <= rank_grouping AND
    ARRAY_LENGTH(JSON_QUERY_ARRAY(custom_metrics, '$.privacy.ccpa_link.CCPALinkPhrases')) > 0
)

SELECT
  client,
  rank_grouping,
  link_phrase,
  COUNT(DISTINCT page) AS num_pages,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages_with_phrase_in_rank_group) AS pct_pages
FROM pages_with_phrase,
  UNNEST(ccpa_link_phrases) AS link_phrase
GROUP BY link_phrase, rank_grouping, client
ORDER BY rank_grouping, client, num_pages DESC
6 changes: 6 additions & 0 deletions sql/2024/privacy/ccpa_prevalence.sql
@@ -0,0 +1,6 @@
WITH pages AS (
  SELECT
    client,
    rank_grouping,
    page,
    JSON_VALUE(custom_metrics, '$.privacy.ccpa_link.hasCCPALink') AS has_ccpa_link
  FROM `httparchive.all.pages`, -- TABLESAMPLE SYSTEM (0.0025 PERCENT)
    UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS rank_grouping
  WHERE date = '2024-06-01' AND
    is_root_page = TRUE AND
    rank <= rank_grouping
)

SELECT
  client,
  rank_grouping,
  has_ccpa_link,
  COUNT(DISTINCT page) AS num_pages
FROM pages
GROUP BY has_ccpa_link, rank_grouping, client
ORDER BY rank_grouping, client, has_ccpa_link
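
Both CCPA queries cross join pages with a list of rank thresholds and keep rows where `rank <= rank_grouping`, which makes the rank groups cumulative: a top-1,000 page is also counted in every larger group. A minimal sketch with two hypothetical pages and a shortened threshold list:

```sql
-- Hypothetical pages and ranks, for illustration only.
WITH pages AS (
  SELECT 'https://a.example/' AS page, 1000 AS rank UNION ALL
  SELECT 'https://b.example/', 100000
)

SELECT
  rank_grouping,
  COUNT(DISTINCT page) AS num_pages
FROM pages,
  UNNEST([1000, 10000, 100000]) AS rank_grouping
WHERE rank <= rank_grouping
GROUP BY rank_grouping
ORDER BY rank_grouping
```

Here the 1,000 and 10,000 groups each contain one page and the 100,000 group contains both, so per-group counts should not be summed across groups.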
29 changes: 29 additions & 0 deletions sql/2024/privacy/common_ads_variables.sql
@@ -0,0 +1,29 @@
WITH pages AS (
  SELECT
    page,
    JSON_QUERY(custom_metrics, '$.ads.ads') AS ads_metrics
  FROM `httparchive.all.pages`
  WHERE
    date = '2024-06-01' AND
    is_root_page = TRUE AND
    CAST(JSON_VALUE(custom_metrics, '$.ads.ads.account_count') AS INT64) > 0
), ads AS (
  SELECT
    page,
    variable,
    COUNT(DISTINCT page) OVER () AS total_publishers
  FROM pages,
    UNNEST(JSON_VALUE_ARRAY(ads_metrics, '$.variables')) AS variable
  WHERE
    CAST(JSON_VALUE(ads_metrics, '$.account_types.reseller.account_count') AS INT64) > 0 OR
    CAST(JSON_VALUE(ads_metrics, '$.account_types.direct.account_count') AS INT64) > 0
)

SELECT
  variable,
  COUNT(DISTINCT page) / ANY_VALUE(total_publishers) AS pct_publishers,
  COUNT(DISTINCT page) AS number_of_publishers
FROM ads
GROUP BY variable
ORDER BY pct_publishers DESC
LIMIT 100
10 changes: 10 additions & 0 deletions sql/2024/privacy/cookies_top_first_party_names.sql
@@ -0,0 +1,10 @@
-- Most common cookie names, by number of first-party hosts on which they appear. The goal is to identify common trackers that use first-party cookies across sites.

WITH pages AS (
  SELECT
    client,
    root_page,
    custom_metrics,
    COUNT(DISTINCT NET.HOST(root_page)) OVER (PARTITION BY client) AS total_domains
  FROM `httparchive.all.pages` -- TABLESAMPLE SYSTEM (0.00001 PERCENT)
  WHERE date = '2024-06-01'
), cookies AS (
  SELECT
    client,
    cookie,
    NET.HOST(JSON_VALUE(cookie, '$.domain')) AS cookie_host,
    NET.HOST(root_page) AS firstparty_host,
    total_domains
  FROM pages,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.cookies')) AS cookie
)

SELECT
  client,
  COUNT(DISTINCT firstparty_host) AS domain_count,
  COUNT(DISTINCT firstparty_host) / ANY_VALUE(total_domains) AS pct_domains,
  JSON_VALUE(cookie, '$.name') AS cookie_name
FROM cookies
WHERE firstparty_host LIKE '%' || cookie_host
GROUP BY client, cookie_name
ORDER BY domain_count DESC, client DESC
LIMIT 500
8 changes: 8 additions & 0 deletions sql/2024/privacy/cookies_top_third_party_domains.sql
@@ -0,0 +1,8 @@
WITH pages AS (
  SELECT
    page,
    client,
    root_page,
    custom_metrics,
    COUNT(DISTINCT page) OVER (PARTITION BY client) AS total_pages
  FROM `httparchive.all.pages` -- TABLESAMPLE SYSTEM (0.1 PERCENT)
  WHERE date = '2024-06-01'
), cookies AS (
  SELECT
    client,
    page,
    cookie,
    NET.HOST(JSON_VALUE(cookie, '$.domain')) AS cookie_host,
    NET.HOST(root_page) AS firstparty_host,
    total_pages
  FROM pages,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.cookies')) AS cookie
)

SELECT
  client,
  cookie_host,
  COUNT(DISTINCT page) AS page_count,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages
FROM cookies
WHERE firstparty_host NOT LIKE '%' || cookie_host
GROUP BY client, cookie_host
ORDER BY page_count DESC, client
LIMIT 500
10 changes: 10 additions & 0 deletions sql/2024/privacy/cookies_top_third_party_names.sql
@@ -0,0 +1,10 @@
-- Most common third-party cookie names, by number of first-party hosts on which they appear. The goal is to identify common trackers that set cookies using many domains.

WITH pages AS (
  SELECT
    client,
    root_page,
    custom_metrics,
    COUNT(DISTINCT NET.HOST(root_page)) OVER (PARTITION BY client) AS total_domains
  FROM `httparchive.all.pages` -- TABLESAMPLE SYSTEM (0.00001 PERCENT)
  WHERE date = '2024-06-01'
), cookies AS (
  SELECT
    client,
    cookie,
    NET.HOST(JSON_VALUE(cookie, '$.domain')) AS cookie_host,
    NET.HOST(root_page) AS firstparty_host,
    total_domains
  FROM pages,
    UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.cookies')) AS cookie
)

SELECT
  client,
  COUNT(DISTINCT firstparty_host) AS domain_count,
  COUNT(DISTINCT firstparty_host) / ANY_VALUE(total_domains) AS pct_domains,
  JSON_VALUE(cookie, '$.name') AS cookie_name
FROM cookies
WHERE firstparty_host NOT LIKE '%' || cookie_host
GROUP BY client, cookie_name
ORDER BY domain_count DESC, client DESC
LIMIT 500
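
The three cookie queries classify a cookie as first-party when the page host ends with the cookie host (`firstparty_host LIKE '%' || cookie_host`) and as third-party otherwise. The sketch below applies the same predicate to hypothetical host pairs, including one edge case that reviewers may want to keep in mind:

```sql
-- Hypothetical host pairs, for illustration only.
WITH cookies AS (
  SELECT 'www.example.com' AS firstparty_host, 'example.com' AS cookie_host UNION ALL
  SELECT 'www.example.com', 'tracker.test' UNION ALL
  SELECT 'notexample.com', 'example.com'
)

SELECT
  firstparty_host,
  cookie_host,
  firstparty_host LIKE '%' || cookie_host AS is_first_party
FROM cookies
```

The first pair is classified as first-party and the second as third-party, as intended. The third pair is also classified as first-party because the match is a plain string suffix with no check for a label boundary, so a small number of unrelated hosts may be misclassified.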
38 changes: 38 additions & 0 deletions sql/2024/privacy/easylist-tracker-detection.sql
@@ -0,0 +1,38 @@
CREATE TEMP FUNCTION
CheckDomainInURL(url STRING, domain STRING)
RETURNS INT64
LANGUAGE js AS """
  return url.includes(domain) ? 1 : 0;
""";

-- The `httparchive.almanac.easylist_adservers` table must be populated from `easylist_adservers.csv`, which provides the list of domains to block
-- https://github.com/easylist/easylist/blob/master/easylist/easylist_adservers.txt
WITH easylist_data AS (
  SELECT string_field_0
  FROM `httparchive.almanac.easylist_adservers`
  WHERE string_field_0 IS NOT NULL
),
requests_data AS (
  SELECT url
  FROM `httparchive.all.requests`
  WHERE
    date = '2024-06-01' AND
    is_root_page = TRUE
),
-- Only matching URLs are counted, so an inner join is sufficient and the UDF is evaluated once per URL/domain pair
blocked_urls AS (
  SELECT DISTINCT r.url
  FROM requests_data r
  INNER JOIN easylist_data e
  ON CheckDomainInURL(r.url, e.string_field_0) = 1
)
SELECT
  COUNT(0) AS blocked_url_count
FROM blocked_urls;
19 changes: 19 additions & 0 deletions sql/2024/privacy/fingerprinting_most_common_apis.sql
@@ -0,0 +1,19 @@
CREATE TEMP FUNCTION getFingerprintingTypes(input STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  if (input) {
    try {
      return Object.keys(JSON.parse(input))
    } catch (e) {
      return []
    }
  } else {
    return []
  }
""";

WITH pages AS (
  SELECT
    client,
    page,
    fingerprinting_type,
    COUNT(DISTINCT page) OVER (PARTITION BY client) AS total_pages
  FROM `httparchive.all.pages`, -- TABLESAMPLE SYSTEM (0.01 PERCENT)
    UNNEST(getFingerprintingTypes(JSON_EXTRACT(custom_metrics, '$.privacy.fingerprinting.counts'))) AS fingerprinting_type
  WHERE date = '2024-06-01'
)

SELECT
  client,
  fingerprinting_type,
  COUNT(DISTINCT page) AS page_count,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages
FROM pages
GROUP BY client, fingerprinting_type
ORDER BY page_count DESC
6 changes: 6 additions & 0 deletions sql/2024/privacy/fingerprinting_most_common_scripts.sql
@@ -0,0 +1,6 @@
WITH pages AS (
  SELECT
    page,
    client,
    custom_metrics,
    COUNT(DISTINCT page) OVER (PARTITION BY client) AS total_pages
  FROM `httparchive.all.pages` -- TABLESAMPLE SYSTEM (0.001 PERCENT)
  WHERE date = '2024-06-01'
)

SELECT
  client,
  script,
  COUNT(DISTINCT page) AS page_count,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages
FROM pages,
  UNNEST(JSON_QUERY_ARRAY(custom_metrics, '$.privacy.fingerprinting.likelyFingerprintingScripts')) AS script
GROUP BY client, script
ORDER BY page_count DESC
LIMIT 100;
5 changes: 5 additions & 0 deletions sql/2024/privacy/fingerprinting_script_count.sql
@@ -0,0 +1,5 @@
WITH pages AS (
  SELECT
    page,
    client,
    ARRAY_LENGTH(JSON_QUERY_ARRAY(custom_metrics, '$.privacy.fingerprinting.likelyFingerprintingScripts')) AS script_count,
    COUNT(DISTINCT page) OVER (PARTITION BY client) AS total_pages
  FROM `httparchive.all.pages` -- TABLESAMPLE SYSTEM (0.01 PERCENT)
  WHERE date = '2024-06-01'
)

SELECT
  script_count,
  client,
  COUNT(DISTINCT page) AS page_count,
  COUNT(DISTINCT page) / ANY_VALUE(total_pages) AS pct_pages
FROM pages
GROUP BY script_count, client
ORDER BY script_count ASC;