SEO 2024 queries #3791

Open: wants to merge 3 commits into main

@henryp25 henryp25 commented Oct 14, 2024

Makes progress on #3600

This PR adds the finalized SQL files, which now include an is_root_page element that distinguishes homepages from secondary pages. All SQL files use the June dataset, since that is the dataset the queries were originally built against.

Context:
These changes finalize the SQL queries for the 2024 SEO analysis. The new is_root_page element separates homepage data from data for other pages, so the two groups can be analyzed independently. Minor updates were also applied to the 2022 SQL queries to align them with the new dataset structure, and Common Table Expressions (CTEs) were introduced to improve efficiency and query readability.

Changes Made:

  • Introduced an is_root_page element to separate homepage and secondary page data.
  • Updated all queries to use the June dataset for consistency with the original development.
  • Slightly modified the 2022 SQL files for compatibility with the new dataset, and added CTEs to improve efficiency.
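
As a rough illustration of the is_root_page split described above, here is a minimal sketch rather than one of the PR's actual queries. The inline rows and the `pages` CTE are hypothetical stand-ins for `httparchive.all.pages`, so the query runs without touching the real dataset. Only the column names (client, page, is_root_page) are taken from the queries in this PR.

```sql
-- Hypothetical inline rows standing in for `httparchive.all.pages`.
-- A real query would read from the table and filter to the June crawl.
WITH pages AS (
  SELECT * FROM UNNEST([
    STRUCT('mobile' AS client, 'https://a.example/' AS page, TRUE AS is_root_page),
    STRUCT('mobile', 'https://a.example/about', FALSE),
    STRUCT('mobile', 'https://b.example/', TRUE),
    STRUCT('desktop', 'https://a.example/', TRUE)
  ])
)

-- One output row per (client, is_root_page) pair, so homepages and
-- secondary pages are reported separately instead of being mixed.
SELECT
  client,
  is_root_page,
  COUNT(DISTINCT page) AS pages
FROM
  pages
GROUP BY
  client,
  is_root_page
ORDER BY
  client,
  is_root_page;
```

Grouping by is_root_page alongside client is the assumed pattern here. The individual SQL files may apply the split differently, for example as a filter.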

@tunetheweb tunetheweb changed the title from "Add finalized SQL files with is_root_page element for improved efficiency" to "SEO 2024 queries" Oct 14, 2024
@tunetheweb tunetheweb added the "analysis" label (Querying the dataset) Oct 14, 2024
@tunetheweb tunetheweb (Member) left a comment

Overall LGTM with a couple of small comments.

Let me know when good to merge.

page,
getLoadingPropertyMarkupInfo(JSON_EXTRACT_SCALAR(payload, '$._markup')) AS loading_property_markup_info
FROM
`httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)

This should be removed. Was the query run on the full dataset?

Suggested change
- `httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)
+ `httparchive.all.pages`

page AS site,
getRobotsSize(payload) AS robots_size
FROM
`httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)

This should be removed. Was the query run on the full dataset?

Suggested change
- `httparchive.all.pages` TABLESAMPLE SYSTEM (0.01 PERCENT)
+ `httparchive.all.pages`

header.value AS request_header_value,
COUNT(DISTINCT page) AS sites,
SUM(COUNT(DISTINCT page)) OVER (PARTITION BY client, is_root_page) AS total,
SAFE_DIVIDE(COUNT(0), SUM(COUNT(0)) OVER ()) AS pct

Is this OVER () correct?
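
To make the question concrete, here is a minimal sketch of how the two window definitions differ. It does not say which one the author intended. The `requests` CTE and its inline rows are hypothetical; only the expressions `SAFE_DIVIDE(COUNT(0), SUM(COUNT(0)) OVER ())` and `PARTITION BY client, is_root_page` come from the hunk above.

```sql
-- Hypothetical inline rows; the real query reads HTTP Archive request data.
WITH requests AS (
  SELECT * FROM UNNEST([
    STRUCT('mobile' AS client, TRUE AS is_root_page, 'gzip' AS header_value),
    STRUCT('mobile', TRUE, 'gzip'),
    STRUCT('mobile', TRUE, 'br'),
    STRUCT('mobile', FALSE, 'gzip'),
    STRUCT('desktop', TRUE, 'br')
  ])
)

SELECT
  client,
  is_root_page,
  header_value,
  COUNT(0) AS freq,
  -- Empty window: the denominator is the row count across every client
  -- and both is_root_page values.
  SAFE_DIVIDE(COUNT(0), SUM(COUNT(0)) OVER ()) AS pct_of_all_rows,
  -- Partitioned window: the denominator is the row count within the same
  -- client and is_root_page group, matching how `total` is partitioned.
  SAFE_DIVIDE(COUNT(0), SUM(COUNT(0)) OVER (PARTITION BY client, is_root_page)) AS pct_within_group
FROM
  requests
GROUP BY
  client,
  is_root_page,
  header_value
ORDER BY
  client,
  is_root_page,
  header_value;
```

With these sample rows, gzip on mobile root pages accounts for 2 of the 5 rows overall but 2 of the 3 rows in its own group, so the two columns give different percentages. If pct is meant to be read per client and page type, the partitioned form is presumably the one wanted.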

is_root_page,
REGEXP_CONTAINS(LOWER(IFNULL(request_headers[SAFE_OFFSET(0)].name, '')), r'user-agent') AS resp_vary_user_agent,
COUNT(0) AS freq,
SAFE_DIVIDE(COUNT(0), SUM(COUNT(0)) OVER ()) AS pct

Ditto
