Aligning technology origins and adoption metric between the crux and the crawl data #42

Open
max-ostapenko opened this issue Jan 8, 2025 · 1 comment

Comments

@max-ostapenko
Contributor

Quoting @rviscomi :

The number seemed high because I anecdotally know WordPress market share of CMSs is around 75%, but we're only showing 3.2M WordPress origins in the CMS report.
So if there are 8.8M sites that use a CMS, that puts WordPress's market share at 36%, which is way too low.
The issue seems to be that the WordPress count is taken after joining with the CrUX dataset, and many sites have fallen out of CrUX.
Modifying your query to count WordPress sites in November:

SELECT
  COUNT(DISTINCT root_page)
FROM crawl.pages
WHERE date = '2024-11-01'
  AND client = 'mobile'
  AND 'WordPress' IN UNNEST(technologies.technology)

The result is 5785472, which gets us much closer to the expected market share: 65%.
So there are about 2.5M WordPress sites that we're counting in the category total but not in the technology total.
Open to suggestions on how to fix this. One idea is to remove the CrUX join (or do some sort of outer join) when calculating origin counts.
Yeah we subtly changed the name from "CWV Tech Report" to "HTTP Archive Tech Report" so that we could lean more heavily onto the adoption side, so joining forces makes a lot of sense


I see 2 issues here:

  1. We use different URL sets in the report: the November crawl is based on the October CrUX origin list, but we JOIN it with November CrUX. The result can be either complete or timely, not both. (0.6M discrepancy)
  2. We do not use the tablet and NULL clients from CrUX, which leaves more origins unmatched (1.9M). No geo or rank is available for aggregation.
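
To see how many crawled origins drop out at the join, a LEFT JOIN with a NULL check is one option. This is a rough sketch, not the report's actual query. The CrUX table name, its `date`, `origin`, and `device` columns, and the trailing-slash normalization are assumptions that should be checked against the real schemas.

```sql
-- Sketch: count WordPress origins in the crawl that have no CrUX match.
WITH wordpress_origins AS (
  SELECT DISTINCT root_page
  FROM crawl.pages
  WHERE date = '2024-11-01'
    AND client = 'mobile'
    AND 'WordPress' IN UNNEST(technologies.technology)
),
crux_origins AS (
  -- Assumed CrUX source: one row per origin and device for the month.
  -- Assumes crawl root pages end with '/' while CrUX origins do not.
  SELECT DISTINCT CONCAT(origin, '/') AS root_page
  FROM `chrome-ux-report.materialized.device_summary`
  WHERE date = '2024-11-01'
    AND device = 'phone'
)
SELECT
  COUNT(*) AS crawled_origins,
  COUNTIF(crux.root_page IS NULL) AS unmatched_origins
FROM wordpress_origins AS wp
LEFT JOIN crux_origins AS crux
  ON wp.root_page = crux.root_page
```

Changing the CrUX `date` to the October release, or dropping the `device` filter, should show how much of the gap each of the two issues explains.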

A promising analysis logic

Calculate adoption from the crawl data, since it is the original source.
This lets us compute adoption over the most complete set of origins, including those that come from CrUX's tablet and NULL clients.
This only works for global totals: the geo dimension is part of CrUX and is not available in the crawl data. For geo breakdowns we could still use an INNER JOIN.
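
A minimal sketch of the crawl-only calculation for a single category might look like the query below. It assumes `technologies` is a repeated record with `technology` and `categories` fields, and that the category is named 'CMS'. Neither assumption is confirmed in this thread.

```sql
-- Sketch: global adoption share per CMS, computed from crawl data only (no CrUX join).
WITH tech_origins AS (
  SELECT
    t.technology AS technology,
    COUNT(DISTINCT root_page) AS tech_count
  FROM crawl.pages,
    UNNEST(technologies) AS t
  WHERE date = '2024-11-01'
    AND client = 'mobile'
    AND 'CMS' IN UNNEST(t.categories)  -- assumed category label
  GROUP BY technology
),
category_total AS (
  -- Origins using at least one CMS; each origin is counted once.
  SELECT COUNT(DISTINCT root_page) AS category_count
  FROM crawl.pages,
    UNNEST(technologies) AS t
  WHERE date = '2024-11-01'
    AND client = 'mobile'
    AND 'CMS' IN UNNEST(t.categories)
)
SELECT
  technology,
  tech_count,
  category_count,
  ROUND(100 * tech_count / category_count, 1) AS share_pct
FROM tech_origins
CROSS JOIN category_total
ORDER BY share_pct DESC
```

Because the numerator and denominator are drawn from the same origin set, the share no longer depends on which origins survive a CrUX join.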

@tunetheweb
Member

What would the adoption share be if we included CrUX data in the total, i.e. if we insisted on origins being in both datasets?
