Technologies are missing in 2024-12-01 crawl #31

Open
max-ostapenko opened this issue Jan 9, 2025 · 2 comments

@max-ostapenko

Checked like this:

SELECT
  date,
  client,
  category,
  COUNT(DISTINCT root_page)
FROM crawl.pages,
  UNNEST(technologies) AS tech,
  UNNEST(tech.categories) AS category
WHERE
  date >= '2024-10-01'
  AND tech.technology = 'WordPress'
GROUP BY 1,2,3
ORDER BY 1,2,3;
date        client   category       f0_
2024-10-01  desktop  Blogs          4570853
2024-10-01  desktop  CMS            4570853
2024-10-01  desktop  Miscellaneous  1
2024-10-01  mobile   Blogs          5853962
2024-10-01  mobile   CMS            5853962
2024-10-01  mobile   Miscellaneous  1
2024-11-01  desktop  Blogs          4555965
2024-11-01  desktop  CMS            4555965
2024-11-01  mobile   Blogs          5785472
2024-11-01  mobile   CMS            5785472
2024-12-01  desktop  Blogs          386179
2024-12-01  desktop  CMS            386179
2024-12-01  mobile   Blogs          382591
2024-12-01  mobile   CMS            382591
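
To put a number on the drop, here is a minimal Python sketch that compares each crawl with the previous one. The counts are the CMS rows copied from the output above; the `counts` dictionary and the `ratio_to_previous` helper are names made up for this illustration and are not part of any HTTP Archive tooling.

```python
# Distinct root_page counts for WordPress (category CMS),
# copied from the query output above.
counts = {
    ("2024-10-01", "desktop"): 4570853,
    ("2024-10-01", "mobile"): 5853962,
    ("2024-11-01", "desktop"): 4555965,
    ("2024-11-01", "mobile"): 5785472,
    ("2024-12-01", "desktop"): 386179,
    ("2024-12-01", "mobile"): 382591,
}


def ratio_to_previous(counts, client, date, previous_date):
    """Return the count for `date` as a fraction of the count for `previous_date`."""
    return counts[(date, client)] / counts[(previous_date, client)]


for client in ("desktop", "mobile"):
    nov = ratio_to_previous(counts, client, "2024-11-01", "2024-10-01")
    dec = ratio_to_previous(counts, client, "2024-12-01", "2024-11-01")
    print(f"{client}: Nov/Oct = {nov:.1%}, Dec/Nov = {dec:.1%}")
```

With these numbers, November stays within about 1% of October on both clients, while December retains less than a tenth of the November pages (roughly 8.5% on desktop and 6.6% on mobile).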

@pmeenan do you have an idea?
Any way to restore?

@pmeenan
Member

pmeenan commented Jan 9, 2025

Likely a result of this.

I can revert it and take another run at the change after looking more closely at why it didn't work as expected (maybe the inferred technologies are not getting picked up).

If the page JSON payload still contains _detected_apps and _detected, we may be able to reconstruct the data; if they are being stripped out, we won't be able to.

@max-ostapenko
Author

max-ostapenko commented Jan 10, 2025

We stripped all the duplicates...

Please add any additional objects to the payload so that the changes can be compared.
Then we can look at them on staging and choose the best version for production.
