Classification of the URLs in HA dataset #868
Hello all, thanks @nrllh for this summary and for the resources provided to classify these hostnames.
I would then release a second table listing the corrections you have started to map manually (eTLD+1 with corresponding topics). That way, we can do as you describe: write a joint SQL query/UDF that returns the (corrected) classification that HTTP Archive would like to use.
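To make the "joint query" idea concrete, here is a minimal BigQuery sketch that overrides a model classification with a manual correction keyed by eTLD+1. The CTE names `classifications` and `corrections`, their columns, the sample hostname, and the sample category values are all assumptions for illustration; they are not the schema of any real table in this thread.

```sql
-- Hypothetical inline data standing in for the two tables discussed above.
WITH classifications AS (
  -- Model output per hostname (sample values only).
  SELECT 'abc123.googlesyndication.com' AS hostname, '/News' AS full_category
  UNION ALL
  SELECT 'www.npr.org', '/News'
),
corrections AS (
  -- Manually curated eTLD+1 -> topic mapping (sample values only).
  SELECT
    'googlesyndication.com' AS domain,
    '/Internet & Telecom/Mobile Phones' AS full_category
)
SELECT
  c.hostname,
  -- Prefer the manual correction when one exists for the hostname's domain.
  COALESCE(x.full_category, c.full_category) AS full_category,
  x.domain IS NOT NULL AS corrected
FROM classifications AS c
LEFT JOIN corrections AS x
  ON x.domain = NET.REG_DOMAIN(c.hostname)
ORDER BY c.hostname
```

The `LEFT JOIN` keeps every classified hostname, and `COALESCE` only falls back to the model's category when no correction row matches.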
Spoke with @pmeenan about this and for now we're hesitant to add this directly into the crawl for maintainability reasons.

For unblocking the Web Almanac analysis, we could either host the classifications directly in HTTP Archive's BigQuery project, or if @yohhaan is willing to host it, he'd just need to open up public access to the existing table. My preference would be to host it on HTTP Archive so that we'd absorb any storage costs / maintenance burden. WDYT?

If we do store the unprocessed classifications, I like the idea of using a persistent UDF, maybe something like

Longer term, we could re-explore classifying as part of the crawl. We should also re-explore classifying hosts using BigQuery ML, which would let us analyze older crawls.
I processed the hosts from the last crawl as well. We now have what we all need. I've also spoken with @yohhaan, and we would like to store the data in HA's project. You can access the data via:
The table consists of 5 columns:
I also created a UDF to retrieve the category for a given host. The function has two parameters: input_host and row_results. The row_results parameter addresses our discussion about hashed subdomains. When row_results is set to false, the function checks whether the host's domain is among the top 50 domains known for hashed subdomains. If so, it returns the category of the domain instead of the hostname. The function returns an array with one entry per category, each containing the host, category ID, full category, subcategory, and parent category. We may also want to retrieve only the full category of a hostname. Note that some hostnames have multiple categories, which is why the function returns an array rather than a single value.
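Retrieving only the full categories could look roughly like the sketch below. It does not call the UDF. Instead it uses an inline array with the same shape as the function's return type, so the CTE name `results` and the alias `full_categories` are assumptions, and the sample values are taken from the npr.org example in this thread.

```sql
-- Inline stand-in for the array of structs that the UDF returns.
WITH results AS (
  SELECT [
    STRUCT(
      'npr.org' AS host,
      23 AS category_id,
      '/Arts & Entertainment/Music & Audio' AS full_category
    ),
    STRUCT(
      'npr.org' AS host,
      243 AS category_id,
      '/News' AS full_category
    )
  ] AS categories
)
SELECT
  -- Collapse the structs to a plain array of category paths.
  ARRAY(
    SELECT c.full_category
    FROM UNNEST(categories) AS c
    ORDER BY c.full_category
  ) AS full_categories
FROM results
```

Keeping the result as an array preserves the case where one hostname maps to several categories.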
This is great, thanks! I've cloned the data into the table created by this statement:
CREATE OR REPLACE TABLE `httparchive.urls.categories` AS
SELECT * FROM `misc-researches.ha.ha_host_categories`
I've also saved the persistent UDF. Here's the query:
CREATE OR REPLACE FUNCTION `httparchive.fn.GET_HOST_CATEGORIES`(
input_host STRING,
row_results BOOL
) RETURNS ARRAY<STRUCT<
host STRING,
category_id INT64,
full_category STRING,
subcategory STRING,
parent_category STRING
>> AS (
(
WITH
host_to_query AS (
SELECT
CASE
WHEN NOT row_results AND net.reg_domain(input_host) IN (
'googlesyndication.com', 'gstatic.com', 'imrworldwide.com', 'cloudfront.net', 'upravel.com',
'adsco.re', 'fastly-insights.com', 'cedexis-radar.net', 'beeline.ru', 'quora.com',
'online-metrix.net', 'ampproject.net', 'bumlam.com', 'forter.com', 'googleusercontent.com',
'dropboxusercontent.com', 'yahoodns.net', 'hinet.net', 'alibaba.com', 'tumblr.com',
'amazonaws.com', 'googleadservices.com', 'akamaihd.net', 'filesusr.com', 'dotnxdomain.net',
'nitrocdn.com', 'doubleclick.net', 'disqus.com', 'business.site', 'softonic.com', 'sensic.net',
'zendesk.com', 'stbid.ru', 'uptodown.com', 'wpengine.com', 'dca0.com', 'onef.pro',
'netdna-ssl.com', 'secureserver.net', 'bandcamp.com', 'parsely.com', 'editmysite.com',
'footprintdns.com', 'ioam.de', 'ridge1.com', 'optimole.com', 'whiteboxdigital.ru',
'sentry.io', 'mysimplestore.com', 'wix-code.com', 'smushcdn.com'
) THEN net.reg_domain(input_host)
ELSE input_host
END AS query_host
),
-- Retrieve the categories
categories_data AS (
SELECT
hostname,
category_id,
full_category,
subcategory,
parent_category
FROM
`httparchive.urls.categories`
WHERE
hostname = (SELECT query_host FROM host_to_query)
)
SELECT
ARRAY_AGG(STRUCT(
hostname AS host,
category_id,
full_category,
subcategory,
parent_category
))
FROM
categories_data
)
);
Example usage:
SELECT
*
FROM
UNNEST(httparchive.fn.GET_HOST_CATEGORIES('npr.org', TRUE))
Results:
[{
"host": "npr.org",
"category_id": "23",
"full_category": "/Arts \u0026 Entertainment/Music \u0026 Audio",
"subcategory": "Music \u0026 Audio",
"parent_category": "/Arts \u0026 Entertainment/"
}, {
"host": "npr.org",
"category_id": "243",
"full_category": "/News",
"subcategory": "News",
"parent_category": "/News"
}]
@nrllh I'm not so sure about the
Thank you. Changes made:
I'll prepare a document for har.fyi so we can announce it. Please let me know once you have updated the function.
Updated!
SELECT
*
FROM
UNNEST(httparchive.fn.GET_HOST_CATEGORIES('apple.com'))
Results:
[{
"host": "apple.com",
"category_id": "528",
"full_category": "/Internet \u0026 Telecom/Mobile Phones",
"subcategory": "Mobile Phones",
"parent_category": "/Internet \u0026 Telecom/"
}, {
"host": "apple.com",
"category_id": "129",
"full_category": "/Computers \u0026 Electronics/Consumer Electronics",
"subcategory": "Consumer Electronics",
"parent_category": "/Computers \u0026 Electronics/"
}]
Hi all,
As previously announced in Slack, we wanted to classify the URLs, and we hope to have this solved soon. We classified over 110M distinct hostnames. In this issue, I want to give you an overview of the method we applied and raise some points for discussion.
How we classified and the results
We began by extracting distinct hostnames from all requests and extended the data with the domain name (eTLD+1) of these hostnames. This gave us a total of 110M distinct hostnames to classify. Using @yohhaan’s repository, which leverages the Google Topics API, we managed to classify 89% of the hostnames. 11% could not be classified due to the Topics API excluding adult sites (e.g., gambling). The dataset is currently hosted in my private project and can be queried:
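Separately from the hosted dataset, the extraction step described above (distinct hostnames plus their eTLD+1) can be sketched in BigQuery as follows. This is not the query we actually ran. The `requests` CTE and its sample URLs are made up for illustration, and the real input would be the request URLs from the crawl.

```sql
-- Hypothetical sample of request URLs; the real input is the crawl's requests.
WITH requests AS (
  SELECT url
  FROM UNNEST([
    'https://www.npr.org/sections/news/',
    'https://www.npr.org/music/',
    'https://abc123.googlesyndication.com/ad.js'
  ]) AS url
)
SELECT DISTINCT
  NET.HOST(url) AS hostname,      -- host part of the URL
  NET.REG_DOMAIN(url) AS domain   -- registrable domain (eTLD+1)
FROM requests
ORDER BY hostname
```

`SELECT DISTINCT` deduplicates repeated requests to the same host, so each hostname is classified only once.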
Points to discuss
The model's classification for these hostnames is very inconsistent, likely due to the hash values. Therefore, we should assign the classification of googlesyndication.com to all its hostnames. This issue also exists for other popular providers. I have created a sheet and manually assigned which popular sites with hashed subdomains should use the classification of the domain instead of the subdomain: Classification of sites with many subdomains.

Many thanks to @yohhaan for his contributions throughout the entire process! He adjusted his repository and ran the scripts for the classification.