"Data model" page is too long #126053

barneygale · 2024-10-27T18:19:01Z

Documentation

The Data model document is very long, and as a result it basically never shows up in search engine results, because 90% of the page is considered irrelevant for any query like "python __hash__".

I suggest we split it up by top-level topic, e.g. we add a dedicated page for "Special method names".

See also #126052

rhettinger · 2024-10-28T03:14:31Z

Try not to break all the external links going into the pages. We don't want to invalidate all the references from blogs, tweets, stackoverflow answers, etc.

With regard to search engine results, I don't think we can or should engage in SEO. There is no promise that rearrangements will lead to being a top hit for a search.

barneygale · 2024-10-28T14:55:07Z

> Try not to break all the external links going into the pages. We don't want to invalidate all the references from blogs, tweets, stackoverflow answers, etc.

Presumably this is impossible, right @picnixz?

ncoghlan · 2024-10-28T15:07:58Z

My suggestion for refactoring these large pages while mitigating the damage to existing deep links was to:

- Make the existing page name an orphan that exists solely as a navigation page to get from stale deep links to updated semantic references
- Ensure the new subfolder name can exist in parallel with the old file name (e.g. data_model/ in this case)

The damage to existing deep links that can't (or won't) be changed is still a good reason to tread carefully, but never being able to split pages as they grow over time isn't a great situation either.

For more background on why we should preserve link integrity as much as we can, the World Wide Web Consortium has a decent page here on why "Cool URIs Don't Change": https://www.w3.org/Provider/Style/URI

picnixz · 2024-10-28T15:15:59Z

> Presumably this is impossible, right

Mmmh. It could be possible actually but this would require a custom Sphinx extension and custom redirection ~~at the nginx / apache level~~ where old URLs would redirect to new ones (the Sphinx extension will be used to extract the mapping). It's also a bit of a hacky solution but I don't have a better alternative (a pure Sphinx solution may not be possible because we don't want a dead link if an article cites something like https://docs.python.org/3/reference/datamodel.html#numbers-number; auto-generated doc using :class:`numbers.Number` would be fine since the intersphinx inventory would be updated but raw links won't).

If you want to improve SEO, isn't there a way to indicate in an HTML document that this or that text is more important than something else (e.g., with some aria label or whatever HTML feature we may have)?

More generally, if you want to split the HTML, it's more of a server-side issue rather than a Sphinx issue (where the server would redirect to the appropriate page). So some redirect rules will need to be rewritten (and I don't know how much it could slow down the entire docs website).

Alyssa's suggestion on having a page serving as a hub is possible but it will be a bit ugly (because we still need to make all possible anchors available on that page so that users can re-click on them to have the expanded content).

barneygale · 2024-10-28T15:24:05Z

> Alyssa's suggestion on having a page serving as a hub is possible but it will be a bit ugly (because we still need to make all possible anchors available on that page so that users can re-click on them to have the expanded content).

Could the Sphinx extension glue together several pages to form datamodel.html? It would resemble the existing page (perhaps with a small amount of jankiness), but it would be an "orphan" page with no incoming links from the rest of the Python docs. At the top we could add a banner:

> The Python data model documentation has been split into several chapters. This page combines those chapters into a single document; it exists solely to keep existing links working.

ncoghlan · 2024-10-28T15:28:43Z

The original Py2-as-default -> Py3-as-default switch in https://peps.python.org/pep-0430/ was certainly all server-side redirect config. And yeah, I agree the orphaned navigation page isn't a good solution, it's just a better option than leaving people with either a 404 or an unanchored link to the start of a page with less inline content.

Unfortunately, web server rewrite rules can't help us here, as the anchor tag part is never sent to the server - it's handled by the browser after downloading the page. HTTP redirects don't help either, as they also operate at the page level.
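
A minimal illustration of this point, using Python's standard urllib.parse and the example URL cited earlier in the thread. It is only a sketch: it shows that the fragment is a separate URL component and is not part of the path-plus-query request target that reaches the web server.

```python
from urllib.parse import urlsplit

url = "https://docs.python.org/3/reference/datamodel.html#numbers-number"
parts = urlsplit(url)

# The request target sent to the server is the path plus any query string.
request_target = parts.path + (f"?{parts.query}" if parts.query else "")

print(request_target)  # the part a rewrite rule can match on
print(parts.fragment)  # the part that stays in the browser
```

Because `numbers-number` never appears in the request, a rewrite rule or HTTP redirect has nothing to match on; only code running in the browser can see it.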

It should be possible to do something clever with client side JavaScript: https://stackoverflow.com/questions/1305211/javascript-to-redirect-from-anchor-to-a-separate-page (and that could potentially be extended further to handle smaller cases like the deep links I recently broke by moving the Py_Main C API docs to a different page in #78387).

picnixz · 2024-10-28T15:29:48Z

> Could the Sphinx extension glue together several pages to form datamodel.html

If you're worried about the length of datamodel.rst, then you can do it natively using .. include:: directives.

Ah yes, I'd forgotten about the redirection using JS. I was confused because I was actually thinking about server-side rendering. Using JS can be integrated into Sphinx directly (IIRC).

hugovk · 2024-10-28T15:31:12Z

> If you want to improve SEO, isn't there a way to indicate in an HTML document that this or that text is more important than something else (e.g., with some aria label or whatever HTML feature we may have)?

We've no way of knowing which of the 18k words (or 25k in #126052) is the important text that any given visitor is interested in. That's why more granular pages will help.

ncoghlan · 2024-10-28T15:52:47Z

(We may want to break out a separate pre-requisite issue for this, but continuing here for now)

Summarising what a potential solution would need to offer in order to allow moving link targets between pages, or making other changes (like updating section headings), without breaking deep links to those anchors:

- a way to essentially do an "anchor diff" between two versions of a set of docs to find anchors and pages which used to exist but will no longer resolve (for example, define https://docs.python.org/dev/ as the reference docs for main, and compare each new build to those. It might be sufficient to use the existing intersphinx inventory as the basis for comparison)
- a way to map removed anchors on affected pages to new targets (targets should be Sphinx semantic references)
- when a page has an anchor map defined, inject the client side JS to intercept stale links and generate the relevant JS redirect request (if the page has no anchor map, there's no need to inject that JS snippet)
- a docs CI check that fails if anchors are removed relative to the baseline docs without an anchor remap entry being defined
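
The CI check in the last item could be sketched roughly as follows. This is an assumption-laden illustration, not an existing tool: the function name `unmapped_removed_anchors`, the shape of the anchor map (old URI to a label name), and the sample data (including the post-split page path and the label) are all hypothetical.

```python
def unmapped_removed_anchors(baseline, current, anchor_map):
    """Return anchors that vanished relative to the baseline and have no remap entry."""
    removed = set(baseline) - set(current)
    # set(anchor_map) is the set of old anchors that do have a remap entry.
    return sorted(removed - set(anchor_map))


# Hypothetical data: two anchors disappear, but only one has been remapped.
baseline = {
    "reference/datamodel.html#numbers-number",
    "reference/datamodel.html#set-types",
}
current = {"reference/datamodel/types.html#set-types"}  # made-up post-split layout
anchor_map = {"reference/datamodel.html#numbers-number": "hypothetical-numbers-label"}

missing = unmapped_removed_anchors(baseline, current, anchor_map)
for uri in missing:
    print(f"removed without remap: {uri}")
```

A CI job would exit with a non-zero status whenever `missing` is non-empty, which forces every layout change to come with its remap entries.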

This is still @picnixz's "custom Sphinx extension" idea, just with a better idea of what that extension would need to offer to enable docs refactoring without worrying about breaking existing deep links. If this existed, my orphaned navigation hub idea wouldn't be needed.

barneygale · 2024-10-29T19:49:18Z

I like the idea of using the intersphinx data. Here's a script that uses sphobjinv to print links that have died in the 3.14 docs:

from sphobjinv.inventory import Inventory


def load(url):
    # Download the intersphinx inventory and collect every object's full URI.
    inv = Inventory(url=url)
    return {obj.uri_expanded for obj in inv.objects}


old_urls = load('https://docs.python.org/3.13/objects.inv')
new_urls = load('https://docs.python.org/3.14/objects.inv')
# URIs that resolved in 3.13 but are absent from 3.14.
dead_urls = old_urls - new_urls

for url in sorted(dead_urls):
    print(url)

Current output

library/asyncio-policy.html#asyncio-watchers
library/asyncio-policy.html#asyncio.AbstractChildWatcher
library/asyncio-policy.html#asyncio.AbstractChildWatcher.add_child_handler
library/asyncio-policy.html#asyncio.AbstractChildWatcher.attach_loop
library/asyncio-policy.html#asyncio.AbstractChildWatcher.close
library/asyncio-policy.html#asyncio.AbstractChildWatcher.is_active
library/asyncio-policy.html#asyncio.AbstractChildWatcher.remove_child_handler
library/asyncio-policy.html#asyncio.AbstractEventLoopPolicy.get_child_watcher
library/asyncio-policy.html#asyncio.AbstractEventLoopPolicy.set_child_watcher
library/asyncio-policy.html#asyncio.FastChildWatcher
library/asyncio-policy.html#asyncio.MultiLoopChildWatcher
library/asyncio-policy.html#asyncio.PidfdChildWatcher
library/asyncio-policy.html#asyncio.SafeChildWatcher
library/asyncio-policy.html#asyncio.ThreadedChildWatcher
library/asyncio-policy.html#asyncio.get_child_watcher
library/asyncio-policy.html#asyncio.set_child_watcher
library/collections.abc.html#collections.abc.ByteString
library/dis.html#opcode-BEFORE_ASYNC_WITH
library/dis.html#opcode-BEFORE_WITH
library/dis.html#opcode-BUILD_CONST_KEY_MAP
library/dis.html#opcode-LOAD_ASSERTION_ERROR
library/dis.html#opcode-RETURN_CONST
library/json.html#cmdoption-json.tool-arg-infile
library/json.html#cmdoption-json.tool-arg-outfile
library/json.html#cmdoption-json.tool-h
library/json.html#cmdoption-json.tool-indent
library/json.html#cmdoption-json.tool-json-lines
library/json.html#cmdoption-json.tool-no-ensure-ascii
library/json.html#cmdoption-json.tool-sort-keys
library/sqlite3.html#sqlite3.version
library/sqlite3.html#sqlite3.version_info
library/subprocess.html#disable-vfork
library/typing.html#typing.ByteString
using/configure.html#cmdoption-without-freelists

barneygale · 2024-10-29T20:14:29Z

A very basic solution might be to redirect users to search.html, and supply the URL fragment as the search query. This would work OK for terms and python references, but not heading permalinks.
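
A rough sketch of that mapping, written in Python only to show the URL transformation (a real redirect would have to run in the browser). It assumes the search page reads its query from a `q` parameter and lives at the root of each versioned docs tree; the helper name `search_fallback` and the `docs_root` argument are hypothetical.

```python
from urllib.parse import urlencode, urlsplit


def search_fallback(docs_root, dead_url):
    """Turn a dead deep link into a search URL that uses the fragment as the query."""
    fragment = urlsplit(dead_url).fragment
    if not fragment:
        return dead_url  # nothing to search for; leave the link alone
    return f"{docs_root}/search.html?{urlencode({'q': fragment})}"


root = "https://docs.python.org/3.14"
print(search_fallback(root, root + "/library/sqlite3.html#sqlite3.version"))
print(search_fallback(root, root + "/library/asyncio-policy.html#asyncio-watchers"))
```

The first query is a dotted Python name, which a search can plausibly find; the second is a heading permalink slug, which is the case that works less well.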

rhettinger · 2024-10-29T21:26:09Z

@nedbat Does the docs WG want to take a position with regard to docs stability versus refactoring into smaller chunks in hopes that SEO will be improved?

JelleZijlstra · 2024-10-30T00:05:40Z

I think this should be motivated not just by SEO, but also by improving the usability of the docs. It's a very large file that covers a lot of ground, and the way it's organized isn't necessarily the best. That may be bad for SEO, but it's also not ideal for human readers.

Currently the file has not just a discussion of Python's general "data model", the way data is represented, but also detailed documentation about some precise types, such as code objects. That documentation might fit better at https://docs.python.org/3/library/types.html#types.CodeType, so the data model page can focus more on behavior of the core language. Similarly, the data model page has discussion of numbers.Number and similar classes, which feels a bit out of place, as those are library ABCs, not core parts of the language. On the other hand, memoryview, a builtin, isn't mentioned as part of the "standard type hierarchy". Some of the file also duplicates the stdtypes page: compare https://docs.python.org/3/reference/datamodel.html#set-types and https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset.

I also agree we should avoid breaking links. If we want to be very strict in this, we could build some tooling that records e.g. all anchor targets in an old version of the docs and asserts they continue to work.
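
One way such tooling might record anchor targets is to collect every `id` attribute from the built HTML and compare old against new. The sketch below uses only the standard library; the class and function names and the two tiny HTML snippets are made up for illustration, and a real check would walk the whole build output.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the value of every id attribute seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.anchors.add(value)


def collect_anchors(html):
    collector = AnchorCollector()
    collector.feed(html)
    return collector.anchors


# Made-up page snippets standing in for an old and a new build of the same page.
old_page = '<section id="set-types"><h2>Set types</h2></section><dl><dt id="object.__hash__">hash</dt></dl>'
new_page = '<dl><dt id="object.__hash__">hash</dt></dl>'

lost = collect_anchors(old_page) - collect_anchors(new_page)
print(sorted(lost))
```

Unlike the intersphinx inventory, this approach also sees heading permalinks and any other element that carries an `id`.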

willingc · 2024-10-30T00:05:57Z

A few general considerations on splitting up a long page in the Language Reference (which this is). I'm speaking from my perspective and not for the entire @python/editorial-board. I would urge us to be more conservative with the Language Reference docs than the Library docs since it is the definition of the Python language.

- User experience and discoverability are more important than SEO.
- SEO is important, but we should have a plan of how we will use the SEO information.
- Like @hugovk mentions, changes, if made, would need to account for orphaned pages to keep existing links working.

willingc · 2024-10-30T00:08:18Z

> Currently the file has not just a discussion of Python's general "data model", the way data is represented, but also detailed documentation about some precise types, such as code objects. That documentation might fit better at https://docs.python.org/3/library/types.html#types.CodeType, so the data model page can focus more on behavior of the core language.

@JelleZijlstra's example is in line with my thinking when it comes to Language Reference changes vs. Library Doc changes.

barneygale · 2024-10-30T00:25:54Z

> User experience and discoverability are more important than SEO.

To be clear, UX and discoverability are the entire reason I care about SEO here!

willingc · 2024-10-30T00:37:46Z

> To be clear, UX and discoverability are the entire reason I care about SEO here!

I understand your intent. To restate, if improvements to SEO negatively impact UX and discoverability, we should pass until the negatives are mitigated. As an aside, the exclamation point wasn't necessary in the earlier response.

barneygale · 2024-10-30T00:43:52Z

Sorry!

nedbat · 2024-10-30T01:30:16Z

I think the page is too long, and would improve both UX and SEO to be split up. It sounds like there is probably a way to reasonably preserve old links, though that still needs some investigation. It's a big job that should be done with care.

ncoghlan · 2024-10-30T07:59:53Z

As there seems to be consensus that a technical improvement around preserving deep links is needed before we embark on any major layout changes, I filed that request as a docsbuild-scripts issue: python/docs-community#134 (even if using the technical solution ends up being a CPython change, creating that solution seemed more like a docs build question to me).

nedbat · 2024-10-30T10:31:08Z

Another good first step is making a concrete proposal about how the page would be split up. I know from my own work on the devguide that it's easy to look through an existing document and be certain that it could be reshaped into something better. When you actually sit down to do the reshaping, difficulties arise, decisions have to be made, and so on. Does someone want to write a doc somewhere that shows how a split page would be structured?

picnixz · 2024-10-30T12:03:27Z

My first impression is: split them by classes first. They are good on their own IMO. And the classes can then be regrouped by topic (e.g. strings, numerics, collections, etc). I can sketch a rough idea if you want (maybe by the end of the afternoon).

ncoghlan · 2024-10-30T13:37:12Z

I'm not sure about the Data Model page, but @nedbat's question prompted me to add a draft split for the builtin types page in #126052 (comment) (giving str its own page would also mean we could finally move the details of the format string syntax out of the string module docs).

willingc · 2024-10-30T14:45:58Z

Perhaps the most conservative first iteration after getting the linking resolved would be to split the doc where there are natural breaks: 3.1, 3.2, 3.3 and 3.4. This will keep familiarity initially, and it does not preclude us from further splitting classes and 3.2 in future iterations.

barneygale added the docs Documentation in the Doc dir label Oct 27, 2024

barneygale mentioned this issue Oct 27, 2024

"Built-in Types" page is too long #126052

Open


ncoghlan mentioned this issue Oct 30, 2024

Preserve documentation deep links across layout changes python/docs-community#134

Open

willingc added this to docs issues Oct 31, 2024

github-project-automation bot moved this to Todo in docs issues Oct 31, 2024

picnixz mentioned this issue Dec 2, 2024

Improve the documentation of PEP 495 features without referencing the PEP #101235

Open
