"Data model" page is too long #126053

Open
barneygale opened this issue Oct 27, 2024 · 25 comments
Labels
docs Documentation in the Doc dir

Comments

@barneygale
Contributor

barneygale commented Oct 27, 2024

Documentation

The Data model document is very long, and as a result it basically never shows up in search engine results, because 90% of the page is considered irrelevant for any query like "python __hash__".

I suggest we split it up by top-level topic, e.g. we add a dedicated page for "Special method names".

See also #126052

@barneygale barneygale added the docs Documentation in the Doc dir label Oct 27, 2024
@rhettinger
Contributor

rhettinger commented Oct 28, 2024

Try not to break all the external links going into the pages. We don't want to invalidate all the references from blogs, tweets, stackoverflow answers, etc.

With regard to search engine results, I don't think we can or should engage in SEO. There is no promise that rearrangements will lead to being a top hit for a search.

@barneygale
Contributor Author

> Try not to break all the external links going into the pages. We don't want to invalidate all the references from blogs, tweets, stackoverflow answers, etc.

Presumably this is impossible, right @picnixz?

@ncoghlan
Contributor

My suggestion for refactoring these large pages while mitigating the damage to existing deep links was to:

  1. Make the existing page name an orphan that exists solely as a navigation page to get from stale deep links to updated semantic references
  2. Ensure the new subfolder name can exist in parallel with the old file name (e.g. data_model/ in this case)

The damage to existing deep links that can't (or won't) be changed is still a good reason to tread carefully, but never being able to split pages as they grow over time isn't a great situation either.

For more background on why we should preserve link integrity as much as we can, the World Wide Web Consortium has a decent page here on why "Cool URIs Don't Change": https://www.w3.org/Provider/Style/URI

@picnixz
Contributor

picnixz commented Oct 28, 2024

> Presumably this is impossible, right

Mmmh. It could be possible actually but this would require a custom Sphinx extension and custom redirection at the nginx / apache level where old URLs would redirect to new ones (the Sphinx extension will be used to extract the mapping). It's also a bit of a hacky solution but I don't have a better alternative (a pure Sphinx solution may not be possible because we don't want a dead link if an article cites something like https://docs.python.org/3/reference/datamodel.html#numbers-number; auto-generated doc using :class:`numbers.Number` would be fine since the intersphinx inventory would be updated but raw links won't).

If you want to improve SEO, isn't there a way to indicate in an HTML document that this or that text is more important than something else (e.g., with some aria label or whatever HTML feature we may have)?

More generally, if you want to split the HTML, it's more of a server-side issue rather than a Sphinx issue (where the server would redirect to the appropriate page). So some redirect rules will need to be rewritten (and I don't know how much it could slow down the entire docs website).


Alyssa's suggestion on having a page serving as a hub is possible but it will be a bit ugly (because we still need to make all possible anchors available on that page so that users can re-click on them to have the expanded content).

@barneygale
Contributor Author

barneygale commented Oct 28, 2024

> Alyssa's suggestion on having a page serving as a hub is possible but it will be a bit ugly (because we still need to make all possible anchors available on that page so that users can re-click on them to have the expanded content).

Could the Sphinx extension glue together several pages to form datamodel.html? It would resemble the existing page (perhaps with a small amount of jankiness), but it would be an "orphan" page with no incoming links from the rest of the Python docs. At the top we could add a banner:

The Python data model documentation has been split into several chapters. This page combines those chapters into a single document; it exists solely to keep existing links working.

@ncoghlan
Contributor

The original Py2-as-default -> Py3-as-default transition in https://peps.python.org/pep-0430/ was certainly all server-side redirect config. And yeah, I agree the orphaned navigation page isn't a good solution, it's just a better option than leaving people with either a 404 or an unanchored link to the start of a page with less inline content.

Unfortunately, web server rewrite rules can't help us here, as the anchor tag part is never sent to the server - it's handled by the browser after downloading the page. HTTP redirects don't help either, as they also operate at the page level.
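
The point about fragments can be checked with the standard library. This is a minimal sketch, assuming nothing beyond `urllib.parse`; the helper name `request_target` is hypothetical and only models what a client places on the HTTP request line:

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (plus query) a client would send to the server.

    The fragment is deliberately left out: it never leaves the browser.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


url = "https://docs.python.org/3/reference/datamodel.html#numbers-number"
print(request_target(url))     # /3/reference/datamodel.html
print(urlsplit(url).fragment)  # numbers-number
```

Because the server only ever sees the first value, a rewrite rule has no way to tell which anchor the reader wanted.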

It should be possible to do something clever with client side JavaScript: https://stackoverflow.com/questions/1305211/javascript-to-redirect-from-anchor-to-a-separate-page (and that could potentially be extended further to handle smaller cases like the deep links I recently broke by moving the Py_Main C API docs to a different page in #78387).

@picnixz
Contributor

picnixz commented Oct 28, 2024

> Could the Sphinx extension glue together several pages to form datamodel.html

If you're worried about the length of datamodel.rst, then you can do it natively using .. include:: directives.


Ah yes, I had forgotten about the redirection using JS. I was confused because I was actually thinking about server-side rendering. Using JS can be integrated in Sphinx directly (IIRC).

@hugovk
Member

hugovk commented Oct 28, 2024

> If you want to improve SEO, isn't there a way to indicate in an HTML document that this or that text is more important than something else (e.g., with some aria label or whatever HTML feature we may have)?

We've no way of knowing which of the 18k words (or 25k in #126052) is the important text that any given visitor is interested in. That's why more granular pages will help.

@ncoghlan
Contributor

(We may want to break out a separate pre-requisite issue for this, but continuing here for now)

Summarising what a potential solution would need in order to allow moving link targets between pages, or making other changes (like updating section headings), without breaking deep links to those anchors:

  • a way to essentially do an "anchor diff" between two versions of a set of docs to find anchors and pages which used to exist but will no longer resolve (for example, define https://docs.python.org/dev/ as the reference docs for main, and compare each new build to those. It might be sufficient to use the existing intersphinx inventory as the basis for comparison)
  • a way to map removed anchors on affected pages to new targets (targets should be Sphinx semantic references)
  • when a page has an anchor map defined, inject the client side JS to intercept stale links and generate the relevant JS redirect request (if the page has no anchor map, there's no need to inject that JS snippet)
  • a docs CI check that fails if anchors are removed relative to the baseline docs without an anchor remap entry being defined

This is still @picnixz's "custom Sphinx extension" idea, just with a better idea of what that extension would need to offer to enable docs refactoring without worrying about breaking existing deep links. If this existed, my orphaned navigation hub idea wouldn't be needed.
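
The diff, remap and CI steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `reference/specialnames.html` page and the use of plain URIs as remap targets (rather than Sphinx semantic references) are all hypothetical simplifications, not an existing extension.

```python
def find_unmapped_anchors(baseline, current, remap):
    """Return baseline anchors that vanished without a remap entry."""
    removed = set(baseline) - set(current)
    return sorted(removed - set(remap))


def resolve(uri, current, remap):
    """Model the client-side lookup: only stale links consult the remap."""
    if uri in current:
        return uri
    return remap.get(uri)  # None means the link is dead


# Hypothetical data: the new page name is invented for illustration.
baseline = {
    "reference/datamodel.html#object.__hash__",
    "reference/datamodel.html#numbers-number",
}
current = {"reference/specialnames.html#object.__hash__"}
remap = {
    "reference/datamodel.html#object.__hash__":
        "reference/specialnames.html#object.__hash__",
}

missing = find_unmapped_anchors(baseline, current, remap)
print(missing)  # ['reference/datamodel.html#numbers-number']

# A CI job could turn a non-empty result into a failing exit status.
exit_code = 1 if missing else 0
```

Keeping the check separate from the lookup means the CI step needs only the two inventories and the remap file, not a browser.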

@barneygale
Contributor Author

I like the idea of using the intersphinx data. Here's a script that uses sphobjinv to print links that have died in the 3.14 docs:

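from sphobjinv.inventory import Inventory


def load(url):
    # Download and parse the intersphinx inventory (objects.inv) at `url`.
    inv = Inventory(url=url)
    # uri_expanded is the relative link (page plus #anchor) for each object.
    return {obj.uri_expanded for obj in inv.objects}


old_urls = load('https://docs.python.org/3.13/objects.inv')
new_urls = load('https://docs.python.org/3.14/objects.inv')
# Links present in the 3.13 inventory but absent from 3.14 are dead.
dead_urls = old_urls - new_urls

for url in sorted(dead_urls):
    print(url)

Current output: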
library/asyncio-policy.html#asyncio-watchers
library/asyncio-policy.html#asyncio.AbstractChildWatcher
library/asyncio-policy.html#asyncio.AbstractChildWatcher.add_child_handler
library/asyncio-policy.html#asyncio.AbstractChildWatcher.attach_loop
library/asyncio-policy.html#asyncio.AbstractChildWatcher.close
library/asyncio-policy.html#asyncio.AbstractChildWatcher.is_active
library/asyncio-policy.html#asyncio.AbstractChildWatcher.remove_child_handler
library/asyncio-policy.html#asyncio.AbstractEventLoopPolicy.get_child_watcher
library/asyncio-policy.html#asyncio.AbstractEventLoopPolicy.set_child_watcher
library/asyncio-policy.html#asyncio.FastChildWatcher
library/asyncio-policy.html#asyncio.MultiLoopChildWatcher
library/asyncio-policy.html#asyncio.PidfdChildWatcher
library/asyncio-policy.html#asyncio.SafeChildWatcher
library/asyncio-policy.html#asyncio.ThreadedChildWatcher
library/asyncio-policy.html#asyncio.get_child_watcher
library/asyncio-policy.html#asyncio.set_child_watcher
library/collections.abc.html#collections.abc.ByteString
library/dis.html#opcode-BEFORE_ASYNC_WITH
library/dis.html#opcode-BEFORE_WITH
library/dis.html#opcode-BUILD_CONST_KEY_MAP
library/dis.html#opcode-LOAD_ASSERTION_ERROR
library/dis.html#opcode-RETURN_CONST
library/json.html#cmdoption-json.tool-arg-infile
library/json.html#cmdoption-json.tool-arg-outfile
library/json.html#cmdoption-json.tool-h
library/json.html#cmdoption-json.tool-indent
library/json.html#cmdoption-json.tool-json-lines
library/json.html#cmdoption-json.tool-no-ensure-ascii
library/json.html#cmdoption-json.tool-sort-keys
library/sqlite3.html#sqlite3.version
library/sqlite3.html#sqlite3.version_info
library/subprocess.html#disable-vfork
library/typing.html#typing.ByteString
using/configure.html#cmdoption-without-freelists
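
To see which pages would need an anchor map, the dead links can be grouped by page. A small sketch, assuming the input is a list of relative URIs like the ones printed above; `group_by_page` is a hypothetical helper, and the sample is a hand-picked subset of that output:

```python
from collections import defaultdict
from urllib.parse import urldefrag


def group_by_page(uris):
    """Map each page to the sorted list of fragments that died on it."""
    pages = defaultdict(list)
    for uri in uris:
        page, fragment = urldefrag(uri)
        pages[page].append(fragment)
    return {page: sorted(frags) for page, frags in sorted(pages.items())}


sample = [
    "library/sqlite3.html#sqlite3.version",
    "library/sqlite3.html#sqlite3.version_info",
    "library/dis.html#opcode-BEFORE_WITH",
]
for page, fragments in group_by_page(sample).items():
    print(page, fragments)
```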

@barneygale
Contributor Author

A very basic solution might be to redirect users to search.html, and supply the URL fragment as the search query. This would work OK for terms and python references, but not heading permalinks.
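
A rough sketch of that fallback, written in Python only to show the URL transformation (a real version would run as JavaScript in the browser). It assumes the docs root is the first path segment, as in `/3/`, and that `search.html` reads its query from a `q` parameter; `search_fallback` is a hypothetical name.

```python
from urllib.parse import urlencode, urljoin, urlsplit


def search_fallback(dead_url: str) -> str:
    """Build a search.html URL whose query is the dead link's fragment."""
    parts = urlsplit(dead_url)
    # Assumption: the docs root is the first path segment (e.g. "/3/").
    version = parts.path.split("/")[1]
    root = f"{parts.scheme}://{parts.netloc}/{version}/"
    query = urlencode({"q": parts.fragment})
    return urljoin(root, "search.html") + "?" + query


print(search_fallback(
    "https://docs.python.org/3/library/typing.html#typing.ByteString"
))
# https://docs.python.org/3/search.html?q=typing.ByteString
```

A heading permalink such as `#asyncio-watchers` would be passed through verbatim as the query, which is the weakness noted above.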

@rhettinger
Contributor

@nedbat Does the docs WG want to take a position with regard to docs stability versus refactoring into smaller chunks in hopes that SEO will be improved?

@JelleZijlstra
Member

I think this should be motivated not just by SEO, but also by improving the usability of the docs. It's a very large file that covers a lot of ground, and the way it's organized isn't necessarily the best. That may be bad for SEO, but it's also not ideal for human readers.

Currently the file has not just a discussion of Python's general "data model", the way data is represented, but also detailed documentation about some precise types, such as code objects. That documentation might fit better at https://docs.python.org/3/library/types.html#types.CodeType, so the data model page can focus more on behavior of the core language. Similarly, the data model page has discussion of numbers.Number and similar classes, which feels a bit out of place, as those are library ABCs, not core parts of the language. On the other hand, memoryview, a builtin, isn't mentioned as part of the "standard type hierarchy". Some of the file also duplicates the stdtypes page: compare https://docs.python.org/3/reference/datamodel.html#set-types and https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset.

I also agree we should avoid breaking links. If we want to be very strict in this, we could build some tooling that records e.g. all anchor targets in an old version of the docs and asserts they continue to work.
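
Recording anchor targets from built HTML could look roughly like the following. This is a sketch using only the standard-library `html.parser`; the class and function names are hypothetical, and the two page strings are toy inputs rather than real docs output.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every id attribute, i.e. every possible #fragment target."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.anchors.add(value)


def anchors_in(html: str) -> set[str]:
    """Return the set of anchor ids found in one HTML document."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.anchors


# Toy pages: the second one has lost an anchor the first one had.
old_page = '<section id="set-types"><dt id="object.__hash__"></dt></section>'
new_page = '<section id="set-types"></section>'

lost = anchors_in(old_page) - anchors_in(new_page)
print(sorted(lost))  # ['object.__hash__']
```

Comparing the recorded set against a later build gives exactly the assertion described: any non-empty `lost` set is an anchor that stopped working.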

@willingc
Contributor

willingc commented Oct 30, 2024

A few general considerations on splitting up a long page in the Language Reference (which this is). I'm speaking from my perspective and not for the entire @python/editorial-board. I would urge us to be more conservative with the Language Reference docs than the Library docs since it is the definition of the Python language.

  1. User experience and discoverability are more important than SEO.
  2. SEO is important, but we should have a plan of how we will use the SEO information.
  3. Like @hugovk mentions, changes, if made, would need to account for orphaned pages to keep existing links working.

@willingc
Contributor

> Currently the file has not just a discussion of Python's general "data model", the way data is represented, but also detailed documentation about some precise types, such as code objects. That documentation might fit better at https://docs.python.org/3/library/types.html#types.CodeType, so the data model page can focus more on behavior of the core language.

@JelleZijlstra's example is in line with my thinking when it comes to Language Reference changes vs. Library Doc changes.

@barneygale
Contributor Author

> User experience and discoverability are more important than SEO.

To be clear, UX and discoverability are the entire reason I care about SEO here!

@willingc
Contributor

> To be clear, UX and discoverability are the entire reason I care about SEO here!

I understand your intent. To restate, if improvements to SEO negatively impact UX and discoverability, we should pass until the negatives are mitigated. As an aside, the exclamation point wasn't necessary in the earlier response.

@barneygale
Contributor Author

Sorry!

@nedbat
Member

nedbat commented Oct 30, 2024

I think the page is too long, and splitting it up would improve both UX and SEO. It sounds like there is probably a way to reasonably preserve old links, though that still needs some investigation. It's a big job that should be done with care.

@ncoghlan
Contributor

As there seems to be consensus that a technical improvement around preserving deep links is needed before we embark on any major layout changes, I filed that request as a docsbuild-scripts issue: python/docs-community#134 (even if using the technical solution ends up being a CPython change, creating that solution seemed more like a docs build question to me).

@nedbat
Member

nedbat commented Oct 30, 2024

Another good first step is making a concrete proposal about how the page would be split up. I know from my own work on the devguide that it's easy to look through an existing document and be certain that it could be reshaped into something better. When you actually sit down to do the reshaping, difficulties arise, decisions have to be made, and so on. Does someone want to write a doc somewhere that shows how a split page would be structured?

@picnixz
Contributor

picnixz commented Oct 30, 2024

My first impression is: split them by classes first. They are good on their own IMO. And each class can be regrouped by topic (e.g. strings, numerics, collections, etc). I can sketch a rough idea if you want (maybe by the end of the afternoon).

@ncoghlan
Contributor

I'm not sure about the Data Model page, but @nedbat's question prompted me to add a draft split for the builtin types page in #126052 (comment) (giving str its own page would also mean we could finally move the details of the format string syntax out of the string module docs).

@willingc
Contributor

Perhaps the most conservative first iteration after getting the linking resolved would be to split the doc where there are natural breaks: 3.1, 3.2, 3.3 and 3.4. This will keep familiarity initially, and it does not preclude us from further splitting classes and 3.2 in future iterations.
