[Catalog-sig] Migrating away from scanning home pages (was: Deprecate External Links)

PJ Eby pje at telecommunity.com
Thu Feb 28 19:58:44 CET 2013


On Thu, Feb 28, 2013 at 5:55 AM, M.-A. Lemburg <mal at egenix.com> wrote:
> I think we all agree that scanning arbitrary HTML pages
> for download links is not a good idea and we need to
> transition away from this towards a more reliable system.
>
> Here's an approach that would work to start the transition
> while not breaking old tools (sketching here to describe the
> basic idea):
>
> Limiting scans to download_url
> ------------------------------
>
> Installers and similar tools preferably no longer scan all
> the links on the /simple/ index, but instead only look at
> the download links (which can be defined in the package
> meta data) for packages that don't host files on PyPI.
>
> Going only one level deep
> -------------------------
>
> If the download links point to a meta-file named
> "<packagename>-<version>-downloads.html#<sha256-hashvalue>",
> the installers download that file, check whether the
> hash value matches and if it does, scan the file in
> the same way they would parse the /simple/ index page of
> the package - think of the downloads.html file as a symlink
> to extend the search to an external location, but in a
> predefined and safe way.

Clever.  This is actually backward compatible with existing tools, in
that they can already read such a file right now.  The hashing and
verification isn't supported yet, but we could add warnings until it is.

Actually, the essence of your idea can be done even more simply: just
require that the link include a hash that the fetched page will be
verified against.  It essentially ensures that stale external links
can't break anything.
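That verification is only a few lines of stdlib-only code; here's a rough
sketch (the function names are mine, and I'm assuming the fragment carries
a bare sha256 hex digest, as in MAL's naming scheme above):

```python
import hashlib
from urllib.parse import urldefrag

def split_hash_url(url):
    # Separate the page URL from the expected sha256 digest
    # carried in the "#<sha256-hashvalue>" fragment.
    base, expected = urldefrag(url)
    return base, expected

def verify_page(content, expected):
    # Trust the fetched page only if its digest matches the fragment;
    # a stale or tampered page simply fails the check.
    return hashlib.sha256(content).hexdigest() == expected
```

An installer would fetch `base`, run `verify_page` on the bytes it got
back, and only then scan the page for download links.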

Further, since the existence of the hash means that the page can't be
changed without changing the URL, it means that PyPI *itself* can
simply fetch it once, parse the links from it, and serve them directly
on the /simple index page.  If you change the download URL, PyPI
discards the previous links and redoes the scan.
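Parsing those links server-side needs nothing fancy, either; a minimal
sketch of what PyPI's one-time scan might look like, using only the
stdlib (class and function names are made up):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect every <a href=...> target so the links can later be
    # re-served directly on the package's /simple/ index page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(page_text):
    parser = LinkCollector()
    parser.feed(page_text)
    return parser.links
```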

All in all, though, I'm not sure it's as viable as a simple "upload my
external release" button (in the UI) and a matching setup.py command
(for automation) as a way of getting people's releases done.  It seems
like building a downloads.html for your files from SourceForge, say,
would just be an annoying intermediate step.

(This is assuming, of course, that the licensing issues can be worked out.)


> * In a later phase of the transition we could have
>   PyPI cache the referenced distribution files locally
>   to improve reliability. This would turn the push
>   strategy for uploading files to PyPI into a pull
>   strategy for those packages and make things a lot
>   easier to handle for package maintainers.

I like this part.  I think we should just go straight there, and skip
the intermediate link formatting stuff.  ;-)
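For what it's worth, the pull side could start out as little more than
verifying and writing each fetched file into a local cache directory
(hypothetical names; a real PyPI would also need bookkeeping for when
to re-pull):

```python
import hashlib
import os

def pull_into_cache(content, filename, cache_dir):
    # Store a distribution file pulled from an external host, plus a
    # sidecar digest, so later requests can be served from PyPI itself.
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    with open(path, "wb") as f:
        f.write(content)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(content).hexdigest())
    return path
```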

