[Catalog-sig] V4 Pre-PEP: transition to release-file hosting on PYPI

M.-A. Lemburg mal at python.org
Fri Mar 15 16:47:34 CET 2013

Thanks, Holger. This version looks a lot better :-)

There are still some minor quirks which would need to be
addressed more explicitly, but overall, this proposal provides
a good way forward.

Perhaps it would also be possible to add the secured download
links and the caching/proxying ideas to the PEP at some point,
or we turn those into a new PEP.

I can't follow up in detail today, but will have a closer look
next week.

On 15.03.2013 10:29, holger krekel wrote:
> Hi all, in particular Philip, Marc-Andre, Donald,
> Carl and I decided to simplify the PEP and avoid the somewhat
> awkward ``simple/-with-externals`` index for various reasons, among them
> Marc-Andre's criticisms.  This also means present-day installation tools
> (shipped with Red Hat/Debian/etc.) will continue to work as today for
> those packages which remain in a hosting-mode that requires crawling and
> scraping.  They will still benefit from the fact that most packages will
> soon have a hosting-mode that avoids it.  Future releases of installation
> tools will default to neither crawling nor using (scraped) external
> links, and new PyPI projects will default to serving only uploaded files.
> The V4 pre-PEP also renames the three PyPI hosting modes to be more
> descriptive. Since all three modes allow external links, "pypi-ext" vs
> "pypi-only" were misleading. The new naming distinguishes the mode that both
> scrapes links from metadata and crawls external pages for more links
> ("pypi-scrape-crawl") from the mode that only scrapes links from metadata
> ("pypi-scrape") from the mode where all links are explicit ("pypi-explicit").
> Without the separate external index, it also turns out that the two transition
> phases are separated into PyPI changes (phase one) and installer-tool
> updates (phase two). There are no PyPI changes necessary in phase two.
> As stated in a new open question, it should be possible to do
> PEP-related installation tool updates during phase 1; that may still
> require a bit of clarification in the PEP's language.
> Carl and I are happy with this PEP version now and hope you all are as
> well.  Donald is already working on improving the analysis tool so
> we hopefully have some updated numbers soon.
> cheers,
> Holger
> Title: Transitioning to release-file hosting on PyPI
> Version: $Revision$
> Last-Modified: $Date$
> Author: Holger Krekel <holger at merlinux.eu>, Carl Meyer <carl at oddbird.net>
> Discussions-To: catalog-sig at python.org
> Status: Draft (PRE-submit V4)
> Type: Process
> Content-Type: text/x-rst
> Created: 10-Mar-2013
> Post-History:
> Abstract
> ========
> This PEP proposes a backward-compatible two-phase transition process
> to speed up, simplify and robustify installing from the
> pypi.python.org (PyPI) package index.  To ease the transition and
> minimize client-side friction, **no changes to distutils or existing
> installation tools are required in order to benefit from the first
> transition phase, which will result in faster, more reliable installs
> for most existing packages**.
> The first transition phase implements an easy and explicit means for a
> package maintainer to control which release file links are served to
> present-day installation tools.  The first phase also includes the
> implementation of analysis tools for present-day packages, to support
> communication with package maintainers and the automated setting of
> default modes for controlling release file links.  The first phase
> will also make new projects on PyPI default to serving only
> links to release files that were uploaded to PyPI.
> The second transition phase concerns end-user installation tools,
> which shall default to only install release files that are hosted on
> PyPI and tell the user if external release files exist, offering
> a choice to automatically use those external files.
> Rationale
> =========
> .. _history:
> History and motivations for external hosting
> --------------------------------------------
> When PyPI went online, it offered release registration but had no
> facility to host release files itself.  When hosting was added, no
> automated downloading tool existed yet.  When Philip Eby implemented
> automated downloading (through setuptools), he made the choice to
> allow people to use download hosts of their choice.  Discovery of
> externally-hosted packages was implemented as follows:
> #. The PyPI ``simple/`` index for a package contains all links scraped
>    from that package's long_description metadata for any release.
>    Links in the "Download-URL" and "Home-page" metadata
>    fields are given ``rel=download`` and ``rel=homepage`` attributes,
>    respectively.
> #. Any of these links whose target is a file whose name appears to be
>    in the form of an installable source or binary distribution, with
>    name in the form "packagename-version.ARCHIVEEXT", is considered a
>    potential installation candidate by installation tools.
> #. Similarly, any links suffixed with an "#egg=packagename-version"
>    fragment are considered an installation candidate.
> #. Additionally, the ``rel=homepage`` and ``rel=download`` links are
>    crawled by installation tools and, if HTML, are themselves scraped
>    for release-file links in the above formats.
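A rough Python sketch of rules 2 and 3 above, to make the matching concrete. The function name ``is_candidate`` and the list of archive extensions are assumptions for illustration; real installation tools recognise more formats and parse versions more carefully.

```python
from urllib.parse import urlparse

# Assumed, incomplete list of archive extensions (illustration only).
ARCHIVE_EXTS = (".tar.gz", ".tgz", ".tar.bz2", ".zip", ".egg")


def is_candidate(url, package):
    """Return True if url looks like an installation candidate for package."""
    parsed = urlparse(url)
    prefix = package.lower() + "-"

    # Rule 3: an "#egg=packagename-version" fragment marks a candidate.
    if parsed.fragment.lower().startswith("egg=" + prefix):
        return True

    # Rule 2: the file name has the form "packagename-version.ARCHIVEEXT".
    filename = parsed.path.rsplit("/", 1)[-1].lower()
    return filename.startswith(prefix) and filename.endswith(ARCHIVE_EXTS)
```

For example, ``is_candidate("http://example.com/dl/foo-1.0.tar.gz", "foo")`` matches on the file name, while a plain homepage URL such as ``http://example.com/foo/`` does not match and would only be crawled under rule 4.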
> Today, most packages released on PyPI host their release files on
> PyPI, but a small percentage (XXX need updated data) rely on external
> hosting.
> There are many reasons [2]_ why people have chosen external
> hosting. To cite just a few:
> - release processes and scripts have been developed already and upload
>   to external sites
> - it takes too long to upload large files from some places in the
>   world
> - export restrictions e.g. for crypto-related software
> - company policies which require offering open source packages
>   through own sites
> - problems with integrating uploading to PyPI into one's release
>   process (because of release policies)
> - desiring download statistics different from those maintained by PyPI
> - perceived bad reliability of PyPI
> - not aware that PyPI offers file-hosting
> Irrespective of the present-day validity of these reasons, there
> is clearly a history behind why people choose to host files externally,
> and for some time it was even the only way to do things.  This PEP
> takes the position that there are at least some valid reasons for
> external hosting.
> Problem
> -------
> **Today, Python package installers (pip, easy_install, buildout, and
> others) often need to query many non-PyPI URLs even if there are no
> externally hosted files**.  Apart from querying pypi.python.org's
> simple index pages, an installer also crawls all homepages and
> download pages ever specified with any release of a package.
> The need for installers to crawl external sites slows down
> installation and makes for a brittle and unreliable installation
> process.  Those sites and packages also don't take part in the
> :pep:`381` mirroring infrastructure, further decreasing reliability
> and speed of automated installation processes around the world.
> Most packages are hosted directly on pypi.python.org [1]_.  Even for
> these packages, installers still crawl their homepage and
> download-url, if specified.  Many package uploaders are not aware that
> specifying the "homepage" or "download-url" in their package metadata
> will needlessly slow down the installation process for all users.
> Relying on third party sites also opens up more attack vectors for
> injecting malicious packages into sites using automated installs.  A
> simple attack might just involve getting hold of an old now-unused
> homepage domain and placing malicious packages there.  Moreover,
> performing a man-in-the-middle (MITM) attack between an installation
> site and any of the download sites can inject malicious packages on
> the installation site.  As many homepages and download locations are
> using HTTP and not HTTPS, such attacks are not hard to launch.  Such
> MITM attacks can easily happen even for packages which never intended
> to host files externally as their homepages are contacted by
> installers anyway.
> There is currently no way for package maintainers to avoid
> external-link crawling, other than removing all homepage/download url
> metadata for all historic releases.  While a script [3]_ has been
> written to perform this action, it is not a good general solution
> because it removes useful metadata from PyPI releases.
> Even if the sites referenced by "Homepage" and "Download-URL" links were 
> not scraped for further links, there is no obvious way under the current
> system for a package owner to link to an installable file from a 
> long_description metadata field (which is shown as package documentation
> on ``/pypi/PKG``) without installation tools automatically considering
> that file a candidate for installation.  Conversely, there is no way
> to explicitly register multiple external release files without
> putting them in metadata fields.
> Goals
> -----
> These are the goals to be achieved by implementation of this PEP:
> * Package owners should be able to explicitly control which files are
>   presented by PyPI to installer tools as installation
>   candidates. Installation should not be slowed and made less reliable
>   by extensive and unnecessary crawling of links that package owners
>   did not explicitly nominate as installation files.
> * It should remain possible for package owners to choose to host their
>   release files on their own hosting, external to PyPI. It should be
>   easy for a user to request the installation of such releases using
>   automated installer tools.
> * Automated installer tools should not install externally-hosted
>   packages **by default**, but only when explicitly authorized to do
>   so by the user. When tools refuse to install such a package by
>   default, they should tell the user exactly which external link(s)
>   they would need to follow, and what option(s) the user can provide
>   to authorize the tool to follow those links. PyPI should provide all
>   necessary metadata for installer tools to implement this easily
>   and within a single request/reply interaction.
> * Migration from the status quo to the above points should be gradual
>   and minimize breakage. This includes tooling that makes it easy for
>   package owners with an existing release process that uploads to
>   non-PyPI hosting to also upload those release files to PyPI.  
> Solution / two transition phases
> ================================
> The first transition phase introduces a "hosting-mode" field for each
> project on PyPI, allowing package owners explicit control of which
> release file links are served to present-day installation tools in the
> machine-readable ``simple/`` index. Once individual early adopters have
> successfully switched hosting modes, the first transition phase will
> set a default hosting mode for existing packages, based on
> automated analysis.  **Maintainers will be notified one month ahead of
> any such automated change**.  At completion of the first transition
> phase, **all present-day existing release and installation processes
> and tools are expected to continue working**.  Any remaining errors or
> problems are expected to only relate to installation of individual
> packages and can be easily corrected by package maintainers or PyPI
> admins if maintainers are not reachable.
> Also in the first phase, each link served in the ``simple/`` index
> will be explicitly marked as ``rel="internal"`` (hosted by the index
> itself) or ``rel="external"`` (linking to an external site that is not
> part of the index).
> In the second transition phase, PyPI client installation tools shall
> be updated to default to only install ``rel="internal"`` packages
> unless a user specifies option(s) to permit installing from external
> links.
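A minimal sketch of how a client tool might separate the two kinds of links once they are labelled. The HTML in ``SAMPLE`` is a hypothetical rendering of a ``simple/`` page (the PEP does not fix the exact markup), and ``RelLinkParser`` and ``split_links`` are illustrative names.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collect (href, rel) pairs from the anchors of an index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if "href" in attrs:
            self.links.append((attrs["href"], attrs.get("rel")))


def split_links(html):
    """Return (internal, external) lists of hrefs found in html."""
    parser = RelLinkParser()
    parser.feed(html)
    internal = [href for href, rel in parser.links if rel == "internal"]
    external = [href for href, rel in parser.links if rel == "external"]
    return internal, external


# Hypothetical page content, for illustration only.
SAMPLE = """
<a href="../../packages/source/f/foo/foo-1.0.tar.gz" rel="internal">foo-1.0.tar.gz</a>
<a href="http://example.com/foo-1.1.tar.gz" rel="external">foo-1.1.tar.gz</a>
"""
```

Calling ``split_links(SAMPLE)`` yields one internal and one external href, so a tool can decide per link whether user authorization is needed.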
> Maintainers of packages which currently host release files on non-PyPI
> sites shall receive instructions and tools to ease "re-hosting" of
> their historic and future package release files.  This re-hosting tool
> MUST be available before automated hosting-mode changes are announced
> to package maintainers.
> Implementation
> ==============
> Hosting modes
> -------------
> The foundation of the first transition phase is the introduction of
> three "modes" of PyPI hosting for a package, affecting which links are
> generated for the ``simple/`` index.  These modes are implemented
> without requiring changes to installation tools via changes to the
> algorithm for generating the machine-readable ``simple/`` index.
> The modes are:
> - ``pypi-scrape-crawl``: no change from the current situation of
>   generating machine-readable links for installation tools, as
>   outlined in the history_.
> - ``pypi-scrape``: for a package in this mode, links to be added to
>   the ``simple/`` index are still scraped from package
>   metadata. However, the "Home-page" and "Download-url" links are
>   given ``rel=ext-homepage`` and ``rel=ext-download`` attributes
>   instead of ``rel=homepage`` and ``rel=download``. The effect of this
>   (with no change in installation tools necessary) is that these links
>   will not be followed and scraped for further candidate links by present-day
>   installation tools: only installable files hosted directly on PyPI or
>   linked directly from PyPI metadata will be considered for installation.
>   Installation tools MAY evolve to offer an option to use the new
>   rel attributes to crawl external pages but MUST NOT default to it.
> - ``pypi-explicit``: for a package in this mode, only links to release
>   files uploaded to PyPI, and external links to release files
>   explicitly nominated by the package owner (via a new interface
>   exposed by PyPI) will be added to the ``simple/`` index.
> Thus the hope is that eventually all projects on PyPI can be migrated
> to the ``pypi-explicit`` mode, while preserving the ability to install
> release files hosted externally via installer tools. Deprecation of
> hosting modes to eventually only allow the ``pypi-explicit`` mode is
> NOT REGULATED by this PEP but is expected to become feasible some time
> after successful implementation of the transition phases described in
> this PEP.  It is expected that deprecation requires **a new process to deal 
> with abandoned packages** because of unreachable maintainers for still
> popular packages.
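The three modes can be read as three ways of generating the link list for one project. The sketch below is one possible reading, not PyPI's implementation: the function and all parameter names are hypothetical, and the separate ``rel="internal"``/``rel="external"`` labelling is left out to keep the mode logic visible.

```python
def simple_index_links(mode, uploaded, scraped, homepage, download_url, explicit):
    """Return (url, rel) pairs for one project's simple/ page.

    Hypothetical inputs:
      uploaded     -- release files hosted on PyPI
      scraped      -- links scraped from long_description metadata
      homepage     -- "Home-page" URL, or None
      download_url -- "Download-URL" URL, or None
      explicit     -- external release-file URLs registered by the owner
    """
    links = [(url, None) for url in uploaded]

    if mode == "pypi-explicit":
        # Only uploaded files and explicitly nominated external files.
        return links + [(url, None) for url in explicit]

    if mode not in ("pypi-scrape", "pypi-scrape-crawl"):
        raise ValueError("unknown hosting mode: %r" % (mode,))

    # Both scraping modes still serve links scraped from metadata.
    links += [(url, None) for url in scraped]

    # Only pypi-scrape-crawl keeps the rel values that present-day
    # tools follow; pypi-scrape renames them so they are not crawled.
    prefix = "" if mode == "pypi-scrape-crawl" else "ext-"
    if homepage:
        links.append((homepage, prefix + "homepage"))
    if download_url:
        links.append((download_url, prefix + "download"))
    return links
```

The only difference between the two scraping modes here is the ``rel`` value, which is what lets the change take effect without touching installation tools.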
> First transition phase (PyPI)
> -----------------------------
> The proposed solution consists of multiple implementation and
> communication steps:
> #. Implement in PyPI the three modes described above, with an
>    interface for package owners to select the mode for each package
>    and register explicit external file URLs.
> #. For packages in all modes, label all links in the ``simple/`` index
>    with ``rel="internal"`` or ``rel="external"``, to make it easier
>    for client tools to distinguish the types of links in the second
>    transition phase.
> #. Default all newly-registered packages to ``pypi-explicit`` mode
>    (package owners can still switch to the other modes as desired).
> #. Determine (via an automated analysis tool) which packages have all
>    installable files available on PyPI itself (group A), which have
>    all installable files linked directly from PyPI metadata (group B),
>    and which have installable versions available that are linked only
>    from external homepage/download HTML pages (group C).
> #. Send mail to maintainers of projects in group A that their project
>    will be automatically configured to ``pypi-explicit`` mode in one
>    month, and similarly to maintainers of projects in group B that
>    their project will be automatically configured to ``pypi-scrape``
>    mode.  Inform them that this change is not expected to affect
>    installability of their project at all, but will result in faster
>    and safer installs for their users.  Encourage them to set this
>    mode themselves sooner to benefit their users.
> #. Send mail to maintainers of packages in group C that their package
>    hosting mode is ``pypi-scrape-crawl``, list the URLs which
>    currently are crawled, and suggest that they either re-host their
>    packages directly on PyPI and switch to ``pypi-explicit``, or at
>    least provide direct links to release files in PyPI metadata and
>    switch to ``pypi-scrape``.  Provide instructions and tools to help
>    with these transitions.
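The grouping in steps 4 to 6 could be sketched as follows. This is an interpretation under stated assumptions, not the analysis tool itself: the function name ``classify`` is hypothetical, and each argument is assumed to be a set of release-file names found through one discovery route.

```python
def classify(pypi_files, metadata_files, crawled_files):
    """Return (group, proposed hosting mode) for one project.

    Hypothetical inputs (sets of release-file names):
      pypi_files     -- files uploaded to PyPI
      metadata_files -- files linked directly from PyPI metadata
      crawled_files  -- files found only by crawling homepage/download pages
    """
    # Group C: something is reachable only through crawling.
    if crawled_files - pypi_files - metadata_files:
        return "C", "pypi-scrape-crawl"
    # Group B: everything is on PyPI or linked directly from metadata.
    if metadata_files - pypi_files:
        return "B", "pypi-scrape"
    # Group A: everything installable is hosted on PyPI itself.
    return "A", "pypi-explicit"
```

Groups A and B map to the modes that would be set automatically after the one-month notice; group C stays in ``pypi-scrape-crawl`` until the maintainer acts.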
> Second transition phase (installer tools)
> -----------------------------------------
> For the second transition phase, maintainers of installation tools are
> asked to release two updates. 
> The first update shall provide clear warnings if externally-hosted
> release files (that is, files whose link is ``rel="external"``) are
> selected for download, state exactly which projects and URLs are
> affected, and warn that in future versions externally-hosted downloads
> will be disabled by default.
> The second update should change the default mode to allow only
> installation of ``rel="internal"`` package files, and allow
> installation of externally-hosted packages only when the user supplies
> an option (ideally an option specifying exactly which external domains
> are to be trusted as download sources). When download of an
> externally-hosted package is disallowed, the user should be notified,
> with instructions for how to make the install succeed and warnings
> about the implication (that a file will be downloaded from a site that
> is not part of the package index).
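A sketch of the default proposed for the second update: internal links are always usable, external ones only when their host has been explicitly trusted. The names ``select_links`` and ``refusal_message`` and the ``--allow-host`` option are hypothetical; no existing tool is claimed to offer them.

```python
from urllib.parse import urlparse


def select_links(links, allowed_hosts=()):
    """Split (url, rel) pairs into (usable, refused) URL lists.

    A link is usable if it is rel="internal" or its host is in
    allowed_hosts; anything else is refused by default.
    """
    usable, refused = [], []
    for url, rel in links:
        if rel == "internal" or urlparse(url).hostname in allowed_hosts:
            usable.append(url)
        else:
            refused.append(url)
    return usable, refused


def refusal_message(refused):
    """Tell the user which links were skipped and how to permit them."""
    lines = ["Refusing to download externally hosted files:"]
    lines += ["  " + url for url in refused]
    # "--allow-host" is a made-up option name for this sketch.
    lines.append("Re-run with --allow-host HOST to permit a download site.")
    return "\n".join(lines)
```

Trusting hosts one by one, rather than allowing all external links at once, follows the PEP's suggestion that the option ideally names exactly which external domains are to be trusted.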
> Open questions / Tasks
> ===========================
> - Should we introduce some form of PyPI API versioning in this PEP?
>   (it might complicate matters and delay the implementation but is
>   often seen as good practice).
> - In pypi-scrape mode: should PyPI itself determine which links are
>   installation candidates and avoid presenting other arbitrary links
>   (which are currently served)?
> - Consider that installation tools may choose to release updates
>   already during transition phase 1, to warn about crawled and scraped
>   links (which are easily identifiable today, and remain so with the new
>   rel attributes after transition phase 1).
> References
> ==========
> .. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted, http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html (XXX need to update this data for all easy_install-supported formats)
> .. [2] Marc-Andre Lemburg, reasons for external hosting, http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html
> .. [3] Holger Krekel, script to remove homepage/download metadata for all releases, http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html
> Acknowledgments
> ================
> Philip Eby for precise information and the basic ideas to implement
> the transition via server-side changes only.
> Donald Stufft for pushing away from external hosting and offering to
> implement both a Pull Request for the necessary PyPI changes and the
> analysis tool to drive the transition phase 1.
> Marc-Andre Lemburg, Nick Coghlan and catalog-sig in general for
> thinking through issues regarding getting rid of "external hosting".
> Copyright
> =========
> This document has been placed in the public domain.
> ..
>    Local Variables:
>    mode: indented-text
>    indent-tabs-mode: nil
>    sentence-end-double-space: t
>    fill-column: 70
>    coding: utf-8
>    End:

Marc-Andre Lemburg
PSF Vice Chairman
