[Catalog-sig] V3 PEP-draft for transitioning to pypi-hosting of release files
holger krekel
holger at merlinux.eu
Wed Mar 13 12:21:59 CET 2013
Hi all,
after some more discussions and hours spent by Carl Meyer (who is now
co-authoring the PEP) and me, here is a new V3 pre-submit draft.
It is now more ambitious than the previous draft as should be obvious
from the modified abstract (and Carl Meyer's and Philip's earlier
interactions on this list). There also are more details of how
the current link-scraping works among other improvements and incorporations
of feedback from discussions here.
We intend to submit this draft tonight to the PEP editors.
Feedback now and later remains welcome. I am sure there are issues to
be sorted and clarified, among them the versioning-API suggestion by
Marc-Andre.
Thanks for everybody's support and feedback so far,
holger
PEP: XXX
Title: Transitioning to release-file hosting on PyPI
Version: $Revision$
Last-Modified: $Date$
Author: Holger Krekel <holger at merlinux.eu>, Carl Meyer <carl at oddbird.net>
Discussions-To: catalog-sig at python.org
Status: Draft (PRE-submit V3)
Type: Process
Content-Type: text/x-rst
Created: 10-Mar-2013
Post-History:
Abstract
========
This PEP proposes a backward-compatible two-phase transition process to speed
up, simplify and robustify installing from the pypi.python.org (PyPI)
package index. To ease the transition and minimize client-side
friction, **no changes to distutils or existing installation tools are
required in order to benefit from the transition phases, which are to
result in faster, more reliable installs for most existing packages**.
The first transition phase implements easy and explicit means for
a package maintainer to control which release file links are
served to present-day installation tools. The first phase also
includes the implementation of analysis tools for present-day packages,
to support communication with package maintainers and the automated
setting of default modes for controlling release file links.
The second transition phase will result in the current PyPI index
serving only PyPI-hosted files by default. Externally hosted files
will still be automatically discoverable through a second index.
Present-day installation tools will be able to continue working
by specifying this second index. New versions of installation
tools shall default to installing packages only from PyPI unless
the user explicitly chooses to include non-PyPI sites.
Rationale
=========
.. _history:
History and motivations for external hosting
--------------------------------------------
When PyPI went online, it offered release registration but had no
facility to host release files itself. When hosting was added, no
automated downloading tool existed yet. When Philip Eby implemented
automated downloading (through setuptools), he made the choice to
allow people to use download hosts of their choice. The finding of
externally-hosted packages was implemented as follows:
#. The PyPI ``simple/`` index for a package contains all links found
anywhere in that package's metadata for any release. Links in the
"Download-URL" and "Home-page" metadata fields are given
``rel=download`` and ``rel=homepage`` attributes, respectively.
#. Any of these links whose target is a file whose name appears to be
in the form of an installable source or binary distribution, with
basename in the form "packagename-version.ARCHIVEEXT", is considered
a potential installation candidate.
#. Similarly, any links suffixed with an "#egg=packagename-version"
fragment are considered an installation candidate.
#. Additionally, the ``rel=homepage`` and ``rel=download`` links are
followed and, if HTML, are themselves scraped for release-file links
in the above formats.
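The link-selection rules above can be sketched roughly as follows. This
is an illustrative approximation only, not the actual setuptools code:
the function names and the ``ARCHIVE_EXTS`` list are assumptions made
for this example.

```python
from urllib.parse import urlsplit

# Assumed set of archive extensions; real installers may accept others.
ARCHIVE_EXTS = (".tar.gz", ".tar.bz2", ".tgz", ".zip", ".egg")


def is_candidate(url, package):
    """Return True if url looks like a release file for package."""
    parts = urlsplit(url)
    prefix = package.lower() + "-"
    # An "#egg=packagename-version" fragment marks a candidate.
    if parts.fragment.lower().startswith("egg=" + prefix):
        return True
    # Otherwise the basename must look like "packagename-version.ARCHIVEEXT".
    basename = parts.path.rsplit("/", 1)[-1].lower()
    if not basename.startswith(prefix):
        return False
    return any(basename.endswith(ext) for ext in ARCHIVE_EXTS)


def links_to_crawl(links):
    """Pick the rel=homepage / rel=download links that get followed.

    links is a list of (url, rel) pairs, with rel set to None for
    links that carry no rel attribute.
    """
    return [url for url, rel in links if rel in ("homepage", "download")]
```

An installer following these rules would apply ``is_candidate`` both to
the links on the ``simple/`` page and to the links scraped from every
page returned by ``links_to_crawl``.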
Today, most packages released on PyPI host their release files on
PyPI, but a small percentage (XXX need updated data) rely on external
hosting.
There are many reasons [2]_ why people have chosen external
hosting. To cite just a few:
- release processes and scripts have been developed already and upload
to external sites
- it takes too long to upload large files from some places in the
world
- export restrictions e.g. for crypto-related software
- company policies which require offering open source packages
through own sites
- problems with integrating uploading to PYPI into one's release
process (because of release policies)
- desiring download statistics different from those maintained by PyPI
- perceived bad reliability of PYPI
- not aware that PyPI offers file-hosting
Irrespective of the present-day validity of these reasons, there
clearly is a history behind why people chose to host files externally, and
for some time it was even the only way to do things.
Problem
-------
**Today, Python package installers (pip, easy_install, buildout, and
others) often need to query many non-PyPI URLs even if there are no
externally hosted files**. Besides querying pypi.python.org's
simple index pages, an installer also crawls all homepages and download
pages ever specified with any release of a package.
The need for installers to crawl external sites slows down
installation and makes for a brittle and unreliable installation
process. Those sites and packages also don't take part in the
:pep:`381` mirroring infrastructure, further decreasing reliability
and speed of automated installation processes around the world.
Most packages are hosted directly on pypi.python.org [1]_. Even for
these packages, installers still crawl the homepage(s) of a package.
Many package uploaders are not aware that specifying the "homepage" in
their release process will slow down the installation process for all
users.
Relying on third party sites also opens up more attack vectors for
injecting malicious packages into sites using automated installs. A
simple attack might just involve getting hold of an old now-unused
homepage domain and placing malicious packages there. Moreover,
performing a Man-in-The-Middle (MITM) attack between an installation
site and any of the download sites can inject malicious packages on
the installation site. As many homepages and download locations are
using HTTP and not HTTPS, such attacks are not hard to launch. Such
MITM attacks can easily happen even for packages which never intended
to host files externally as their homepages are contacted by
installers anyway.
There is currently no way for package maintainers to avoid 3rd party
crawling, other than removing all homepage/download url metadata for
all historic releases. While a script [3]_ has been written to
perform this action, it is not a good general solution because it
removes semantic information like the "homepage" specification from
PYPI packages.
Even if the "Homepage" and "Download-URL" links were not scraped for
further links, there is still no way under the current system for a
package owner to link to an installable file from their package
metadata without installation tools automatically considering that
file a candidate for installation.
Solution / two transition phases
================================
The first transition phase starts off by introducing a "hosting-mode"
field for each project on PyPI, allowing explicit control of which
machine-readable release file links are served to present-day
installation tools. Once individual early adopters have successfully
changed their hosting mode, the first transition will then set a
default hosting mode for existing packages, based on automated analysis.
**Maintainers will be notified one month ahead of any such automated
change**. At completion of the first transition phase, **all
present-day existing release and installation processes and tools are
expected to continue working**. Any remaining errors or problems are
expected to only relate to installation of individual packages and can
be easily corrected by package maintainers or PYPI admins if maintainers
are not reachable.
**The second transition phase will then get PyPI, after a three-month
warning period, to only serve links for PyPI-hosted packages under the
present-day ``simple/`` index**. At this point, present-day installation
tools will not see externally hosted links anymore, unless they specify
a new ``simple/-with-externals`` index which PYPI MUST offer ahead of
the start of the second transition phase. This new index contains
the external links as controlled by a package maintainer. Moreover, PyPI
MUST also provide means to register and control download
links, independently from the current metadata and remote html-scraping
methods. At completion of the second transition phase, all present-day
installation tools will, and all future installation tool releases SHALL,
default to installing only PyPI-hosted packages unless a user specifies
option(s) to include external links or the external index. If an
installation tool chooses to use the new ``simple/-with-externals/`` as
a default, it MUST warn the user with a precise message stating which
external links were followed.
Maintainers of packages which currently host release files on non-PyPI
sites shall receive instructions and tools to ease "re-hosting" of
their historic and future package release files. The implementation
of such a re-hosting tool is expected but NOT REQUIRED to be available
at the beginning of phase 2.
Implementation
==============
The foundation of both transition phases is the introduction of three
"modes" of PyPI hosting for a package, affecting which links are
generated for the ``simple/`` index in transition phase 1. These modes
are implemented without requiring changes to installation tools via changes
to the algorithm for generating the machine-readable "/simple" index.
The modes are:
- ``pypi-ext-crawl``: no change from the current situation of generating
machine-readable links for installation tools, as outlined in the
history_.
- ``pypi-ext``: for a package in this mode, the "Home-page" and
"Download-url" links added to the simple index are given
``rel=ext-homepage`` and ``rel=ext-download`` attributes instead of
``rel=homepage`` and ``rel=download``. The effect of this (with no
change in installation tools necessary) is that these links will
not be followed and scraped for further candidate links. Only installable
files linked directly from PyPI metadata (wherever they are hosted) will be
considered for installation.
- ``pypi-only``: for a package in this mode, only links to URLs on
PyPI itself will be added to the simple index.
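A minimal sketch of how the three modes could change the links generated
for the ``simple/`` index. This is not the PyPI implementation: the
function name, the ``(url, rel)`` representation of links and the
``PYPI_HOST`` constant are assumptions made for illustration.

```python
from urllib.parse import urlsplit

PYPI_HOST = "pypi.python.org"  # assumed host of PyPI-hosted files


def simple_index_links(mode, links):
    """Return the (url, rel) pairs served on the simple/ page.

    links is the list of (url, rel) pairs gathered from a project's
    metadata; rel is "homepage", "download" or None.
    """
    if mode == "pypi-ext-crawl":
        # Unchanged: installers will still follow homepage/download links.
        return list(links)
    if mode == "pypi-ext":
        # Renamed rel values are not recognised by present-day tools,
        # so these pages are no longer crawled.
        rename = {"homepage": "ext-homepage", "download": "ext-download"}
        return [(url, rename.get(rel, rel)) for url, rel in links]
    if mode == "pypi-only":
        # Only links pointing at PyPI itself are kept.
        return [(url, rel) for url, rel in links
                if urlsplit(url).hostname == PYPI_HOST]
    raise ValueError("unknown hosting mode: %r" % (mode,))
```

The point of the sketch is that all three behaviours are selected on the
server side, so present-day installation tools need no changes.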
At the end of the warning period of transition phase 2, the ``simple/``
index will be restricted to only show links to URLs on PyPI itself while the
``simple/-with-externals`` index will during both transition phases show
links to PyPI and any externals as controlled by the package maintainer
and the hosting-mode.
For a package in ``pypi-only`` mode, external links will no longer be
automatically scraped from metadata and added to the two indexes.
However, PyPI will expose an interface for package maintainers to
explicitly specify any number of URLs to externally hosted installable
files for a given release, and these URLs will be added to the
``simple/-with-ext`` index page for that project but NOT to the basic
``simple/`` index page. Thus the ``-with-ext`` alternative index gives
package owners with good reason to host their packages elsewhere a
means to do so (even under the ``pypi-only`` package mode) and still
have that information reflected on PyPI in machine-readable form, allowing
installation tool users an explicit and easy choice of whether they wish
to read an index that includes externally-hosted packages or one that
does not.
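For a project in ``pypi-only`` mode, the relationship between the two
index pages can be sketched as below. The function and argument names
are hypothetical; the sketch only shows that explicitly registered
external URLs appear on the ``-with-ext`` page and never on the basic
page.

```python
def index_pages(pypi_file_urls, registered_external_urls):
    """Build both index pages for a project in pypi-only mode.

    pypi_file_urls: URLs of release files hosted on PyPI itself.
    registered_external_urls: URLs the maintainer registered explicitly.
    """
    simple = list(pypi_file_urls)
    # The alternative index is a superset of the basic one.
    with_ext = simple + list(registered_external_urls)
    return {"simple/": simple, "simple/-with-ext": with_ext}
```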
The goal of this PEP is that eventually all projects on PyPI can be
migrated to the ``pypi-only`` mode, while preserving the ability to
install release files hosted from third parties in an automated manner.
Deprecation of hosting-modes to eventually only allow the "pypi-only"
mode is NOT REGULATED by this PEP but is expected to become feasible
some time after successful implementation of the two transition phases
described in this PEP.
Implementation and interaction timeline
--------------------------------------------------
The proposed solution consists of multiple implementation and
communication steps:
#. Implement in PyPI the three modes and the ``-with-ext`` index as
described above, and an interface for package owners to select the
mode for each package and register explicit external file URLs for
the ``-with-ext`` index (for projects in the ``pypi-only`` mode).
Default all newly-registered packages to ``pypi-only`` mode (but
package owners can still switch to the other modes as
desired). Implement in ``pep381client`` the mirroring of the
``-with-ext`` index pages.
#. Determine which packages have installable versions available that
are linked only from homepage/download pages (group B) and which
packages have all installable files available on PyPI itself (group
A).
#. Send mail to maintainers of projects in group A that their project
is going to be automatically configured to ``pypi-ext`` mode in one
month. Inform them that this change is not expected to affect
installability of their project at all, but will result in faster
and safer installs for their users. Encourage them to set this
mode (or ``pypi-only``) themselves earlier to benefit their users.
#. Send mail to maintainers of packages in group B that their package
hosting mode is ``pypi-ext-crawl``, list the sites which currently
are crawled, and suggest that they re-host their packages directly
on PyPI and then switch to ``pypi-only``. Provide instructions and
tools to help with this "re-uploading" process.
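The grouping used in the steps above could be computed along these
lines. This is a sketch under assumptions, not the planned analysis
tool: it presumes that, for each project, the set of versions installable
from PyPI-hosted files and the set of versions found by crawling
homepage/download pages have already been collected.

```python
def classify(pypi_versions, crawled_versions):
    """Return "B" if some version is reachable only by crawling, else "A"."""
    only_by_crawling = set(crawled_versions) - set(pypi_versions)
    return "B" if only_by_crawling else "A"


def planned_mode(group):
    """Hosting mode announced to the maintainers of each group."""
    return {"A": "pypi-ext", "B": "pypi-ext-crawl"}[group]
```

A project whose crawled pages merely repeat versions already on PyPI
lands in group A, so turning off crawling does not affect its
installability.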
In addition, maintainers of installation tools are asked to release
two updates. The first one shall provide clear warnings if
externally-hosted packages (that is, packages at a URL whose domain
name differs from the domain name of the index URL in use) are
selected for download, state exactly for which projects and URLs this
happens, and announce that in future versions externally-hosted
downloads will be disabled by default.
The second update for installation tools should change the default
mode to allow only installation of package files hosted at the index
domain, and allow installation of externally-hosted packages only when
the user supplies an option (ideally an option specifying exactly
which external domains are to be trusted as download sources). When
download of an externally-hosted package is disallowed, the user
should be notified, with instructions for how to make the install
succeed and warnings about the potential consequences.
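The "externally hosted" test and the opt-in behaviour of the second
update might look roughly like this in an installation tool. The
function names and the ``trusted_hosts`` parameter are hypothetical and
do not correspond to options of any existing tool.

```python
from urllib.parse import urlsplit


def is_external(file_url, index_url):
    """True if the file's domain differs from the index URL's domain."""
    return urlsplit(file_url).hostname != urlsplit(index_url).hostname


def may_download(file_url, index_url, trusted_hosts=()):
    """Allow index-hosted files, plus external hosts the user trusts."""
    if not is_external(file_url, index_url):
        return True
    return urlsplit(file_url).hostname in trusted_hosts
```

When ``may_download`` returns ``False``, the tool would notify the user
and explain how to allow the external host, as described above.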
It is expected that tools in this release may choose to change the
default index url to ``https://pypi.python.org/simple/-with-ext`` in
order to support explicitly-registered external URLs for projects in
``pypi-only`` mode. Tools may choose to do this only when the user
requests installation of externally-hosted packages, or may choose to
do this in all cases so as to be able to notify users when an
externally-hosted file is available.
Specific timelines for deprecation of ``pypi-ext-crawl`` and
``pypi-ext`` modes are not mandated in this PEP; this will depend on
observed behavior of package owners and availability of tooling. It is
expected that ``pypi-ext-crawl`` mode will be an early candidate for
deprecation; it may be necessary to leave ``pypi-ext`` mode in place
for quite some time, at least for those packages already
depending on it (it may be removed as an option for new packages when
tool support for explicit external URLs and the ``-with-ext`` index is
sufficient).
Open questions
==============
- Should we introduce a third index which maintains the old behaviour
of providing links irrespective of a maintainer's hosting-mode choice?
- should we introduce some form of PYPI API versioning in this PEP?
(it might complicate matters and delay the implementation but is
often seen as good practice)
References
==========
.. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted, http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html (XXX need to update this data for all easy_install-supported formats)
.. [2] Marc-Andre Lemburg, reasons for external hosting, http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html
.. [3] Holger Krekel, Script to remove homepage/download metadata for all releases http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html
Acknowledgements
================
Philip Eby for precise information and the basic ideas to implement
the transition via server-side changes only.
Donald Stufft for pushing away from external hosting and
offering to implement both a Pull Request for the necessary PyPI changes
and the analysis tool to drive the transition phase 1.
Marc-Andre Lemburg, Nick Coghlan and catalog-sig in general for
thinking through issues regarding getting rid of "external hosting".
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: