[Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI

holger krekel holger at merlinux.eu
Tue Mar 12 12:38:17 CET 2013

Hi all,

below is the new PEP pre-submit version (V2) which incorporates the
latest suggestions and aims at a rapidly deployable solution.  Thanks in
particular to Philip, Donald and Marc-Andre.  I also added a few notes
on how installers should behave with respect to non-PYPI crawling.  

I think a PEP like doc is warranted and that we should not silently
change things without proper communication to maintainers and pre-planning
the implementation/change process.  Arguably, the changes are more
invasive than "oh, let's just do a http->https redirect" which didn't
work too well either.

Now, if there is some agreement, I can submit this PEP officially
tomorrow, and given agreement/refinements from the Pycon folks and the
likes of Richard, we may be able to get going very shortly after Pycon.


PEP-draft: transitioning to release-file hosting on PYPI

Abstract
This PEP proposes a backward-compatible transition process to speed up,
simplify and robustify installing from the pypi.python.org (PYPI)
package index.  The initial transition will put most packages on PYPI
automatically in a configuration mode which will prevent client-side
crawling from installers.  To ease automatic transition and minimize
client-side friction, **no changes to distutils or installation tools** are
required.  Instead, the transition is implemented by modifying PYPI to
serve links from ``simple/`` pages in a configurable way, preventing or
allowing crawling of non-PYPI sites for detecting release files.
Maintainers of all PYPI packages will be notified ahead of these changes.
Maintainers of packages which currently are hosted on non-PYPI sites
shall receive instructions and tools to ease "re-hosting" of their
historic and future package release files.  The implementation of such
tools is NOT required for implementing the initial automatic transition.

Installation tools like pip and easy_install shall warn about crawling
non-PYPI sites and later default to disallow it and only allow it with
an explicit option.

History and motivations for external hosting

When PYPI went online, it offered release registration but had no
facility to host release files itself.  When hosting was added, no
automated downloading tool existed yet.  When Philip Eby implemented
automated downloading (through setuptools), he made the choice 
to allow people to use download hosts of their choice.  This was
implemented by the PYPI ``simple/`` index containing links of type
``rel=homepage`` or ``rel=download`` which are crawled by installation
tools to discover package links.  As of March 2013, a substantial share
of packages (estimated at about 10%) use this mechanism to host
files on github, bitbucket, sourceforge or their own hosting sites such
as ``mercurial.selenic.com``, to name just a few.
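To make the mechanism concrete, here is a minimal sketch (the package
name, URLs and markup are invented; real ``simple/`` pages differ in
detail) of how an installer might separate directly hosted files from
crawlable ``rel=homepage``/``rel=download`` links:

```python
from html.parser import HTMLParser

# Hypothetical excerpt of a simple/ index page for a made-up package.
SIMPLE_PAGE = """
<html><body>
<a href="../../packages/source/f/foo/foo-1.0.tar.gz">foo-1.0.tar.gz</a>
<a href="http://example.org/foo/" rel="homepage">home page</a>
<a href="http://example.org/foo/downloads/" rel="download">download</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collect hrefs, noting which carry rel=homepage or rel=download."""
    def __init__(self):
        super().__init__()
        self.direct = []         # files hosted on PYPI itself
        self.crawl_targets = []  # 3rd party pages installers will crawl
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        href = d.get("href")
        if href is None:
            return
        if d.get("rel") in ("homepage", "download"):
            self.crawl_targets.append(href)
        else:
            self.direct.append(href)

collector = LinkCollector()
collector.feed(SIMPLE_PAGE)
print(collector.direct)
print(collector.crawl_targets)
```

Each ``rel=homepage``/``rel=download`` link found this way becomes an
extra site the installer must contact before it can install the package.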

There are many reasons [2]_ why people choose to use external hosting,
to cite just a few:

- release processes and scripts have already been developed and
  upload to external sites

- it takes too long to upload large files from some places in the world

- export restrictions e.g. for crypto-related software

- company policies which prescribe offering open source packages through
  own sites

- problems with integrating uploading to PYPI into one's release process
  (because of release policies)

- perceived bad reliability of PYPI

- missing knowledge that files can be uploaded to PYPI

Irrespective of the present-day validity of these reasons, there clearly
is a history of why people chose to host files externally, and for some
time it was the only way to do things.


**Today, python package installers (pip and easy_install) often need to
query non-PYPI sites even if there are no externally hosted files**.
Apart from querying pypi.python.org's simple index pages, installers
also crawl every homepage and download page ever specified with any
release of a package.  This need to crawl 3rd party sites slows down
installation and makes for a brittle, unreliable installation
process.   Those sites and packages also don't
take part in the :pep:`381` mirroring infrastructure, further decreasing
reliability and speed of automated installation processes around the world. 

Roughly 90% of packages are hosted directly on pypi.python.org [1]_.
Even for them installers still need to crawl the homepage(s) of a
package.  Many package uploaders are, in particular, not aware that
specifying a "homepage" in their release metadata will slow down
the installation process for all of the package's users.

Relying on third party sites also opens up more attack vectors
for injecting malicious packages into sites using automated installs.  
A simple attack might just involve getting hold of an old, now-unused
homepage domain and placing malicious packages there.  Moreover,
performing a Man-in-The-Middle (MITM) attack between an installation
site and any of the download sites can inject malicious packages on the
installation site.  As many homepages and download locations use
HTTP rather than proper HTTPS, such attacks are not very hard to launch.
Such MITM attacks can happen even for packages which never intended to
host files externally, as their homepages are contacted by installers anyway.

There is currently no way for package maintainers to avoid 3rd party
crawling, other than removing all homepage/download url metadata
for all historic releases.  While a script [3]_ has been written to 
perform this action, it is not a good general solution because it removes
semantic information like the "homepage" specification from PYPI packages.


Proposed solution

The proposed solution consists of the following implementation and
communication steps:

- determine which packages have release files only on PYPI (group A)
  and which have externally hosted release files (group B).

- Prepare the PYPI implementation to allow a per-project "hosting mode",
  effectively enabling or disabling external crawling.  When enabled,
  nothing changes from the current situation: ``simple/`` pages carry
  links attributed ``rel=download`` and ``rel=homepage``,
  causing installers to crawl those sites.
  When disabled, the attributions of the links change
  to ``rel=newdownload`` and ``rel=newhomepage``, causing installers to
  avoid crawling 3rd party sites.  Retaining the meta-information allows
  tools to still make use of the semantic information.

- send mail to maintainers of A that their project is going to be 
  automatically configured to "disable crawling" in one week
  and encourage them to set this mode earlier to help all of 
  their users.

- send mail to maintainers of B that their package hosting mode
  is "crawling enabled", list the sites which currently are crawled,
  and suggest that they re-host their packages directly on PYPI and
  then switch the hosting mode to "disable crawling".  Provide instructions
  and, ideally, tools to help with this "re-uploading" process.
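The per-project hosting mode boils down to a simple server-side rule;
a minimal sketch (the function name is hypothetical; the ``rel`` values
and mode names are the ones proposed above):

```python
def rel_for_link(kind, hosting_mode):
    """Return the rel attribute PYPI would emit for a homepage or
    download link, given a project's hosting mode.

    With "crawl" (and the initial "notset") nothing changes; with
    "nocrawl" the links are renamed so installers stop following them
    while the semantic information is retained on the page.
    """
    assert kind in ("homepage", "download")
    if hosting_mode == "nocrawl":
        return "new" + kind   # rel=newhomepage / rel=newdownload
    return kind               # rel=homepage / rel=download
```

Because only the ``rel`` attribution changes, tools that want the
homepage metadata can still read it; installers simply stop treating
it as a crawl target.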

In addition, maintainers of installation tools are asked to release
two updates.  The first one shall provide clear warnings if external
crawling needs to happen, state exactly for which projects and URLs
this happens, and note that crawling will be disabled by default in the
future.  The next update shall change the default to disallow crawling
and allow it only with an explicit option like ``--crawl-externals``,
plus another option to limit which hosts are allowed to be crawled
at all.
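The second installer update could be sketched as follows
(``should_crawl`` is a hypothetical helper; the option name
``--crawl-externals`` comes from this proposal, and ``allow_hosts``
mirrors the idea of an explicit host whitelist):

```python
from urllib.parse import urlparse

def should_crawl(url, crawl_externals=False,
                 allow_hosts=("pypi.python.org",)):
    """Decide whether an installer may crawl a URL.

    PYPI itself (and any whitelisted host) is always allowed.  External
    hosts are refused unless --crawl-externals was given; when refusing,
    emit the clear warning the first installer update introduces.
    """
    host = urlparse(url).hostname
    if host in allow_hosts:
        return True
    if not crawl_externals:
        print("warning: skipping external link %s "
              "(pass --crawl-externals to follow it)" % url)
        return False
    return True
```

With this default, an installation only contacts 3rd party sites when
the user explicitly opts in, which is the end state the two updates
aim for.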

Hosting-Mode state transitions

1. At the outset, we set hosting-mode to "notset" for all packages.
   This will not change any link served via the simple index and thus
   no bad effects are expected.  Early adopters and testers may now
   change the mode to either "crawl" or "nocrawl" to help iron out
   issues in the PYPI implementation.

2. When maintainers of B packages are mailed, their mode is directly
   set to "crawl".

3. When maintainers of A packages are mailed, we leave the mode at
   "notset" to allow them to change it to "nocrawl" themselves, or to
   set it to "crawl" if they think they were wrongly put in the "A"
   group.  After a week, all "notset" modes are set to "nocrawl".

A week after the mailings all packages will be in "crawl" or "nocrawl"
hosting mode.  It is then a matter of good tools and reaching out to
maintainers of B packages to increase the A/B ratio.
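The transitions above form a small state machine; a sketch (the mode
names are from this proposal; exactly which manual transitions remain
allowed afterwards is an assumption):

```python
# Hosting-mode state machine as described above.
VALID_TRANSITIONS = {
    "notset": {"crawl", "nocrawl"},   # early adopters may pick either
    "crawl": {"nocrawl"},             # B packages after re-hosting
    "nocrawl": {"crawl"},             # opt back in if wrongly grouped
}

def transition(mode, new_mode):
    """Apply a maintainer-initiated mode change, or reject it."""
    if new_mode not in VALID_TRANSITIONS.get(mode, set()):
        raise ValueError("cannot go from %r to %r" % (mode, new_mode))
    return new_mode

def finalize(mode):
    """A week after the mailings, remaining "notset" becomes "nocrawl"."""
    return "nocrawl" if mode == "notset" else mode
```

After ``finalize`` has run, every package is in "crawl" or "nocrawl",
matching the end state described above.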

Open questions

- Should the support tools for "re-hosting" packages be implemented on
  the server side or on the client side?  Implementing them on the client
  side is probably quicker to get right and less fatal in terms of failures.

- double-check that ``rel=newhomepage`` and ``rel=newdownload`` cause the
  desired behaviour in pip and easy_install (both the distribute- and
  setuptools-based versions), i.e. that those pages are not crawled.

- are the "support tools" for re-hosting outside the scope of this PEP?

- Think some more about pip/easy_install "allow-hosts" mode etc.


.. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted, http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html

.. [2] Marc-Andre Lemburg, reasons for external hosting, http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html

.. [3] Holger Krekel, script to remove homepage/download metadata for
       all releases, http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html


Thanks

Philip Eby for precise information and the basic ideas to
implement the transition via server-side changes only.

Donald Stufft for pushing away from external hosting, producing
the 90/10 % statistics script and offering to implement a PR.

Marc-Andre Lemburg, Nick Coghlan and catalog-sig for thinking
through issues regarding getting rid of "external hosting".


This document has been placed in the public domain.
