[Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI

Tue Mar 12 12:38:17 CET 2013

Hi all,

below is the new PEP pre-submit version (V2) which incorporates the
latest suggestions and aims at a rapidly deployable solution.  Thanks in
particular to Philip, Donald and Marc-Andre.  I also added a few notes
on how installers should behave with respect to non-PYPI crawling.  

I think a PEP-like doc is warranted and that we should not silently
change things without proper communication to maintainers and without
pre-planning the implementation/change process.  Arguably, the changes are
more invasive than "oh, let's just do a http->https redirect", which didn't
work too well either.

Now, if there is some agreement, I can submit this PEP officially tomorrow,
and given agreement/refinements from the Pycon folks and the likes of
Richard, we may be able to get going very shortly after Pycon.

cheers,
holger

PEP-draft: transitioning to release-file hosting on PYPI
====================================================================

Status
-----------

PRE-SUBMIT-v2

Abstract
------------

This PEP proposes a backward-compatible transition process to make
installing from the pypi.python.org (PYPI) package index faster, simpler
and more robust.  The initial transition will automatically put most
packages on PYPI in a configuration mode which prevents client-side
crawling by installers.  To ease automatic transition and minimize
client-side friction, **no changes to distutils or installation tools** are
required.  Instead, the transition is implemented by modifying PYPI to
serve links from ``simple/`` pages in a configurable way, preventing or
allowing crawling of non-PYPI sites for detecting release files.
Maintainers of all PYPI packages will be notified ahead of those
changes.

Maintainers of packages which currently are hosted on non-PYPI sites
shall receive instructions and tools to ease "re-hosting" of their
historic and future package release files.  The implementation of such
tools is NOT required for implementing the initial automatic transition.

Installation tools like pip and easy_install shall warn about crawling
non-PYPI sites and later disallow it by default, permitting it only with
an explicit option.

History and motivations for external hosting
------------------------------------------------

When PYPI went online, it offered release registration but had no
facility to host release files itself.  When hosting was added, no
automated downloading tool existed yet.  When Philip Eby implemented
automated downloading (through setuptools), he made the choice 
to allow people to use download hosts of their choice.  This was
implemented by the PYPI ``simple/`` index containing links of type
``rel=homepage`` or ``rel=download`` which are crawled by installation
tools to discover package links.  As of March 2013, a substantial part
of packages (estimated at about 10%) make use of this mechanism to host
files on github, bitbucket, sourceforge or their own hosting sites like
``mercurial.selenic.com``, to name just a few.
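
As an illustration of this discovery mechanism, here is a minimal sketch of
how a tool could pick out the links an installer would crawl from a
``simple/`` page.  The sample page, the ``example`` project and the helper
names are hypothetical, and real installers apply more rules than this
exact-match check on the ``rel`` attribute.

```python
from html.parser import HTMLParser

# rel values that, per the text above, mark pages to be crawled.
CRAWLED_RELS = ("homepage", "download")


class SimplePageLinks(HTMLParser):
    """Collect (href, rel) pairs from the anchor tags of an index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            self.links.append((href, attributes.get("rel")))


def crawl_targets(page_html):
    """Return the hrefs marked rel=homepage or rel=download."""
    parser = SimplePageLinks()
    parser.feed(page_html)
    return [href for href, rel in parser.links if rel in CRAWLED_RELS]


# Hypothetical page content, for illustration only.
SAMPLE_PAGE = """
<html><body>
<a href="../../packages/source/e/example/example-1.0.tar.gz">example-1.0.tar.gz</a>
<a href="http://example.org/" rel="homepage">1.0 home_page</a>
<a href="http://example.org/downloads/" rel="download">1.0 download_url</a>
</body></html>
"""

targets = crawl_targets(SAMPLE_PAGE)
```

In this sketch the first link has no ``rel`` attribute, so it counts as a
directly listed file and is not a crawl target; only the two
``example.org`` pages are.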

There are many reasons [2]_ why people choose to use external hosting,
to cite just a few:

- release processes and scripts have been developed already and 
  upload to external sites 

- it takes too long to upload large files from some places in the world

- export restrictions e.g. for crypto-related software

- company policies which prescribe offering open source packages through
  own sites

- problems with integrating uploading to PYPI into one's release process
  (because of release policies)

- perceived bad reliability of PYPI

- missing knowledge that you can upload files

Irrespective of the present-day validity of these reasons, there is clearly
a history behind why people choose to host files externally, and for some
time it was even the only way you could do things.

Problem
---------------

**Today, Python package installers (pip and easy_install) often need to
query non-PYPI sites even if there are no externally hosted files**.
In addition to querying pypi.python.org's simple index pages, an installer
also crawls all homepages and download pages ever specified with any
release of a package.  The need for installers to crawl 3rd party sites
slows down installation and makes for a brittle, unreliable installation
process.  Those sites and packages also don't take part in the :pep:`381`
mirroring infrastructure, further decreasing the reliability and speed of
automated installation processes around the world.

Roughly 90% of packages are hosted directly on pypi.python.org [1]_.
Even for them, installers still need to crawl the homepage(s) of a
package.  In particular, many package uploaders are not aware that
specifying the "homepage" in their release process will slow down
the installation process for all of their users.

Relying on third party sites also opens up more attack vectors
for injecting malicious packages into sites using automated installs.
A simple attack might just involve getting hold of an old, now-unused
homepage domain and placing malicious packages there.  Moreover,
performing a Man-in-the-Middle (MITM) attack between an installation
site and any of the download sites can inject malicious packages on the
installation site.  As many homepages and download locations are using
HTTP and not proper HTTPS, such attacks are not very hard to launch.
Such MITM attacks can happen even for packages which never intended to
host files externally, as their homepages are contacted by installers
anyway.

There is currently no way for package maintainers to avoid 3rd party
crawling, other than removing all homepage/download URL metadata
for all historic releases.  While a script [3]_ has been written to
perform this action, it is not a good general solution because it removes
semantic information like the "homepage" specification from PYPI packages.

Solution
-----------

The proposed solution consists of the following implementation and
communication steps:

- Determine which packages have release files only on PYPI (group A)
  and which have externally hosted release files (group B).

- Prepare the PYPI implementation to allow a per-project "hosting mode",
  effectively enabling or disabling external crawling.  When enabled,
  nothing changes from the current situation of producing ``rel=download``
  and ``rel=homepage`` attributed links on ``simple/`` pages,
  causing installers to crawl those sites.
  When disabled, the attributions of links will change
  to ``rel=newdownload`` and ``rel=newhomepage``, causing installers to
  avoid crawling 3rd party sites.  Retaining the meta-information allows
  tools to still make use of the semantic information.

- Send mail to maintainers of A packages announcing that their project
  is going to be automatically configured to "disable crawling" in one
  week, and encourage them to set this mode earlier to help all of
  their users.

- Send mail to maintainers of B packages stating that their package
  hosting mode is "crawling enabled", list the sites which currently are
  crawled, and suggest that they re-host their packages directly on PYPI
  and then switch the hosting mode to "disable crawling".  Provide
  instructions and ideally tools to help with this "re-uploading" process.
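
The per-project hosting mode from the second step could be sketched roughly
as follows.  This is not the PYPI implementation; ``served_rel`` and
``render_link`` are hypothetical helper names, and only the four ``rel``
values come from this draft.

```python
from html import escape

# rel values that cause installers to crawl the linked page.
CRAWLABLE_RELS = {"homepage", "download"}


def served_rel(rel, crawling_enabled):
    """Return the rel value to serve on a simple/ page.

    With crawling enabled the value is unchanged.  With crawling disabled,
    "homepage" and "download" become "newhomepage" and "newdownload", so
    the semantic information is retained under a different name.
    """
    if rel in CRAWLABLE_RELS and not crawling_enabled:
        return "new" + rel
    return rel


def render_link(href, text, rel, crawling_enabled):
    """Render one anchor tag for a simple/ page (hypothetical helper)."""
    return '<a href="{}" rel="{}">{}</a>'.format(
        escape(href, quote=True),
        served_rel(rel, crawling_enabled),
        escape(text),
    )


enabled = render_link("http://example.org/", "home_page", "homepage", True)
disabled = render_link("http://example.org/", "home_page", "homepage", False)
```

The only difference between the two rendered links is the ``rel`` value,
which is why no client-side change is needed for the switch to take effect,
provided installers really ignore the renamed values.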

In addition, maintainers of installation tools are asked to release
two updates.  The first one shall provide clear warnings if external
crawling needs to happen, state for which projects and URLs exactly
this happens, and announce that in the future crawling will be disabled
by default.  The next update shall change the default to disallow crawling
and allow crawling only with an explicit option like ``--crawl-externals``,
plus another option that limits which hosts may be crawled at all.
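
The behaviour asked for in the second installer update might look like the
following sketch.  It is not pip or easy_install code: ``may_fetch`` and its
parameters are hypothetical names, with ``crawl_externals`` standing in for
an option like ``--crawl-externals`` and ``allowed_hosts`` for the
host-limiting option, whose name this draft leaves open.

```python
from urllib.parse import urlsplit

INDEX_HOST = "pypi.python.org"


def may_fetch(url, crawl_externals=False, allowed_hosts=None):
    """Decide whether an installer may fetch ``url``.

    The index host is always allowed.  Any other host is refused unless
    crawling was explicitly requested; if a host list is given as well,
    only hosts on that list may be crawled.
    """
    host = urlsplit(url).hostname
    if host == INDEX_HOST:
        return True
    if not crawl_externals:
        return False  # proposed future default: no external crawling
    if allowed_hosts is None:
        return True
    return host in allowed_hosts


default_ok = may_fetch("http://example.org/downloads/")
explicit_ok = may_fetch("http://example.org/downloads/", crawl_externals=True)
limited_ok = may_fetch(
    "http://example.org/downloads/",
    crawl_externals=True,
    allowed_hosts={"mercurial.selenic.com"},
)
```

The first update described above would keep fetching in all of these cases
and only emit a warning wherever this function returns ``False``.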

Hosting-Mode state transitions
----------------------------------

1. At the outset, we set hosting-mode to "notset" for all packages.
   This will not change any link served via the simple index and thus
   no bad effects are expected.  Early adopters and testers may now
   change the mode to either "crawl" or "nocrawl" to help with
   streamlining issues in the PYPI implementation.

2. When maintainers of B packages are mailed, their mode is directly
   set to "crawl".

3. When maintainers of A packages are mailed, we leave the mode at "notset"
   to allow people to change it to "nocrawl" themselves or to set it to
   "crawl" if they think they are wrongly in the "A" group.  After a week,
   all "notset" modes are set to "nocrawl".

A week after the mailings, all packages will be in either "crawl" or
"nocrawl" hosting mode.  It is then a matter of good tools and reaching
out to maintainers of B packages to increase the A/B ratio.
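
The transitions above can be written down as a small sketch.  The function
names are hypothetical, and one detail is an assumption rather than part of
this draft: a mode that an early adopter already set explicitly is kept
when the mail to B maintainers goes out.

```python
NOTSET, CRAWL, NOCRAWL = "notset", "crawl", "nocrawl"


def mode_after_mailing(mode, group):
    """Apply steps 2 and 3 at mailing time.

    Group B packages move from "notset" to "crawl".  Group A packages
    stay as they are so that maintainers can choose for themselves.
    Assumption: an explicit earlier choice is never overwritten.
    """
    if group == "B" and mode == NOTSET:
        return CRAWL
    return mode


def mode_after_grace_period(mode):
    """One week after the mailing, every remaining "notset" becomes "nocrawl"."""
    return NOCRAWL if mode == NOTSET else mode


# Walk one package of each group through both steps.
final_a = mode_after_grace_period(mode_after_mailing(NOTSET, "A"))
final_b = mode_after_grace_period(mode_after_mailing(NOTSET, "B"))
```

After both steps no package is left in "notset", which matches the end
state described above.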

Open questions
----------------------

- Should the support tools for "rehosting" packages be implemented on the
  server side or on the client side?  Implementing them on the client
  side is probably quicker to get right and less harmful when something
  fails.

- Double-check whether ``rel=newhomepage`` and ``rel=newdownload`` cause
  the desired behaviour of pip and easy_install (both the distribute-based
  and the setuptools-based one), namely to not crawl those pages.

- Are the "support tools" for re-hosting outside the scope of this PEP?

- Think some more about pip/easy_install "allow-hosts" mode etc.

References
------------

.. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted, http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html

.. [2] Marc-Andre Lemburg, reasons for external hosting, http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html

.. [3] Holger Krekel, script to remove homepage/download metadata for
       all releases, http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html

Acknowledgments
----------------------

Philip Eby for precise information and the basic ideas to
implement the transition via server-side changes only.

Donald Stufft for pushing away from external hosting, writing
the 90/10% statistics script and offering to implement a PR.

Marc-Andre Lemburg, Nick Coghlan and catalog-sig for thinking
through issues regarding getting rid of "external hosting".

Copyright
-----------------

This document has been placed in the public domain.