[Catalog-sig] V4 Pre-PEP: transition to release-file hosting on PYPI

holger krekel holger at merlinux.eu
Fri Mar 15 10:29:59 CET 2013


Hi all, in particular Philip, Marc-Andre, Donald,

Carl and I decided to simplify the PEP and avoid the somewhat
awkward ``simple/-with-externals`` index for various reasons, among them
Marc-Andre's criticisms.  This also means present-day installation tools
(shipped with Redhat/Debian/etc.) will continue to work as today for
those packages which remain in a hosting-mode that requires crawling and
scraping.  They will still benefit from the fact that most packages will
soon have a hosting-mode that avoids it.  Future releases of installation
tools will default to neither crawling nor using (scraped) external
links, and new PyPI projects will default to serving only uploaded files.

The V4 pre-PEP also renames the three PyPI hosting modes to be more
descriptive. Since all three modes allow external links, "pypi-ext" vs
"pypi-only" were misleading. The new naming distinguishes the mode that both
scrapes links from metadata and crawls external pages for more links
("pypi-scrape-crawl") from the mode that only scrapes links from metadata
("pypi-scrape") from the mode where all links are explicit ("pypi-explicit").

Without the separate external index, it also turns out that the two transition
phases are separated into PyPI changes (phase one) and installer-tool
updates (phase two). There are no PyPI changes necessary in phase two.
As stated in a new open question, it should be possible to do
PEP-related installation tool updates during phase 1; that may still
require a bit of clarification in the PEP's language.

Carl and I are happy with this PEP version now and hope you all are as
well.  Donald is already working on improving the analysis tool so
we hopefully have some updated numbers soon.

cheers,

Holger


PEP: XXX
Title: Transitioning to release-file hosting on PyPI
Version: $Revision$
Last-Modified: $Date$
Author: Holger Krekel <holger at merlinux.eu>, Carl Meyer <carl at oddbird.net>
Discussions-To: catalog-sig at python.org
Status: Draft (PRE-submit V4)
Type: Process
Content-Type: text/x-rst
Created: 10-Mar-2013
Post-History:


Abstract
========

This PEP proposes a backward-compatible two-phase transition process
to make installing from the pypi.python.org (PyPI) package index
faster, simpler and more robust.  To ease the transition and
minimize client-side friction, **no changes to distutils or existing
installation tools are required in order to benefit from the first
transition phase, which will result in faster, more reliable installs
for most existing packages**.

The first transition phase implements an easy and explicit means for a
package maintainer to control which release file links are served to
present-day installation tools.  The first phase also includes the
implementation of analysis tools for present-day packages, to support
communication with package maintainers and the automated setting of
default modes for controlling release file links.  In the first
phase, new projects on PyPI will also default to serving only
links to release files which were uploaded to PyPI.

The second transition phase concerns end-user installation tools,
which shall default to only install release files that are hosted on
PyPI and tell the user if external release files exist, offering
a choice to automatically use those external files.


Rationale
=========

.. _history:

History and motivations for external hosting
--------------------------------------------

When PyPI went online, it offered release registration but had no
facility to host release files itself.  When hosting was added, no
automated downloading tool existed yet.  When Philip Eby implemented
automated downloading (through setuptools), he made the choice to
allow people to use download hosts of their choice.  The finding of
externally-hosted packages was implemented as follows:

#. The PyPI ``simple/`` index for a package contains all links
   scraped from the long_description metadata of any release of that
   package. Links in the "Download-URL" and "Home-page" metadata
   fields are given ``rel=download`` and ``rel=homepage`` attributes,
   respectively.

#. Any of these links whose target is a file whose name appears to be
   in the form of an installable source or binary distribution, with
   name in the form "packagename-version.ARCHIVEEXT", is considered a
   potential installation candidate by installation tools.

#. Similarly, any links suffixed with an "#egg=packagename-version"
   fragment are considered an installation candidate.

#. Additionally, the ``rel=homepage`` and ``rel=download`` links are
   crawled by installation tools and, if HTML, are themselves scraped
   for release-file links in the above formats.
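
The candidate rules above (steps 2 and 3) can be sketched roughly as
follows.  This is only an illustration, not the code of any real
installer: the function name ``is_candidate``, the list of archive
extensions and the assumption that a version starts with a digit are
all made up for the example, and real tools also normalize package
names.

```python
import re
from urllib.parse import urlsplit

# Assumed set of archive extensions; real tools recognise more formats.
ARCHIVE_EXTS = (".tar.gz", ".tar.bz2", ".tgz", ".zip", ".egg")


def is_candidate(url, package):
    """Return True if url looks like an installable release of package."""
    parts = urlsplit(url)
    # Step 3: an explicit "#egg=packagename-version" fragment.
    if parts.fragment.startswith("egg=" + package + "-"):
        return True
    # Step 2: a file name of the form "packagename-version.ARCHIVEEXT".
    filename = parts.path.rsplit("/", 1)[-1]
    for ext in ARCHIVE_EXTS:
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            # Assumption: the version part starts with a digit.
            return re.match(re.escape(package) + r"-\d", stem) is not None
    return False
```

Any link that passes such a check becomes an installation candidate,
no matter where in the metadata it was found.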

Today, most packages released on PyPI host their release files on
PyPI, but a small percentage (XXX need updated data) rely on external
hosting.

There are many reasons [2]_ why people have chosen external
hosting. To cite just a few:

- release processes and scripts have been developed already and upload
  to external sites

- it takes too long to upload large files from some places in the
  world

- export restrictions e.g. for crypto-related software

- company policies which require offering open source packages
  through the company's own sites

- problems with integrating uploading to PyPI into one's release
  process (because of release policies)

- desiring download statistics different from those maintained by PyPI

- perceived bad reliability of PyPI

- not aware that PyPI offers file-hosting

Irrespective of the present-day validity of these reasons, there is
clearly a history behind the choice to host files externally, and for
some time it was even the only way to do things.  This PEP
takes the position that there are at least some valid reasons for
external hosting.

Problem
-------

**Today, Python package installers (pip, easy_install, buildout, and
others) often need to query many non-PyPI URLs even if there are no
externally hosted files**.  Apart from querying pypi.python.org's
simple index pages, an installer also crawls all homepages and
download pages ever specified with any release of a package.
The need for installers to crawl external sites slows down
installation and makes for a brittle and unreliable installation
process.  Those sites and packages also don't take part in the
:pep:`381` mirroring infrastructure, further decreasing reliability
and speed of automated installation processes around the world.

Most packages are hosted directly on pypi.python.org [1]_.  Even for
these packages, installers still crawl their homepage and
download-url, if specified.  Many package uploaders are not aware that
specifying the "homepage" or "download-url" in their package metadata
will needlessly slow down the installation process for all users.

Relying on third party sites also opens up more attack vectors for
injecting malicious packages into sites using automated installs.  A
simple attack might just involve getting hold of an old now-unused
homepage domain and placing malicious packages there.  Moreover,
performing a Man-in-The-Middle (MITM) attack between an installation
site and any of the download sites can inject malicious packages on
the installation site.  As many homepages and download locations are
using HTTP and not HTTPS, such attacks are not hard to launch.  Such
MITM attacks can easily happen even for packages which never intended
to host files externally as their homepages are contacted by
installers anyway.

There is currently no way for package maintainers to avoid
external-link crawling, other than removing all homepage/download url
metadata for all historic releases.  While a script [3]_ has been
written to perform this action, it is not a good general solution
because it removes useful metadata from PyPI releases.

Even if the sites referenced by "Homepage" and "Download-URL" links were 
not scraped for further links, there is no obvious way under the current
system for a package owner to link to an installable file from a 
long_description metadata field (which is shown as package documentation
on ``/pypi/PKG``) without installation tools automatically considering
that file a candidate for installation.  Conversely, there is no way
to explicitly register multiple external release files without
putting them in metadata fields.


Goals
-----

These are the goals to be achieved by implementation of this PEP:

* Package owners should be able to explicitly control which files are
  presented by PyPI to installer tools as installation
  candidates. Installation should not be slowed and made less reliable
  by extensive and unnecessary crawling of links that package owners
  did not explicitly nominate as installation files.

* It should remain possible for package owners to choose to host their
  release files on their own hosting, external to PyPI. It should be
  easy for a user to request the installation of such releases using
  automated installer tools.

* Automated installer tools should not install externally-hosted
  packages **by default**, but only when explicitly authorized to do
  so by the user. When tools refuse to install such a package by
  default, they should tell the user exactly which external link(s)
  they would need to follow, and what option(s) the user can provide
  to authorize the tool to follow those links. PyPI should provide all
  necessary metadata for installer tools to implement this easily
  and within a single request/reply interaction.

* Migration from the status quo to the above points should be gradual
  and minimize breakage. This includes tooling that makes it easy for
  package owners with an existing release process that uploads to
  non-PyPI hosting to also upload those release files to PyPI.  


Solution / two transition phases
================================

The first transition phase introduces a "hosting-mode" field for each
project on PyPI, allowing package owners explicit control of which
release file links are served to present-day installation tools in the
machine-readable ``simple/`` index.  Once individual early adopters
have successfully changed hosting modes by hand, the first transition
phase will set a default hosting mode for existing packages, based on
automated analysis.  **Maintainers will be notified one month ahead of
any such automated change**.  At completion of the first transition
phase, **all present-day existing release and installation processes
and tools are expected to continue working**.  Any remaining errors or
problems are expected to only relate to installation of individual
packages and can be easily corrected by package maintainers or PyPI
admins if maintainers are not reachable.

Also in the first phase, each link served in the ``simple/`` index
will be explicitly marked as ``rel="internal"`` (hosted by the index
itself) or ``rel="external"`` (linking to an external site that is not
part of the index).

In the second transition phase, PyPI client installation tools shall
be updated to default to only install ``rel="internal"`` packages
unless a user specifies option(s) to permit installing from external
links.

Maintainers of packages which currently host release files on non-PyPI
sites shall receive instructions and tools to ease "re-hosting" of
their historic and future package release files.  This re-hosting tool
MUST be available before automated hosting-mode changes are announced
to package maintainers.


Implementation
==============

Hosting modes
-------------

The foundation of the first transition phase is the introduction of
three "modes" of PyPI hosting for a package, affecting which links are
generated for the ``simple/`` index.  These modes are implemented
without requiring changes to installation tools via changes to the
algorithm for generating the machine-readable ``simple/`` index.

The modes are:

- ``pypi-scrape-crawl``: no change from the current situation of
  generating machine-readable links for installation tools, as
  outlined in the history_.

- ``pypi-scrape``: for a package in this mode, links to be added to
  the ``simple/`` index are still scraped from package
  metadata. However, the "Home-page" and "Download-URL" links are
  given ``rel=ext-homepage`` and ``rel=ext-download`` attributes
  instead of ``rel=homepage`` and ``rel=download``. The effect of this
  (with no change in installation tools necessary) is that these links
  will not be followed and scraped for further candidate links by present-day
  installation tools: only installable files directly hosted on PyPI or
  linked directly from PyPI metadata will be considered for installation.
  Installation tools MAY evolve to offer an option to use the new
  rel attributes to crawl external pages but MUST NOT default to it.

- ``pypi-explicit``: for a package in this mode, only links to release
  files uploaded to PyPI, and external links to release files
  explicitly nominated by the package owner (via a new interface
  exposed by PyPI) will be added to the ``simple/`` index.
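
To make the difference between the three modes concrete, here is a
small sketch of how they treat metadata links.  The table layout, the
key names and the helper function are hypothetical and only restate
the descriptions above; they are not part of any PyPI code.

```python
# Hypothetical table restating the three hosting modes.
# "scrape_metadata": are links scraped from release metadata added
# to the simple/ index?  The *_rel values are the rel attributes given
# to the Home-page and Download-URL links (None: link is not served).
HOSTING_MODES = {
    "pypi-scrape-crawl": {
        "scrape_metadata": True,
        "homepage_rel": "homepage",
        "download_rel": "download",
    },
    "pypi-scrape": {
        "scrape_metadata": True,
        "homepage_rel": "ext-homepage",
        "download_rel": "ext-download",
    },
    "pypi-explicit": {
        "scrape_metadata": False,
        "homepage_rel": None,
        "download_rel": None,
    },
}


def crawled_by_present_day_tools(mode):
    """True if today's installers would crawl external pages."""
    rels = HOSTING_MODES[mode]
    # Present-day tools only follow rel=homepage and rel=download.
    return rels["homepage_rel"] == "homepage" or rels["download_rel"] == "download"
```

Only ``pypi-scrape-crawl`` keeps the rel values that present-day
tools follow, which is why the other two modes stop the crawling
without any client-side change.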

Thus the hope is that eventually all projects on PyPI can be migrated
to the ``pypi-explicit`` mode, while preserving the ability to install
release files hosted externally via installer tools. Deprecation of
hosting modes to eventually only allow the ``pypi-explicit`` mode is
NOT REGULATED by this PEP but is expected to become feasible some time
after successful implementation of the transition phases described in
this PEP.  Deprecation is expected to require **a new process to deal
with abandoned packages**, because the maintainers of some still-popular
packages are unreachable.


First transition phase (PyPI)
-----------------------------

The proposed solution consists of multiple implementation and
communication steps:

#. Implement in PyPI the three modes described above, with an
   interface for package owners to select the mode for each package
   and register explicit external file URLs.

#. For packages in all modes, label all links in the ``simple/`` index
   with ``rel="internal"`` or ``rel="external"``, to make it easier
   for client tools to distinguish the types of links in the second
   transition phase.

#. Default all newly-registered packages to ``pypi-explicit`` mode
   (package owners can still switch to the other modes as desired).

#. Determine (via an automated analysis tool) which packages have all
   installable files available on PyPI itself (group A), which have
   all installable files linked directly from PyPI metadata (group B),
   and which have installable versions available that are linked only
   from external homepage/download HTML pages (group C).

#. Send mail to maintainers of projects in group A that their project
   will be automatically configured to ``pypi-explicit`` mode in one
   month, and similarly to maintainers of projects in group B that
   their project will be automatically configured to ``pypi-scrape``
   mode.  Inform them that this change is not expected to affect
   installability of their project at all, but will result in faster
   and safer installs for their users.  Encourage them to set this
   mode themselves sooner to benefit their users.

#. Send mail to maintainers of packages in group C that their package
   hosting mode is ``pypi-scrape-crawl``, list the URLs which
   currently are crawled, and suggest that they either re-host their
   packages directly on PyPI and switch to ``pypi-explicit``, or at
   least provide direct links to release files in PyPI metadata and
   switch to ``pypi-scrape``.  Provide instructions and tools to help
   with these transitions.
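
The grouping in steps 4 to 6 could be expressed along these lines.
This is one possible reading of groups A, B and C and not the actual
analysis tool; the function name and its three input sets are
assumptions made for the sketch.

```python
def proposed_mode(on_pypi, in_metadata, crawled_only):
    """Pick a default hosting mode from three sets of release files.

    on_pypi:      files uploaded to PyPI itself
    in_metadata:  files linked directly from PyPI metadata
    crawled_only: files found only by crawling homepage/download pages
    """
    if crawled_only:
        # Group C: some releases are reachable only by crawling,
        # so the current behaviour is kept.
        return "pypi-scrape-crawl"
    if in_metadata - on_pypi:
        # Group B: some releases are reachable only via direct
        # metadata links.
        return "pypi-scrape"
    # Group A: everything installable is on PyPI itself.
    return "pypi-explicit"
```

For example, a project whose only release file was uploaded to PyPI
falls into group A, while one extra file found solely on a crawled
download page is enough to keep a project in group C.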


Second transition phase (installer tools)
-----------------------------------------

For the second transition phase, maintainers of installation tools are
asked to release two updates. 

The first update shall provide clear warnings if externally-hosted
release files (that is, files whose link is ``rel="external"``) are
selected for download, state exactly for which projects and URLs this
happens, and warn that in future versions externally-hosted downloads
will be disabled by default.

The second update should change the default mode to allow only
installation of ``rel="internal"`` package files, and allow
installation of externally-hosted packages only when the user supplies
an option (ideally an option specifying exactly which external domains
are to be trusted as download sources). When download of an
externally-hosted package is disallowed, the user should be notified,
with instructions for how to make the install succeed and warnings
about the implication (that a file will be downloaded from a site that
is not part of the package index).
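
A rough sketch of the default described for the second update is shown
here.  It assumes that a ``simple/`` page is plain HTML with one ``a``
element per link; the ``trusted_hosts`` parameter stands in for
whatever option a tool chooses to offer, and all names and URLs are
made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect (href, rel values) pairs from all <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if attrs.get("href"):
                rels = (attrs.get("rel") or "").split()
                self.links.append((attrs["href"], rels))


def select_links(html, trusted_hosts=()):
    """Split links into those allowed by default and those refused."""
    collector = LinkCollector()
    collector.feed(html)
    allowed, refused = [], []
    for href, rels in collector.links:
        if "internal" in rels:
            allowed.append(href)
        elif "external" in rels:
            # External files need explicit authorization by the user.
            if urlsplit(href).hostname in trusted_hosts:
                allowed.append(href)
            else:
                refused.append(href)
    return allowed, refused
```

The ``refused`` list is what a tool would report to the user, together
with the option needed to permit those downloads.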


Open questions / Tasks
======================

- Should we introduce some form of PyPI API versioning in this PEP?
  (it might complicate matters and delay the implementation but is
  often seen as good practice).

- In ``pypi-scrape`` mode: should PyPI itself determine which links are
  installation candidates and avoid presenting other random links (which
  are currently served)?

- Consider that installation tools may choose to release updates
  already during transition phase 1, to warn about crawling and scraped
  links (which are easily identifiable today, and remain so with the new
  rel attributes after transition phase 1).


References
==========

.. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted, http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html (XXX need to update this data for all easy_install-supported formats)

.. [2] Marc-Andre Lemburg, reasons for external hosting, http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html

.. [3] Holger Krekel, script to remove homepage/download metadata for all releases, http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html

Acknowledgments
================

Philip Eby for precise information and the basic ideas to implement
the transition via server-side changes only.

Donald Stufft for pushing away from external hosting and offering to
implement both a Pull Request for the necessary PyPI changes and the
analysis tool to drive transition phase 1.

Marc-Andre Lemburg, Nick Coghlan and catalog-sig in general for
thinking through issues regarding getting rid of "external hosting".

Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:


