[Catalog-sig] pre-PEP: transition to release-file hosting at pypi site

holger krekel holger at merlinux.eu
Sun Mar 10 16:07:40 CET 2013


Hi Donald, Richard, Nick, Philip, Marc-Andre, all,

after some more thinking i wrote a simplified PEP draft for
transitioning hosting of release files to pypi.python.org.  A PEP is
warranted IMO because the corresponding changes will affect all Python
package maintainers and the Python packaging ecology in general.  See
the current draft (pre-submit-v1) further below in this mail.
I also created a bitbucket repository, see "PEP-PYPI-DRAFT.txt"  at 

    https://bitbucket.org/hpk42/pep-pypi/src

Donald, i'd be happy if you join as a co-author and contribute
your statistics script and possibly more implementation stuff (PRs 
to pypi software etc.).  

Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig:
scrutiny and feedback welcome.

Nick: if you could collect feedback on the PEP (draft) around the 
packaging and distribution mini-summit at Pycon US (15th March), that'd 
be very useful.  

Richard: I may ask you to become BDFL-delegate for this PEP especially
since you will need to integrate any resulting changes :)

I'd like to formally submit this PEP soon but not before i have
received some feedback.

I am not subscribed to distutils-sig and i think distutils is not much
affected, but it probably still would help if someone cross-posts there
(please put me in CC).

cheers,
holger


PEP-draft: transition to release file hosting at pypi.python.org
=================================================================

Status
-----------

PRE-SUBMIT-v1

Abstract
------------

This PEP proposes to move hosting of all release files to
pypi.python.org itself.  To ease transition and minimize client-side
friction, **no changes to distutils or installers** are required.
Rather, the transition is implemented through changes to the pypi.python.org 
implementation and by interactions with package maintainers.

Problem
---------------

Today, Python package installers (pip and easy_install) need to
query multiple sites to discover release files.  Apart from querying
pypi.python.org's simple index pages, an installer also needs to crawl
all homepages and download pages ever specified with any release of a
package.  The need for installers to crawl 3rd party sites slows down
installation and makes for a brittle, unreliable installation process.

As of March 2013, about 10% of packages have release files which
are not hosted directly from pypi.python.org but rather from places
referenced by download/homepage sites.  

Conversely, roughly 90% of packages are hosted directly on
pypi.python.org [1]_.  Even for them installers still need to crawl the
homepage(s) of a package.  In particular, many package uploaders are
not aware that specifying the "homepage" will slow down the
installation process.


Solution
-----------

Each package is going to get a "hosting mode" field which affects
all historic and future releases of a package and its release files.
The field has these values and meanings:

- "pypi-ext" (transitional) encodes exactly the current mode of operations:
  homepage/download urls are presented in simple/ pages and client-side
  tools need to crawl them themselves to find release file links. 

- "pypi-cache": Release files located on remote sites will be downloaded 
  and cached by pypi.python.org by crawling homepage/download metadata sites.
  The resulting simple index contains links to release files hosted by
  pypi.python.org.  The original homepage/download links are added as
  links without a ``rel`` attribute if they have the ``#egg`` format.

- "pypi-only": homepage/download links are served on simple indexes
  but without a ``rel`` attribute.  Installation tools will thus not
  crawl those pages anymore.  Use this option if you commit to always
  uploading your release files to pypi.python.org.
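
The effect of the three modes on a simple/ page can be sketched roughly
as follows.  This is only an illustration of the rules listed above, not
PyPI code: the function ``render_simple_links``, its parameters and the
link text are assumptions made for the example.

```python
from html import escape

# Mode names come from the draft; everything else here is hypothetical.
HOSTING_MODES = ("pypi-ext", "pypi-cache", "pypi-only")


def render_simple_links(mode, hosted_files, homepage=None, download_url=None):
    """Return the <a> tags a simple/ page might list for one package.

    hosted_files is a list of (filename, url) pairs for release files
    stored on pypi.python.org (uploaded or, in pypi-cache mode, cached).
    """
    if mode not in HOSTING_MODES:
        raise ValueError("unknown hosting mode: %r" % (mode,))

    # Release files hosted on pypi.python.org are listed in every mode.
    links = ['<a href="%s">%s</a>' % (escape(url, quote=True), escape(name))
             for name, url in hosted_files]

    for rel, url in (("homepage", homepage), ("download", download_url)):
        if not url:
            continue
        href = escape(url, quote=True)
        plain = '<a href="%s">%s</a>' % (href, escape(url))
        if mode == "pypi-ext":
            # Current behaviour: the rel attribute makes installers crawl.
            links.append('<a href="%s" rel="%s">%s</a>'
                         % (href, rel, escape(url)))
        elif mode == "pypi-only":
            # Link is still served, but without rel, so it is not crawled.
            links.append(plain)
        elif "#egg=" in url:
            # pypi-cache: only links in the #egg format are kept, without rel.
            links.append(plain)
    return links
```

With mode "pypi-ext" the homepage link carries ``rel="homepage"``; with
the other two modes no link carries a ``rel`` attribute, which is what
keeps installers from crawling external pages.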


Phases of transition
-------------------------

1. At the outset, we set hosting-mode to "pypi-ext" for all packages.
   This will not change any link served via the simple index and thus
   no bad effects are expected.  Early adopters and testers may now
   change the mode to either pypi-only or pypi-cache to help with
   ironing out issues.  Once implementation and UI issues are
   resolved, the next phase can start.

2. We perform automatic analysis for each package to determine if it is
   a package with externally hosted release files.  Packages which only 
   have release files on pypi.python.org are put in the group "A",
   those which have at least some release files hosted externally are
   put in the group "B".

   We then send a mail to all maintainers of packages in A, announcing
   that their hosting-mode is going to be switched automatically to
   "pypi-only" after N weeks, unless they visit their package
   administration page before then and set it to either pypi-cache or
   pypi-only themselves.

   We then send a mail to all maintainers of packages in B, announcing
   that their hosting-mode is going to be switched automatically to
   "pypi-cache" after N weeks, unless they visit their package
   administration page before then and set it to either pypi-only or
   pypi-cache themselves.

3. Finally, all packages will have a hosting mode of either
   "pypi-cache" or "pypi-only", so that installers only query
   release files hosted on pypi.python.org.
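
The grouping and the automatic switch of phase 2 could look roughly like
the sketch below.  The function names, the use of absolute release file
URLs and the treatment of a package without any release files as group
"A" are assumptions made for the example; the draft does not specify
them.

```python
from urllib.parse import urlparse

PYPI_HOST = "pypi.python.org"

# Mode a package is switched to after N weeks if the maintainer does nothing.
DEFAULT_MODE = {"A": "pypi-only", "B": "pypi-cache"}


def classify_package(release_file_urls):
    """Return "A" if all release files are on PyPI, otherwise "B".

    Assumes absolute URLs; a package without release files counts as "A".
    """
    for url in release_file_urls:
        if urlparse(url).hostname != PYPI_HOST:
            return "B"
    return "A"


def planned_mode(current_mode, release_file_urls):
    """Return the hosting mode a package ends up with after phase 2."""
    # A maintainer who already left "pypi-ext" keeps the chosen mode.
    if current_mode != "pypi-ext":
        return current_mode
    return DEFAULT_MODE[classify_package(release_file_urls)]
```

A package with a single externally hosted release file already falls
into group "B" and would therefore default to "pypi-cache".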
  

Transitioning to "pypi-cache" mode
-------------------------------------

When transitioning from the currently implicit "pypi-ext" mode to
"pypi-cache" for a given package, a package maintainer should
be able to retrieve and verify the historic release files which will
be cached by pypi.python.org.  The UI should present this list
and have the maintainer accept it to complete the transition
to the "pypi-cache" mode.  Upon future release registration actions,
pypi.python.org will crawl the homepage/download sites
and cache release files *before* returning a success code for
the release registration.
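
The ordering constraint for release registration (crawl and cache first,
report success last) can be sketched like this.  The three callables
``crawl``, ``fetch`` and ``store`` are hypothetical hooks standing in for
whatever pypi.python.org would use; letting an exception abort the
registration is likewise an assumption, since the draft does not say how
failures are handled.

```python
import posixpath
from urllib.parse import urlparse


def register_release(package, version, crawl, fetch, store):
    """Cache external release files, then report success.

    crawl(package, version) -> iterable of external release file URLs
    fetch(url)              -> file content as bytes
    store(package, filename, data) -> None
    """
    downloaded = []
    for url in crawl(package, version):
        # Drop any fragment such as "#md5=..." before taking the file name.
        filename = posixpath.basename(urlparse(url).path)
        # An exception raised here propagates, so no success is reported.
        downloaded.append((filename, fetch(url)))

    # Store only after every download succeeded (assumed policy).
    for filename, data in downloaded:
        store(package, filename, data)

    return {"status": "ok", "cached": [name for name, _ in downloaded]}
```

Downloading everything before storing anything keeps a failed crawl from
leaving a half-populated cache behind; a real implementation might make
a different choice here.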


References
------------

.. [1] ratio of externally hosted versus pypi-hosted http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html

Acknowledgments
----------------------

Donald Stufft for pushing away from external hosting, writing
the script behind the 90/10 % statistics and offering to implement a PR.

Philip Eby for precise information and the basic idea to
implement the transition via server-side changes only.

Marc-Andre Lemburg, Nick Coghlan and catalog-sig for thinking
through issues regarding getting rid of "external hosting".


Copyright
-----------------

This document has been placed in the public domain.



