[Catalog-sig] pre-PEP: transition to release-file hosting at pypi site

Donald Stufft donald at stufft.io
Sun Mar 10 18:35:00 CET 2013


On Mar 10, 2013, at 11:07 AM, holger krekel <holger at merlinux.eu> wrote:

> Hi Donald, Richard, Nick, Philip, Marc-Andre, all,
> 
> After some more thinking I wrote a simplified PEP draft for
> transitioning hosting of release files to pypi.python.org.  A PEP is
> warranted IMO because the resulting changes will affect all Python
> package maintainers and the Python packaging ecology in general.  See
> the current draft (pre-submit-v1) further below in this mail.
> I also created a bitbucket repository, see "PEP-PYPI-DRAFT.txt"  at 
> 
>    https://bitbucket.org/hpk42/pep-pypi/src
> 
> Donald, i'd be happy if you join as a co-author and contribute
> your statistics script and possibly more implementation stuff (PRs 
> to pypi software etc.).  
> 
> Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig:
> scrutiny and feedback welcome.
> 
> Nick: if you could collect feedback on the PEP (draft) around the 
> packaging and distribution mini-summit at Pycon US (15th March), that'd 
> be very useful.  
> 
> Richard: I may ask you to become BDFL-delegate for this PEP especially
> since you will need to integrate any resulting changes :)
> 
> I'd like to formally submit this PEP soon but not before i got some 
> feedback.
> 
> I am not subscribed to distutils-sig and i think distutils is not much
> affected, but it probably still would help if someone cross-posts there
> (please put me in CC).
> 
> cheers,
> holger
> 
> 
> PEP-draft: transition to release file hosting at pypi.python.org
> =================================================================
> 
> Status
> -----------
> 
> PRE-SUBMIT-v1
> 
> Abstract
> ------------
> 
> This PEP proposes to move hosting of all release files to
> pypi.python.org itself.  To ease transition and minimize client-side
> friction, **no changes to distutils or installers** are required.
> Rather, the transition is implemented through changes to the pypi.python.org 
> implementation and by interactions with package maintainers.
> 
> Problem
> ---------------
> 
> Today, Python package installers (pip and easy_install) need to
> query multiple sites to discover release files.  Apart from querying
> pypi.python.org's simple index pages, an installer also needs to
> crawl all homepages and download pages ever specified with any
> release of a package.  The need for installers to crawl 3rd party
> sites slows down installation and makes for a brittle, unreliable
> installation process.
> 
> As of March 2013, about 10% of packages have release files which
> are not hosted directly from pypi.python.org but rather from places
> referenced by download/homepage sites.  
> 
> Conversely, roughly 90% of packages are hosted directly on
> pypi.python.org [1]_.  Even for them installers still need to crawl the
> homepage(s) of a package.  In particular, many package uploaders are
> not aware that specifying the "homepage" will slow down the
> installation process.
> 
> 
> Solution
> -----------
> 
> Each package is going to get a "hosting mode" field which affects
> all historic and future releases of a package and its release files.
> The field has these values and meanings:
> 
> - "pypi-ext" (transitional) encodes exactly the current mode of operations:
>  homepage/download urls are presented in simple/ pages and client-side
>  tools need to crawl them themselves to find release file links. 
> 
> - "pypi-cache": Release files located on remote sites will be downloaded 
>  and cached by pypi.python.org by crawling homepage/download metadata sites.
>  The resulting simple index contains links to release files hosted by
>  pypi.python.org.  The original homepage/download links are added as
>  links without a ``rel`` attribute if they have the ``#egg`` format.
> 
> - "pypi-only": homepage/download links are served on simple indexes
>  but without a ``rel`` attribute.  Installation tools will thus not
>  crawl those pages anymore.  Use this option if you commit to always
>  uploading your release files to pypi.python.org.
> 
> 
> Phases of transition
> -------------------------
> 
> 1. At the outset, we set hosting-mode to "pypi-ext" for all packages.
>   This will not change any link served via the simple index and thus
>   no bad effects are expected.  Early adopters and testers may now
>   change the mode to either pypi-only or pypi-cache to help with
>   streamlining issues.  After implementation and UI issues are
>   streamlined, the next phase can start.
> 
> 2. We perform automatic analysis for each package to determine if it is
>   a package with externally hosted release files.  Packages which only 
>   have release files on pypi.python.org are put in the group "A",
>   those which have at least some packages outside are put in the group "B".
> 
>   We then send a mail to all maintainers of packages in A
>   that their hosting-mode is going to be switched automatically to
>   "pypi-only" after N weeks, unless they visit their package
>   administration page and set it to either pypi-cache or
>   pypi-only earlier.
> 
>   We then send a mail to all maintainers of packages in B
>   that their hosting-mode is going to be switched automatically to 
>   "pypi-cache" after N weeks, unless they visit their package
>   administration page and set it to either pypi-only or
>   pypi-cache earlier.
> 
> 3. All packages will have a hosting mode of either "pypi-cache"
>   or "pypi-only", so that installers only query
>   release files hosted through pypi.python.org.
> 
> 
> Transitioning to "pypi-cache" mode
> -------------------------------------
> 
> When transitioning from the currently implicit "pypi-ext" mode to
> "pypi-cache" for a given package, a package maintainer should
> be able to retrieve/verify the historic release files which will
> be cached by pypi.python.org.  The UI should present this list
> and have the maintainer accept it for completing the transition
> to the "pypi-cache" mode.  Upon future release registration actions,
> pypi.python.org will perform crawling for the homepage/download sites
> and cache release files *before* returning a success return code for
> the release registration.
> 
> 
> References
> ------------
> 
> .. [1] ratio of externally hosted versus pypi-hosted http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html
> 
> Acknowledgments
> ----------------------
> 
> Donald Stufft for pushing away from external hosting and doing
> the 90/10 % statistics script and offering to implement a PR.
> 
> Philip Eby for precise information and the basic idea to
> implement the transition via server-side changes only.
> 
> Marc-Andre Lemburg, Nick Coghlan and catalog-sig for thinking
> through issues regarding getting rid of "external hosting".
> 
> 
> Copyright
> -----------------
> 
> This document has been placed in the public domain.
> 
> 

Some concerns:

1. We cannot automatically switch people to pypi-cache. We _have_ to get explicit permission from them.
2. The cache mechanism is going to be fragile, and in the long term leaves a window open for security issues.

If we're going to do a phased-in, per-project solution like this, I think it would work much better to have two modes.

1. Legacy - Current behavior, new external links are accepted, existing ones are displayed
2. PyPI Only - New behavior, no new external links are accepted, existing ones are removed

Present the project owners with two one-way buttons:
   - Switch to PyPI Only and re-host external files [1]
   - Switch to PyPI Only and do NOT re-host external files

These buttons would be single-use. Once your project has been switched to PyPI Only you cannot go back to Legacy mode. All new projects would start out in PyPI Only mode. After some amount of time, switch all remaining projects to PyPI Only but _do not_ re-host their packages, as we cannot legally do so without their permission.
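
To make the two modes and the one-way switch concrete, here is a rough Python sketch. It is only an illustration under my own assumptions, not PyPI code: the names `HostingMode`, `Project`, `rehost_queue`, `switch_to_pypi_only` and `force_switch_all` are all made up for this example, and the actual fetching of files is left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class HostingMode(Enum):
    LEGACY = "legacy"        # current behavior, external links allowed
    PYPI_ONLY = "pypi-only"  # new behavior, no external links


@dataclass
class Project:
    # Hypothetical model of a project record; not the real PyPI schema.
    name: str
    mode: HostingMode = HostingMode.PYPI_ONLY  # new projects start as PyPI Only
    external_links: list = field(default_factory=list)
    rehost_queue: list = field(default_factory=list)

    def add_external_link(self, url):
        # Only Legacy projects may register new external links.
        if self.mode is not HostingMode.LEGACY:
            raise ValueError("PyPI Only projects accept no new external links")
        self.external_links.append(url)

    def switch_to_pypi_only(self, rehost):
        # One-way switch: there is no method that goes back to Legacy.
        if self.mode is HostingMode.PYPI_ONLY:
            raise ValueError("project is already PyPI Only")
        if rehost:
            # Owner gave explicit permission: queue files for a one-time fetch.
            self.rehost_queue.extend(self.external_links)
        # In both cases the existing external links are removed.
        self.external_links.clear()
        self.mode = HostingMode.PYPI_ONLY


def force_switch_all(projects):
    # After the deadline: switch the remaining Legacy projects,
    # but never re-host without the owner's permission.
    for project in projects:
        if project.mode is HostingMode.LEGACY:
            project.switch_to_pypi_only(rehost=False)
```

The two buttons map to `switch_to_pypi_only(rehost=True)` and `switch_to_pypi_only(rehost=False)`; the forced switch at the end of the transition always uses `rehost=False`.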

The above is simpler, still provides people an easy migration path, moves us toward removing external hosting, and doesn't entangle us in legal issues.

[1] There is still a small window here where someone could MITM PyPI while it fetches these files. However, since it would be a one-time operation, this risk is minimal and is worth it to move to a PyPI-only solution.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
