[Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI

M.-A. Lemburg mal at egenix.com
Tue Mar 12 17:06:26 CET 2013

On 12.03.2013 12:38, holger krekel wrote:
> Hi all,
> below is the new PEP pre-submit version (V2) which incorporates the
> latest suggestions and aims at a rapidly deployable solution.  Thanks in
> particular to Philip, Donald and Marc-Andre.  I also added a few notes
> on how installers should behave with respect to non-PYPI crawling.  
> I think a PEP-like doc is warranted and that we should not silently
> change things without proper communication to maintainers and pre-planning
> the implementation/change process.  Arguably, the changes are more
> invasive than "oh, let's just do a http->https redirect" which didn't
> work too well either.
> Now, if there is some agreement, I can submit this PEP officially tomorrow,
> and given agreement/refinements from the Pycon folks and the likes of
> Richard, we may be able to get going very shortly after Pycon.
> cheers,
> holger
> PEP-draft: transitioning to release-file hosting on PYPI
> ====================================================================
> Status
> -----------
> Abstract
> ------------
> This PEP proposes a backward-compatible transition process to speed up,
> simplify and robustify installing from the pypi.python.org (PYPI)
> package index.  The initial transition will put most packages on PYPI
> automatically in a configuration mode which will prevent client-side
> crawling from installers.  To ease automatic transition and minimize
> client-side friction, **no changes to distutils or installation tools** are
> required.  Instead, the transition is implemented by modifying PYPI to
> serve links from ``simple/`` pages in a configurable way, preventing or
> allowing crawling of non-PYPI sites for detecting release files.
> Maintainers of all PYPI packages will be notified ahead of those
> changes.
> Maintainers of packages which currently are hosted on non-PYPI sites
> shall receive instructions and tools to ease "re-hosting" of their
> historic and future package release files.  The implementation of such
> tools is NOT required for implementing the initial automatic transition.
> Installation tools like pip and easy_install shall warn about crawling
> non-PYPI sites and later default to disallow it and only allow it with
> an explicit option.
> History and motivations for external hosting
> ------------------------------------------------
> When PYPI went online, it offered release registration but had no
> facility to host release files itself.  When hosting was added, no
> automated downloading tool existed yet.  When Philip Eby implemented
> automated downloading (through setuptools), he made the choice 
> to allow people to use download hosts of their choice.  This was
> implemented by the PYPI ``simple/`` index containing links of type
> ``rel=homepage`` or ``rel=download`` which are crawled by installation
> tools to discover package links.  As of March 2013, a substantial part 
> of packages (estimated at about 10%) make use of this mechanism to host
> files on github, bitbucket, sourceforge or own hosting sites like 
> ``mercurial.selenic.com``, to just name a few.
> There are many reasons [2]_ why people choose to use external hosting,
> to cite just a few:
> - release processes and scripts have been developed already and 
>   upload to external sites 
> - it takes too long to upload large files from some places in the world
> - export restrictions e.g. for crypto-related software
> - company policies which prescribe offering open source packages through
>   own sites
> - problems with integrating uploading to PYPI into one's release process
>   (because of release policies)
> - perceived bad reliability of PYPI
> - missing knowledge that you can upload files
> Irrespective of the present-day validity of these reasons, there clearly
> is a history why people choose to host files externally and it even was 
> for some time the only way you could do things.  
> Problem
> ---------------
> **Today, python package installers (pip and easy_install) often need to
> query non-PYPI sites even if there are no externally hosted files**.
> Apart from querying pypi.python.org's simple index pages, also all
> homepages and download pages ever specified with any release of a
> package are crawled by an installer.  The need for installers to
> crawl 3rd party sites slows down installation and makes for a brittle
> unreliable installation process.   Those sites and packages also don't 
> take part in the :pep:`381` mirroring infrastructure, further decreasing
> reliability and speed of automated installation processes around the world. 
> Roughly 90% of packages are hosted directly on pypi.python.org [1]_.
> Even for them installers still need to crawl the homepage(s) of a
> package.  In particular, many package uploaders are not aware that
> specifying the "homepage" in their release process will slow down
> the installation process for all their users.
> Relying on third party sites also opens up more attack vectors
> for injecting malicious packages into sites using automated installs.  
> A simple attack might just involve getting hold of an old now-unused
> homepage domain and placing malicious packages there.  Moreover,
> performing a Man-in-The-Middle (MITM) attack between an installation
> site and any of the download sites can inject malicious packages on the
> installation site.  As many homepages and download locations are using
> HTTP and not proper HTTPS, such attacks are not very hard to launch.
> Such MITM attacks can happen even for packages which never intended to
> host files externally as their homepages are contacted by installers
> anyway.
> There is currently no way for package maintainers to avoid 3rd party
> crawling, other than removing all homepage/download url metadata
> for all historic releases.  While a script [3]_ has been written to 
> perform this action, it is not a good general solution because it removes
> semantic information like the "homepage" specification from PYPI packages.
> Solution
> -----------
> The proposed solution consists of the following implementation and
> communication steps:
> - determine which packages have release files only on PYPI (group A)
>   and which have externally hosted release files (group B).
> - Prepare PYPI implementation to allow a per-project "hosting mode",
>   effectively enabling or disabling external crawling.  When enabled 
>   nothing changes from the current situation of producing ``rel=download`` 
>   and ``rel=homepage`` attributed links on ``simple/`` pages, 
>   causing installers to crawl those sites.  
>   When disabled, the attributions of links will change 
>   to ``rel=newdownload`` and ``rel=newhomepage`` causing installers to
>   avoid crawling 3rd party sites.  Retaining the meta-information allows
>   tools to still make use of the semantic information.

Please start using versioned APIs for these things. The
old style index should still be available under some
URL, e.g. /simple-v1/ or /v1/simple/ or /1/simple/
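
For illustration, the rel switch described in the quoted bullet could
look roughly like this on the server side. This is only a sketch: the
function names and the HTML layout are made up here and are not taken
from the PyPI code base, and the mode names are borrowed from the
"Hosting-Mode state transitions" section of the draft.

```python
from html import escape

# Assumption: "notset" keeps today's behaviour until the project is migrated.
CRAWL_MODES = {"crawl", "notset"}


def rel_value(kind, hosting_mode):
    """Return the rel attribute for a 'homepage' or 'download' link."""
    if kind not in ("homepage", "download"):
        raise ValueError("unknown link kind: %r" % (kind,))
    if hosting_mode in CRAWL_MODES:
        # Unchanged links: installers keep crawling these pages.
        return kind
    if hosting_mode == "nocrawl":
        # Renamed links: the metadata stays, the crawl trigger goes away.
        return "new" + kind
    raise ValueError("unknown hosting mode: %r" % (hosting_mode,))


def render_link(url, kind, hosting_mode):
    """Render one anchor element for a simple/ page (hypothetical layout)."""
    return '<a href="%s" rel="%s">%s</a>' % (
        escape(url, quote=True),
        rel_value(kind, hosting_mode),
        escape(url),
    )
```

Whether installers really ignore the renamed rel values is one of the
open questions of the draft, so the "nocrawl" branch only shows the
intended effect.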

> - send mail to maintainers of A that their project is going to be 
>   automatically configured to "disable crawling" in one week
>   and encourage them to set this mode earlier to help all of 
>   their users.

One week? That's a somewhat unrealistic timeframe.

I'm also missing some real-life tests to see what the effects
are on actual users, e.g. set up the new index under a
URL such as /simple-v2/ and let users play with it for a month
before making /simple/ == /simple-v2/.

> - send mail to maintainers of B that their package hosting mode 
>   is "crawling enabled", and list the sites which currently are crawled,
>   and suggest that they re-host their packages directly on PYPI and 
>   then switch the hosting-mode "disable crawling".  Provide instructions 
>   and at best tools to help with this "re-uploading" process.

That email should clearly state the PyPI terms to not
cause surprises among the maintainers.

I'd hold off on this step until we've sorted out the PyPI terms
issues on the python-legal list, to not cause an uproar
from people who get to read the terms for the first time ;-)

> In addition, maintainers of installation tools are asked to release
> two updates.  The first one shall provide clear warnings if external
> crawling needs to happen, for which projects and URLS exactly 
> this happens, and that in the future crawling will be disabled by default.  
> The next update shall change the default to disallow crawling and allow 
> crawling only with an explicit option like ``--crawl-externals`` and 
> another option allowing to limit which hosts are allowed to be crawled
> at all.

AFAIK, both already exist in easy_install. Not sure about pip.
They are not enabled by default, though.
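
To make the host-limiting idea concrete: easy_install's --allow-hosts
option takes glob patterns that are matched against the host part of
each URL. A minimal sketch of such a check follows. It is not the
setuptools implementation, and the function name host_allowed is made
up for this example.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def host_allowed(url, patterns):
    """Return True if the URL's host part matches one of the glob patterns."""
    # netloc is the user/host/port section of the URL.
    netloc = urlsplit(url).netloc.lower()
    return any(fnmatchcase(netloc, pattern.lower()) for pattern in patterns)


if __name__ == "__main__":
    allowed = ["*.python.org"]
    for candidate in (
        "https://pypi.python.org/simple/example/",
        "http://example.com/downloads/",
    ):
        print(candidate, host_allowed(candidate, allowed))
```

An installer that defaults to no crawling could run such a check
before fetching any page that is not on the index host.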

> Hosting-Mode state transitions
> ----------------------------------
> 1. At the outset, we set hosting-mode to "notset" for all packages.
>    This will not change any link served via the simple index and thus
>    no bad effects are expected.  Early adopters and testers may now
>    change the mode to either "crawl" or "nocrawl" to help with
>    streamlining issues in the PYPI implementation.
> 2. When maintainers of B packages are mailed their mode is directly
>    set to "crawl".
> 3. When maintainers of A are mailed we leave the mode at "notset" to allow
>    people to change it to "nocrawl" themselves or to set it to "crawl" 
>    if they think they are wrongly in the "A" group.  After a week 
>    all "notset" modes are set to "nocrawl".
> A week after the mailings all packages will be in "crawl" or "nocrawl"
> hosting mode.  It is then a matter of good tools and reaching out to
> maintainers of B packages to increase the A/B ratio.
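
The three steps quoted above amount to a small state machine. A rough
sketch, assuming the per-project mode is kept in a plain dict; the
helper names are hypothetical and only the mode names come from the
draft:

```python
NOTSET, CRAWL, NOCRAWL = "notset", "crawl", "nocrawl"


def initial_modes(packages):
    """Step 1: every package starts out as 'notset'."""
    return {name: NOTSET for name in packages}


def mail_group_b(modes, group_b):
    """Step 2: packages with external files are set to 'crawl' right away."""
    for name in group_b:
        modes[name] = CRAWL


def maintainer_sets(modes, name, mode):
    """Maintainers may pick 'crawl' or 'nocrawl' themselves at any time."""
    if mode not in (CRAWL, NOCRAWL):
        raise ValueError("maintainers can only choose crawl or nocrawl")
    modes[name] = mode


def finalize(modes):
    """Step 3: once the grace period ends, remaining 'notset' become 'nocrawl'."""
    for name in list(modes):
        if modes[name] == NOTSET:
            modes[name] = NOCRAWL
```

After finalize() no project is left in "notset", which matches the
end state described in the draft.
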
> Open questions
> ----------------------
> - Should the support tools for "rehosting" packages be implemented  on the
>   server side or on the client side?  Implementing it on the client
>   side probably is quicker to get right and less fatal in terms of failures.

Not sure what you mean here.

You are also completely leaving out the idea of only caching
distribution files on the PyPI CDN, without having to actually
upload them.

> - double-check if ``rel=newhomepage`` and ``rel=newdownload`` cause the 
>   desired behaviour of pip and easy_install (both the distribute and 
>   setuptools based one) to not crawl those pages.

Indeed :-)

Note that it will still be possible to add links to the
distribution files in the long description of the package.

Those links also show up on the /simple/ index page and
will then get used, regardless of whether they have a rel
attribute set or not.
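
A small sketch of what I mean, using the standard library HTML parser.
It models an installer under the assumption that every link on the
page is a candidate file link, while only rel=homepage and
rel=download links get fetched and scanned further. The class and
function names are invented for this example and do not come from
setuptools or pip.

```python
from html.parser import HTMLParser


class SimplePageLinks(HTMLParser):
    """Collect (href, rel) pairs from all anchor tags of a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            # rel is None for plain links, e.g. from the long description.
            self.links.append((href, attributes.get("rel")))


def classify(page):
    """Return (pages_to_crawl, candidate_file_links) for a simple/ page."""
    parser = SimplePageLinks()
    parser.feed(page)
    crawl = [href for href, rel in parser.links if rel in ("homepage", "download")]
    candidates = [href for href, _ in parser.links]
    return crawl, candidates
```

A plain link to a distribution file ends up in the candidates list
even though it has no rel attribute at all, so renaming the rel
values does not affect it.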

> - are the "support tools" for re-hosting outside the scope of this PEP?

As with any PEP proposing an API change or a new API, it
has to provide a reference implementation.

The current distutils upload command is geared towards
uploading files at release time. While it is possible
to trick it into uploading existing distribution files,
it is not at all obvious how this is done.

> - Think some more about pip/easy_install "allow-hosts" mode etc.

Note that tools such as zc.buildout provide easy ways
of adding extra indexes and external URLs to scan for
distribution files.

I'm not sure how the above would fit such use cases,
i.e. if setuptools were to stop crawling external
links by default, this could mean that user-hosted
PyPI-style indexes stop working with newer releases.

Here's an example list of indexes used in Plone 4.2:

# Add additional egg download sources here. dist.plone.org contains archives
# of Plone packages.
find-links =

None of these seem to use the rel attribute feature, so those
will likely continue to work fine.

Marc-Andre Lemburg

