[Distutils] PEP 470 Round 2 - Using Multi Index Support for External to PyPI Package File Hosting

Donald Stufft donald at stufft.io
Fri Jun 6 16:25:57 CEST 2014


On Jun 6, 2014, at 9:41 AM, holger krekel <holger at merlinux.eu> wrote:

> On Fri, Jun 06, 2014 at 07:55 -0400, Donald Stufft wrote:
>> 
>> On Jun 6, 2014, at 4:13 AM, holger krekel <holger at merlinux.eu> wrote:
>> 
>>> Hi Donald,
>>> 
>>> 1. you published numbers where <4K or <300 discounting PIL would be
>>>  affected by PEP470.  You also say that the main reason for deprecating
>>>  PEP438 is that it confused users.  Did it confuse other users than those few?
>> 
>> It confused more people than the current numbers suggest, because at the
>> outset more projects relied on it than do now. Currently PIL is the primary
>> instigator for people’s confusion that I personally see.
> 
> So currently we don't have many confused users anymore.  Doesn't
> this take away a good part of the reasoning behind PEP470?

No.

> 
> In the following i use "PEP438f" to speak about a hypothetical
> follow-up PEP as outlined in my previous mail.  I volunteer to write
> it and present it as an alternative should we not reach some 
> form of conclusion together.
> 
>>> 2. I don't see a valid precise reasoning why PEP438, just agreed on and 
>>>  implemented last year, needs deprecation.  It boosted
>>>  everyone's install experiences (independently from the CDN which
>>>  brought another boost) as usage of crawling dramatically dropped 
>>>  and thus brings us into the exact situation PEP438 already hinted at:
>>> 
>>>  "Deprecation of hosting modes to eventually only allow the
>>>  pypi-explicit mode is NOT REGULATED by this PEP but is expected to
>>>  become feasible some time after successful implementation of the
>>>  transition phases described in this PEP. It is expected that
>>>  deprecation requires a new process to deal with abandoned packages
>>>  because of unreachable maintainers for still popular packages."
>>> 
>>>  We should follow through and discuss removing crawling and 
>>>  how to deal with abandoned packages.  On the PyPI side, what 
>>>  would remain are two kinds of links:
>>> 
>>>  - pypi internally hosted
>>>  - registered safe external links to release files
>>> 
>>>  The resulting situation is:
>>> 
>>>  easy: users have an already existing option to consider to allow externals.
>>> 
>>>  safe: All links served from pypi have checksums. Project maintainers need
>>>        to register hashed links to their new release files.
>>> 
>>>  clean: Pip could eventually remove support for crawling/related options.
>>> 
>>>  This is all easy to do, reduces user confusion and makes pip
>>>  and pypi simpler and less surprising.
>>> 
>>>  I don't see this approach discussed or seriously considered in the PEP,
>>>  also not in its "rejection reasons".
>> 
>> The reasons are listed in the PEP, though I can make it more explicit that
>> it is for this as well.
>> 
>> * People are generally surprised that PyPI allows externally linking to files
>>  and doesn't require people to host on PyPI. In contrast most of them are
>>  familiar with the concept of multiple software repositories such as is in
>>  use by many OSs.
> 
> "People are generally surprised" is a rather subjective statement.
> Wrt to PEP470 we might have at least 65 projects and many more users being 
> annoyed rather than just surprised at the sudden change in direction.
> Especially if there are no compelling arguments.
> 
>> * PyPI is fronted by a globally distributed CDN which has improved the
>>  reliability and speed for end users. It is unlikely that any particular
>>  external host has something comparable. This can lead to extremely bad
>>  performance for end users when the external host is located in different
>>  parts of the world or does not generally have good connectivity.
>> 
>>  As a data point, many users reported sub DSL speeds and latency when
>>  accessing PyPI from parts of Europe and Asia prior to the use of the CDN.
>> 
>> * PyPI has monitoring and an on-call rotation of sysadmins who can respond
>>  quickly to downtime. Again it is
>>  unlikely that any particular external host will have this. This can lead
>>  to single packages in a dependency chain being un-installable. This will
>>  often confuse users, who often times have no idea that this package relies
>>  on an external host, and they cannot figure out why PyPI appears to be up
>>  but the installer cannot find a package.
> 
> Sorry but both points have not much to do with the discussion.  If
> anything, they speak *against* PEP470 because users would need to rely
> on project specific external index sites to even know which releases
> exist.  With PEP438 you know that a certain release file must exist and
> the installer clearly says "i could not download release file X from
> URL".  Works today.
> 
> Also the external index could be temporarily broken and serve not the newest
> files.  The integrity and reliability of external indexes would generally
> not be covered by the CDN and PyPI's on-rotation admins so instead of
> speaking for PEP470 they speak against it.

The point is, end users are *aware* they are relying on something external
and they are aware exactly what external items they are relying on. With PEP 470
people can correctly assume that a pip install is going to be covered by the CDN
and our sysadmins unless they’ve gone out of their way to add an additional index.

Part of this is that I consider --allow-all-external to be a UX failure. I originally thought
it would be alright, but given actual experience with it in the real world, I see constant
confusion about what it does.

> 
>> * PyPI supports mirroring, both for private organizations and public mirrors.
>>  The legal terms of uploading to PyPI ensure that mirror operators, both
>>  public and private, have the right to distribute the software found on PyPI.
>>  However software that is hosted externally does not have this, causing
>>  private organizations to need to investigate each package individually and
>>  manually to determine if the license allows them to mirror it.
>> 
>>  For public mirrors this essentially means that these externally hosted
>>  packages *cannot* be reasonably mirrored. This is particularly troublesome
>>  in countries such as China where the bandwidth to outside of China is
>>  highly congested making a mirror within China often times a massively better
>>  experience.
> 
> With PEP438 today, PyPI merely hosts a link with a checksum, not the file
> itself.  Mirrors also just mirror that link but not the file.
> How exactly is managing and mirroring a link a problem here?

The problem isn’t in the link, it’s in that you can’t automatically mirror the file,
which means that the primary way for an individual or organization to take
control over their own uptime is hampered by the fact that they cannot rely
on the PyPI ToS to allow them to mirror the file.
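
For anyone following along, the "link with a checksum" is a URL whose fragment
carries the expected digest (the PEP mentions ``#md5=...``). A minimal sketch of
the check an installer could run after downloading, assuming a hypothetical
helper named ``verify_hash_fragment`` and a fragment of the form
``<algorithm>=<hexdigest>``. This is not pip's actual code:

```python
import hashlib
from urllib.parse import urlparse

# Assumption for this sketch: only these digest names are accepted.
ALLOWED_ALGORITHMS = {"md5", "sha1", "sha256", "sha512"}


def verify_hash_fragment(url, data):
    """Return True if `data` matches the digest named in the URL fragment.

    Hypothetical helper. A link with no usable fragment cannot be
    verified, so it is reported as False.
    """
    fragment = urlparse(url).fragment
    algorithm, sep, expected = fragment.partition("=")
    if not sep or algorithm not in ALLOWED_ALGORITHMS:
        return False
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected.lower()


payload = b"example release file contents"
link = ("https://example.com/foo-1.0.tar.gz#sha256="
        + hashlib.sha256(payload).hexdigest())
print(verify_hash_fragment(link, payload))                 # matching file
print(verify_hash_fragment(link, payload + b"tampered"))   # altered file
```

A check like this only covers the integrity of the bytes. It says nothing about
whether a mirror has the right to redistribute the file, which is the problem
I'm describing.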

> 
>> * In the long run, global opt in flags like ``--allow-all-external`` will
>>  become little annoyances that developers cargo cult around in order to make
>>  their installer work. When they run into a project that requires it they
>>  will most likely simply add it to their configuration file for that installer
>>  and continue on with whatever they were actually trying to do. This will
>>  continue until they try to install their requirements on another computer
>>  or attempt to deploy to a server where their install will fail again until
>>  they add the "make it work" flag in their configuration file.
> 
> Well, with PEP470 developers would have to cargo cult around one or 
> more "--extra-index-url SOME_URL" options.  Also:

Ah, but you see the nature of ``--extra-index-url`` and ``--find-links`` is that
they *feel* more per-projecty. They are more likely to be added to the
requirements file, because they are project specific, than to a global pip
configuration file. This isn’t cargo culting, this is configuring your project. The
problem with ``--allow-all-external`` is that it lends itself more to being added
to the configuration file for the user (~/.pip/pip.conf) instead of the per
project requirements file. In one of the earlier discussions at least one person
stated that the first time they came across an externally hosted file they
would be likely to just chuck the “allow all external” bit into their pip.conf, making
it unlikely that they will even realize if they are depending on or using something
that requires external hosting.

> 
> - if a project wants to later change its index URL, existing users option
>  will break and the project has no way to communicate it.  With PEP438f
>  you can just update your hashes and be done.  

The project absolutely has a way to communicate it. First, they can update
their metadata on PyPI, which will trigger pip, when it cannot find anything
to satisfy that requirement, to say “hey you want to use this extra index”.
Second, this is all HTTP, and HTTP has a very old and very well tested
way to communicate a change in the URL: the 301 redirect.

> 
> - if a project-specific index resides on a domain that changes ownership 
>  (e.g. project abandoned, maintainer gone), users are suddenly vulnerable
>  without knowing it. With PEP438f the trusted PyPI infrastructure
>  provides a checksummed link so new DNS owners can not break into
>  the user's computer as they can do with PEP470.

Sure, but the most likely outcome is that the domain goes missing at first.
Thanks to PEP 470, pip then hard fails because it can’t find that index instead
of soft failing, the person removes the offending index, and they are
perfectly safe. The window of attack is very small and will get smaller as
we implement other pieces such as package signing.

> 
> - if projects host on HTTP instead of HTTPS, users are 
>  vulnerable against MITM attacks whereas PEP438f provides a checksummed
>  link from PyPI which cannot be easily man-in-the-middled.

Which is why pip warns if you’re using HTTP instead of HTTPS. This will also
end up being mitigated by package signing.

> 
>> Implied but not explicitly called out reason (I’ll add this):
>> 
>> * The URL classification only works for a certain subset of projects, however
>>  it does not allow for any project which needs additional restrictions such
>>  as Access Controls. This means that there would be two methods of doing the
>>  same thing, linking to a file safely and hosting an index. Hosting an index
>>  works in all situations and by relying on this we make for a more consistent
>>  experience no matter the reason for external hosting.
> 
> Once you care for ACLs for indexes and releases you have a number
> of issues to consider, it's hardly related to PEP470/PEP438.

It is related, because it means that the exact same mechanisms can be used,
people don’t have to learn two different ways of specifying externally hosted
projects. In fact it also teaches them how to specify mirrors and the like as well,
something that any devpi user is already going to have to learn how to do.

> 
>> Not implied, but I’ll add as well:
>> 
>> * The safe external hosting option hampers the ability of PyPI to upgrade its
>>  security infrastructure. For instance if MD5 becomes broken in the future
>>  there will be no way for PyPI to upgrade the hashes of the projects which
>>  rely on safe external hosting via MD5 while files that are hosted on PyPI
>>  can simply be processed over with a new hash function.
> 
> This is a good point.  So we should make sure that it's easy to re-upload
> with different hashes and maybe store some more hashes (sha256 etc.) with
> an uploaded file and have pypi decide which one is used.

That assumes the maintainer is still maintaining their software.

> 
>> Not going to add:
>> 
>> Ultimately we're looking at maintaining an additional feature, both in PyPI
>> and in the installers, which almost nobody is actually taking advantage of. A
>> year ago I was hopeful that perhaps people would take advantage of it and maybe
>> that would solve some of the issues. However it's now been an entire year and
>> the buy-in for that feature is minuscule. I do not believe that it makes sense
>> to pay the cost for continuing to have that feature.
> 
> PEP438 and most of us urged people to use internal hosting and that's
> what people did.  That we have 65 projects using safe external hosting
> means we succeeded and there is a clear safe path to host your files
> externally.  Why did you expect more people to use external hosting
> when PyPI's current hosting is perceived as good and we told everyone
> to use it?

No, I expected more people to move to safe external vs staying with the unsafe
external. Strictly speaking there are only 35 projects which safely host their
*latest* version. Looking at the latest version is important because it is an
indicator as to what people plan on doing in the future. The 31 projects which
have *old* files hosted appear to all be projects where maybe one or two
releases were hosted externally (or perhaps only 1 file out of the release).

To put it bluntly, even if we call it 65 projects, that is not enough of a buy in
in my opinion to maintain this feature. People were given the chance to use
it, not enough did. It’s time to simplify the options.

> 
>> It costs time and effort to ensure that these features do not break. It also
>> adds to the cognitive burden of using the installers. We require end users to
>> learn two different ways of specifying they wish to allow external indexes
>> and require them to understand when or why they'd use one or the other. It is
>> my opinion that this makes the end user experience wholly worse.
> 
> However, we have a rather settled situation now and can easily 
> clean it up further: crawling can go, pypi-crawl can go, a number 
> of options go.  All easily done with PEP438f.
> 
>>>  By contrast, PEP470 would require many users to learn about
>>>  specifying other indexes and what that means.  For you and me
>>>  and many here on the list it may be a no-brainer but trust me,
>>>  for many users (i've done ten trainings touching the topic now)
>>>  this is not a natural concept at all.  "pip install --allow-all-externals" 
>>>  is far easier to convey than specifying extra per-project indexes and 
>>>  what it means if the install fails (wrong URL? Index not reachable?
>>>  Release file not found?).
>> 
>> Some projects host their files on PyPI, some files do not, if the thing you're
>> trying to install doesn't host on PyPI you'll have to tell pip where it's
>> hosted. I don't really believe this is a difficult concept.
> 
> Well, with PEP470 you need to tell and explain users:
> "--extra-index-url <some_url_you_need_to_find_out>" versus
> "--allow-all-externals" with PEP438f.
> 
> I am pretty sure the second would win most usability contests.

I disagree. --allow-all-external is bad as I mentioned above, it leads
people to chuck it in their user config and forget about it whereas
--extra-index-url leads people to include it in their requirements.txt.

Additionally it’s not “some url you need to find out” it’s “some url
that we’re going to tell you”. This was detailed in the PEP, projects
would register what that URL is and then pip can present an error
message like:

Files for foo are located on another index, add --extra-index-url https://example.com/index/
to locate them.

This will work massively better than the per project option from PEP 438
as well. People get confused and don’t realize that they need to specify
the thing they are trying to install twice (pip install --allow-external foo foo).
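
A rough sketch of how an installer could build the error message described
above from the registered URLs and comments. The data shape (a list of
``(url, comment)`` pairs) and the function name ``external_index_hint`` are
assumptions for illustration; the PEP does not specify either:

```python
def external_index_hint(project, indexes):
    """Build a hint for a project whose files live on other indexes.

    `indexes` is a list of (url, comment) pairs registered by the
    project; the comment may be an empty string. Hypothetical helper,
    not an existing pip function.
    """
    if not indexes:
        return "No files found for {0}.".format(project)
    lines = [
        "Files for {0} are located on another index. "
        "Add one of the following to locate them:".format(project)
    ]
    for url, comment in indexes:
        line = "  --extra-index-url {0}".format(url)
        if comment:
            # Comments let the project say which index suits which platform.
            line += "  ({0})".format(comment)
        lines.append(line)
    return "\n".join(lines)


print(external_index_hint(
    "foo",
    [("https://example.com/index/", "source releases")],
))
```

The user only ever names the project once, and the option they copy from the
message is the same one they would use for a mirror or a private index.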

> 
>>> 
>>> 3. PEP470 makes life a lot harder for devpi-server, currently used
>>>  by many companies for serving their private indexes.  With PEP438 and
>>>  almost no external crawling left, devpi-server can rely on seeing
>>>  changes through the PEP381 API.  By contrast, with projects hosted on
>>>  additional per-project external indexes, it requires polling to see
>>>  changes because releases may not be registered with PyPI anymore (and
>>>  there is no way to enforce that IISIC).  IOW, PEP470 is a serious
>>>  regression here as it doesn't allow getting notified on new release files.
>> 
>> That only works for the set of projects which are currently safely externally
>> hosted, which again is tiny.
> 
> Well, but it works.  With PEP470 it would not work anymore.  That was my point.
> And the question is anyway not how many projects but how many users are using
> externally hosted files.

And the answer is, barely any. Roughly 116 unique IP addresses on the day I examined,
which represents 0.1% of the total number of unique IP addresses on that day. To be quite
honest it’s more likely that those projects were grabbed by something automated that
was scanning the whole thing rather than an end user actually trying to install them. But
even if it is the case that they were end users, 0.1% is not enough of a buy in to justify
this special case to exist. We have a mechanism for specifying external locations to
fetch from, it’s existed for a long time, we don’t need a special one for an ultra minority.

> 
>> It does not work at all for projects which are
>> hosted externally nor does it work for projects which require some sort of
>> ACL.
> 
> I don't quite understand why you are talking about ACLs in context of our
> discussion.

As I said above, because it matters, the fewer concepts people have to understand
the easier the tooling is to use. If there are two different ways to host externally
then that’s two different concepts users have to learn.

> 
>> Additionally I don't believe devpi *can* actually do that, as it has no way
>> of knowing if it's even legal for it to be mirroring the files if they are
>> not hosted on PyPI. This is one of the points of the PEP, there is a known
>> legal right to distribute files hosted on PyPI, no such right is promised for
>> any file not hosted on PyPI. It's completely possible that devpi is doing
>> something that it has no legal right to do in those cases.
> 
> If I use devpi on my laptop or in an organisation I am using devpi as a
> private http cache and not serving my cached files to 3rd parties.

This would depend greatly upon the license of the theoretical library. IANAL
but it is my understanding that it’s completely possible and even reasonable
for there to exist licenses where this is not allowed. The point being that
by blurring the lines it makes it more difficult for people who have to make sure
they are following the license terms because they have something to lose if
they don’t whereas providing clear separation makes it so much easier.

> 
> best,
> holger
> 
>>> 
>>> best,
>>> holger
>>> 
>>> On Thu, Jun 05, 2014 at 22:08 -0400, Donald Stufft wrote:
>>>> Here's round 2 of PEP 470.
>>>> 
>>>> You can see it online at https://python.org/dev/peps/pep-0470/ or below.
>>>> 
>>>> Notable changes:
>>>> 
>>>> - Ensure it's obvious this strictly deals with the installer API and does not
>>>> affect a project's ability to register their project on PyPI for human
>>>> consumptions.
>>>> 
>>>> - Mention that the functional mechanisms that make it possible for an end user
>>>> to specify the additional locations have existed for a long time across many
>>>> versions of the installers.
>>>> 
>>>> - Explicitly mention that the installer changes from PEP 438 should be
>>>> deprecated and removed as part of this PEP.
>>>> 
>>>> - Explicitly mention pythonhosted.org as a location that authors can use to
>>>> host an index if they do not wish to purchase a TLS certificate or host
>>>> additional infrastructure.
>>>> 
>>>> - Include that a link to PyPI ToS should be included in the emails sent to
>>>> authors to remind them of the PyPI ToS.
>>>> 
>>>> - Special case PIL as it is an outlier in terms of impact.
>>>> 
>>>> - Fill out the impact sections further to provide more detail
>>>> 
>>>> 
>>>> Abstract
>>>> ========
>>>> 
>>>> This PEP proposes that the official means of having an installer locate and
>>>> find package files which are hosted externally to PyPI become the use of
>>>> multi index support instead of the practice of using external links on the
>>>> simple installer API.
>>>> 
>>>> It is important to remember that this is **not** about forcing anyone to host
>>>> their files on PyPI. If someone does not wish to do so they will never be under
>>>> any obligation to. They can still list their project in PyPI as an index, and
>>>> the tooling will still allow them to host it elsewhere.
>>>> 
>>>> This PEP strictly is concerned with the Simple Installer API and how automated
>>>> installers interact with PyPI, it has no bearing on the informational pages
>>>> which are primarily for human consumption.
>>>> 
>>>> 
>>>> Rationale
>>>> =========
>>>> 
>>>> There is a long history documented in PEP 438 that explains why externally
>>>> hosted files exist today in the state that they do on PyPI. For the sake of
>>>> brevity I will not duplicate that and instead urge readers to first take a look
>>>> at PEP 438 for background.
>>>> 
>>>> There are currently two primary ways for a project to make itself available
>>>> without directly hosting the package files on PyPI. They can either include
>>>> links to the package files in the simple installer API or they can publish
>>>> a custom package index which contains their project.
>>>> 
>>>> 
>>>> Custom Additional Index
>>>> -----------------------
>>>> 
>>>> Each installer which speaks to PyPI offers a mechanism for the user invoking
>>>> that installer to provide additional custom locations to search for files
>>>> during the dependency resolution phase. For pip these locations can be
>>>> configured per invocation, per shell environment, per requirements file, per
>>>> virtual environment, and per user. The mechanisms for specifying additional
>>>> locations have existed within pip and setuptools for many years, by comparison
>>>> the mechanisms in PEP 438 and any other new mechanism will have existed for
>>>> only a short period of time (if they exist at all currently).
>>>> 
>>>> The use of additional indexes instead of external links on the simple
>>>> installer API provides a simple clean interface which is consistent with the
>>>> way most Linux package systems work (apt-get, yum, etc). More importantly it
>>>> works the same even for projects which are commercial or otherwise have their
>>>> access restricted in some form (private networks, password, IP ACLs etc)
>>>> while the external links method only realistically works for projects which
>>>> do not have their access restricted.
>>>> 
>>>> Compared to the complex rules which a project must be aware of to prevent
>>>> themselves from being considered unsafely hosted, setting up an index is fairly
>>>> trivial and in the simplest case does not require anything more than a
>>>> filesystem and a standard web server such as Nginx or Twisted Web. Even if
>>>> using simple static hosting without autoindexing support, it is still
>>>> straightforward to generate appropriate index pages as static HTML.
>>>> 
>>>> Example Index with Twisted Web
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> 
>>>> 1. Create a root directory for your index, for the purposes of the example
>>>>  I'll assume you've chosen ``/var/www/index.example.com/``.
>>>> 2. Inside of this root directory, create a directory for each project such
>>>>  as ``mkdir -p /var/www/index.example.com/{foo,bar,other}/``.
>>>> 3. Place the package files for each project in their respective folder,
>>>>  creating paths like ``/var/www/index.example.com/foo/foo-1.0.tar.gz``.
>>>> 4. Configure Twisted Web to serve the root directory, ideally with TLS.
>>>> 
>>>> ::
>>>> 
>>>>   $ twistd -n web --path /var/www/index.example.com/
>>>> 
>>>> 
>>>> Examples of Additional indexes with pip
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> 
>>>> **Invocation:**
>>>> 
>>>> ::
>>>> 
>>>>   $ pip install --extra-index-url https://pypi.example.com/ foobar
>>>> 
>>>> **Shell Environment:**
>>>> 
>>>> ::
>>>> 
>>>>   $ export PIP_EXTRA_INDEX_URL=https://pypi.example.com/
>>>>   $ pip install foobar
>>>> 
>>>> **Requirements File:**
>>>> 
>>>> ::
>>>> 
>>>>   $ printf '%s\n' "--extra-index-url https://pypi.example.com/" "foobar" > requirements.txt
>>>>   $ pip install -r requirements.txt
>>>> 
>>>> **Virtual Environment:**
>>>> 
>>>> ::
>>>> 
>>>>   $ python -m venv myvenv
>>>>   $ printf '%s\n' "[global]" "extra-index-url = https://pypi.example.com/" > myvenv/pip.conf
>>>>   $ myvenv/bin/pip install foobar
>>>> 
>>>> **User:**
>>>> 
>>>> ::
>>>> 
>>>>   $ printf '%s\n' "[global]" "extra-index-url = https://pypi.example.com/" > ~/.pip/pip.conf
>>>>   $ pip install foobar
>>>> 
>>>> 
>>>> External Links on the Simple Installer API
>>>> ------------------------------------------
>>>> 
>>>> PEP 438 proposed a system of classifying file links as either internal,
>>>> external, or unsafe. It recommended that by default only internal links would
>>>> be installed by an installer however users could opt into external links on
>>>> either a global or a per package basis. Additionally they could also opt into
>>>> unsafe links on a per package basis.
>>>> 
>>>> This system has turned out to be *extremely* unfriendly towards the end users
>>>> and it is the position of this PEP that the situation has become untenable. The
>>>> situation as provided by PEP 438 requires an end user to be aware not only of
>>>> the difference between internal, external, and unsafe, but also to be aware of
>>>> what hosting mode the package they are trying to install is in, what links are
>>>> available on that project's /simple/ page, whether or not those links have
>>>> a properly formatted hash fragment, and what links are available from pages
>>>> linked to from that project's /simple/ page.
>>>> 
>>>> There are a number of common confusion/pain points with this system that I
>>>> have witnessed:
>>>> 
>>>> * Users unaware what the simple installer api is at all or how an installer
>>>> locates installable files.
>>>> * Users unaware that even if the simple api links to a file, if it does
>>>> not include a ``#md5=...`` fragment that it will be counted as unsafe.
>>>> * Users unaware that an installer can look at pages linked from the
>>>> simple api to determine additional links, or that any links found in this
>>>> fashion are considered unsafe.
>>>> * Users are unaware and often surprised that PyPI supports hosting your files
>>>> someplace other than PyPI at all.
>>>> 
>>>> In addition to that, the information that an installer is able to provide
>>>> when an installation fails is pretty minimal. We are able to detect if there
>>>> are externally hosted files directly linked from the simple installer api,
>>>> however we cannot detect if there are files hosted on a linked page without
>>>> fetching that page and doing so would cause a massive performance hit just to
>>>> see if there might be a file there so that a better error message could be
>>>> provided.
>>>> 
>>>> Finally very few projects have properly linked to their external files so that
>>>> they can be safely downloaded and verified. At the time of this writing there
>>>> are a total of 65 projects which have files that are only available externally
>>>> and are safely hosted.
>>>> 
>>>> The end result of all of this, is that with PEP 438, when a user attempts to
>>>> install a file that is not hosted on PyPI typically the steps they follow are:
>>>> 
>>>> 1. First, they attempt to install it normally, using ``pip install foobar``.
>>>>  This fails because the file is not hosted on PyPI and PEP 438 has us default
>>>>  to only hosted on PyPI. If pip detected any externally hosted files or other
>>>>  pages that we *could* have attempted to find other files at it will give an
>>>>  error message suggesting that they try ``--allow-external foobar``.
>>>> 2. They then attempt to install their package using
>>>>  ``pip install --allow-external foobar foobar``. If they are lucky foobar is
>>>>  one of the packages which is hosted externally and safely and this will
>>>>  succeed. If they are unlucky they will get a different error message
>>>>  suggesting that they *also* try ``--allow-unverified foobar``.
>>>> 3. They then attempt to install their package using
>>>>  ``pip install --allow-external foobar --allow-unverified foobar foobar``
>>>>  and this finally works.
>>>> 
>>>> These are the same basic steps that practically everyone goes through every time
>>>> they try to install something that is not hosted on PyPI. If they are lucky it'll
>>>> only take them two steps, but typically it requires three steps. Worse there is
>>>> no real indication to these people why one package might install after two
>>>> but most require three. Even worse than that most of them will never get an
>>>> externally hosted package that does not take three steps, so they will be
>>>> increasingly annoyed and frustrated at the intermediate step and will likely
>>>> eventually just start skipping it.
>>>> 
>>>> 
>>>> External Index Discovery
>>>> ========================
>>>> 
>>>> One of the problems with using an additional index is one of discovery. Users
>>>> will not generally be aware that an additional index is required at all much
>>>> less where that index can be found. Projects can attempt to convey this
>>>> information using their description on the PyPI page however that excludes
>>>> people who discover their project organically through ``pip search``.
>>>> 
>>>> To support projects that wish to externally host their files and to enable
>>>> users to easily discover what additional indexes are required, PyPI will gain
>>>> the ability for projects to register external index URLs and additionally an
>>>> associated comment for each. These URLs will be made available on the simple
>>>> page however they will not be linked or provided in a form that older
>>>> installers will automatically search.
>>>> 
>>>> When an installer fetches the simple page for a project, if it finds this
>>>> additional meta-data and it cannot find any files for that project in its
>>>> configured URLs then it should use this data to tell the user how to add one
>>>> or more of the additional URLs to search in. This message should include any
>>>> comments that the project has included to enable them to communicate to the
>>>> user and provide hints as to which URL they might want if some are only
>>>> useful or compatible with certain platforms or situations. When the installer
>>>> has implemented the auto discovery mechanisms they should also deprecate any
>>>> of the mechanisms added for PEP 438 (such as ``--allow-external``) for removal
>>>> at the end of the deprecation period proposed by the PEP.
>>>> 
>>>> This feature *must* be added to PyPI prior to starting the deprecation and
>>>> removal process for link spidering.
>>>> 
>>>> 
>>>> Deprecation and Removal of Link Spidering
>>>> =========================================
>>>> 
>>>> A new hosting mode will be added to PyPI. This hosting mode will be called
>>>> ``pypi-only`` and will be in addition to the three that PEP 438 has already
>>>> given us which are ``pypi-explicit``, ``pypi-scrape``, ``pypi-scrape-crawl``.
>>>> This new hosting mode will modify a project's simple api page so that it only
>>>> lists the files which are directly hosted on PyPI and will not link to anything
>>>> else.
>>>> 
>>>> Upon acceptance of this PEP and the addition of the ``pypi-only`` mode, all new
>>>> projects will be defaulted to the PyPI only mode and they will be locked to
>>>> this mode and unable to change this particular setting. ``pypi-only`` projects
>>>> will still be able to register external index URLs as described above - the
>>>> "pypi-only" refers only to the download links that are published directly on
>>>> PyPI.
>>>> 
>>>> An email will then be sent out to all of the projects which are hosted only on
>>>> PyPI informing them that in one month their project will be automatically
>>>> converted to the ``pypi-only`` mode. A month after these emails have been
>>>> sent, any of the emailed projects which are still hosted only on PyPI will
>>>> have their mode set to ``pypi-only``.
>>>> 
>>>> After that switch, an email will be sent to projects which rely on hosting
>>>> external to PyPI. This email will warn these projects that externally hosted
>>>> files have been deprecated on PyPI and that, 6 months from the time of that
>>>> email, all external links will be removed from the installer APIs. This
>>>> email *must* include instructions for converting their projects to be hosted
>>>> on PyPI and *must* include links to a script or package that will enable them
>>>> to enter their PyPI credentials and package name and have it automatically
>>>> download and re-host all of their files on PyPI. This email *must also*
>>>> include instructions for setting up their own index page and registering that
>>>> with PyPI, including the fact that they can use pythonhosted.org as a host
>>>> for an index page without requiring them to host any additional infrastructure
>>>> or purchase a TLS certificate. This email must also contain a link to the Terms
>>>> of Service for PyPI as many users may have signed up a long time ago and may
>>>> not recall what those terms are.
>>>> 
>>>> Five months after the initial email, another email must be sent to any projects
>>>> still relying on external hosting. This email will include all of the same
>>>> information that the first email contained, except that the removal date will
>>>> be one month away instead of six.
>>>> 
>>>> Finally a month later all projects will be switched to the ``pypi-only`` mode
>>>> and PyPI will be modified to remove the externally linked files functionality.
>>>> At this point in time any installers should finally remove any of the
>>>> deprecated PEP 438 functionality such as ``--allow-external`` and
>>>> ``--allow-unverified`` in pip.
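
The timeline in this section can be laid out as dates. A minimal sketch, assuming a "month" means a calendar month and that the first warning to externally hosted projects goes out on the same day as the switch; both are my reading rather than something the PEP text pins down, and the function names are hypothetical.

```python
import calendar
from datetime import date


def add_months(day, months):
    """Return `day` shifted by `months`, clamping to the end of the month."""
    index = day.year * 12 + (day.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def deprecation_schedule(first_email):
    """Map each milestone of the deprecation process to a date."""
    external_warning = add_months(first_email, 1)
    return {
        "email to PyPI-only projects": first_email,
        "PyPI-only projects converted, first warning to external projects":
            external_warning,
        "second warning to external projects": add_months(external_warning, 5),
        "all projects switched to pypi-only": add_months(external_warning, 6),
    }
```

Calling ``deprecation_schedule(date(2014, 7, 1))`` with a made-up start date gives a seven month span from the first email to the final switch.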
>>>> 
>>>> 
>>>> PIL
>>>> ---
>>>> 
>>>> It's obvious from the numbers below that the vast bulk of the impact comes from
>>>> the PIL project. On 2014-05-17 an email was sent to the contact for PIL
>>>> inquiring whether or not they would be willing to upload to PyPI. A response
>>>> has not been received as of yet (2014-06-05) nor has any change in the hosting
>>>> happened. Due to the popularity of PIL this PEP also proposes that during the
>>>> deprecation period PyPI Administrators will set the PIL download URL as
>>>> the external index for that project, allowing the users of PIL to take
>>>> advantage of the auto discovery mechanisms although the project has seemingly
>>>> become unmaintained.
>>>> 
>>>> 
>>>> Impact
>>>> ======
>>>> 
>>>> The largest impact of this is going to be projects where the maintainers are
>>>> no longer maintaining the project, for one reason or another. For these
>>>> projects it's unlikely that a maintainer will arrive to set the external index
>>>> metadata which would allow the auto discovery mechanism to find it.
>>>> 
>>>> Looking at the numbers factoring out PIL (which has been special cased above)
>>>> the actual impact should be quite low, affecting just the 6.9% of projects
>>>> which host only externally or the 2.8% which have their latest version hosted
>>>> externally. This represents a mere 3883 unique IP addresses. The breakdown of
>>>> this is that of those 3883 addresses, 100% of them installed something that
>>>> could not be verified while only 3% installed something which could be.
>>>> 
>>>> 
>>>> Projects Which Rely on Externally Hosted files
>>>> ----------------------------------------------
>>>> 
>>>> This is determined by crawling the simple index and looking for installable
>>>> files using a similar detection method as pip and setuptools use. The "latest"
>>>> version is determined using ``pkg_resources.parse_version`` sort order and it
>>>> is used to show whether or not the latest version is hosted externally or only
>>>> old versions are.
>>>> 
>>>> ============ ======= ================ =================== =======
>>>> \             PyPI    External (old)   External (latest)   Total
>>>> ============ ======= ================ =================== =======
>>>> **Safe**     38716   31               35                  38782
>>>> **Unsafe**   0       1659             1169                2828
>>>> **Total**    38716   1690             1204                41610
>>>> ============ ======= ================ =================== =======
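
The table can be cross-checked with a few lines of arithmetic. This sketch only re-adds the counts shown above and expresses the external columns as a share of all projects; how the percentages quoted under Impact were derived is not spelled out, so treat the comparison as approximate.

```python
# Counts copied from the table above: (PyPI, external old, external latest).
safe = (38716, 31, 35)
unsafe = (0, 1659, 1169)

# Column totals and the grand total.
totals = tuple(s + u for s, u in zip(safe, unsafe))
all_projects = sum(totals)

# Projects with any externally hosted files, and with the latest one external.
external_any = totals[1] + totals[2]
external_latest = totals[2]

print(totals, all_projects)
print(f"any external files: {external_any / all_projects:.2%}")
print(f"latest version external: {external_latest / all_projects:.2%}")
```

The shares come out just under 7% and just under 3% of all projects, which is in the same range as the figures quoted under Impact.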
>>>> 
>>>> 
>>>> Top Externally Hosted Projects by Requests
>>>> ------------------------------------------
>>>> 
>>>> This is determined by looking at the number of requests the
>>>> ``/simple/<project>/`` page received in a single day. The total number of
>>>> requests during that day was 17,960,467.
>>>> 
>>>> ============================== ========
>>>> Project                        Requests
>>>> ============================== ========
>>>> PIL                            13470
>>>> mysql-connector-python         321
>>>> salesforce-python-toolkit      54
>>>> pyodbc                         50
>>>> elementtree                    44
>>>> atfork                         39
>>>> RBTools                        29
>>>> django-contrib-requestprovider 28
>>>> wadofstuff-django-serializers  23
>>>> Pygame                         21
>>>> ============================== ========
>>>> 
>>>> 
>>>> Top Externally Hosted Projects by Unique IPs
>>>> --------------------------------------------
>>>> 
>>>> This is determined by looking at the IP addresses of requests the
>>>> ``/simple/<project>/`` page received in a single day. The total number of
>>>> unique IP addresses during that day was 105,587.
>>>> 
>>>> ============================== ==========
>>>> Project                        Unique IPs
>>>> ============================== ==========
>>>> PIL                            3515
>>>> mysql-connector-python         117
>>>> pyodbc                         34
>>>> elementtree                    21
>>>> RBTools                        19
>>>> egenix-mx-base                 16
>>>> Pygame                         14
>>>> salesforce-python-toolkit      13
>>>> django-contrib-requestprovider 12
>>>> wxPython                       11
>>>> python-apt                     10
>>>> ============================== ==========
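
To put the two tables in proportion, the PIL rows can be divided by the daily totals quoted with them. This is plain arithmetic on the numbers above, nothing more.

```python
# Daily totals quoted alongside the two tables.
total_requests = 17_960_467
total_unique_ips = 105_587

# PIL rows from the two tables.
pil_requests = 13470
pil_unique_ips = 3515

pil_request_share = pil_requests / total_requests
pil_ip_share = pil_unique_ips / total_unique_ips

print(f"PIL share of requests: {pil_request_share:.3%}")
print(f"PIL share of unique IPs: {pil_ip_share:.2%}")
```

That works out to roughly 0.075% of requests and roughly 3.3% of unique IP addresses for PIL, with every other externally hosted project far below that.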
>>>> 
>>>> 
>>>> Rejected Proposals
>>>> ==================
>>>> 
>>>> Keep the current classification system but adjust the options
>>>> -------------------------------------------------------------
>>>> 
>>>> This PEP rejects several related proposals which attempt to fix some of the
>>>> usability problems with the current system while still keeping the
>>>> general gist of PEP 438.
>>>> 
>>>> This includes:
>>>> 
>>>> * Default to allowing safely externally hosted files, but disallow unsafely
>>>>   hosted.
>>>> * Default to disallowing safely externally hosted files with only a global
>>>>   flag to enable them, but disallow unsafely hosted.
>>>> 
>>>> These proposals are rejected because:
>>>> 
>>>> * The classification "system" is complex, hard to explain, and requires an
>>>>   intimate knowledge of how the simple API works in order to be able to reason
>>>>   about which classification is required. This is reflected in the fact that
>>>>   the code to implement it is complicated and hard to understand as well.
>>>> 
>>>> * People are generally surprised that PyPI allows externally linking to files
>>>>   and doesn't require people to host on PyPI. In contrast most of them are
>>>>   familiar with the concept of multiple software repositories such as those
>>>>   in use by many OSs.
>>>> 
>>>> * PyPI is fronted by a globally distributed CDN which has improved the
>>>>   reliability and speed for end users. It is unlikely that any particular
>>>>   external host has something comparable. This can lead to extremely bad
>>>>   performance for end users when the external host is located in different
>>>>   parts of the world or does not generally have good connectivity.
>>>> 
>>>>   As a data point, many users reported sub-DSL speeds and latency when
>>>>   accessing PyPI from parts of Europe and Asia prior to the use of the CDN.
>>>> 
>>>> * PyPI has monitoring and an on-call rotation of sysadmins who can respond to
>>>>   downtime quickly. Again it is unlikely that any particular external host
>>>>   will have this. This can lead to single packages in a dependency chain
>>>>   being un-installable. This will often confuse users, who often have no
>>>>   idea that this package relies on an external host, and they cannot figure
>>>>   out why PyPI appears to be up but the installer cannot find a package.
>>>> 
>>>> * PyPI supports mirroring, both for private organizations and public mirrors.
>>>>   The legal terms of uploading to PyPI ensure that mirror operators, both
>>>>   public and private, have the right to distribute the software found on PyPI.
>>>>   However software that is hosted externally does not have this, causing
>>>>   private organizations to need to investigate each package individually and
>>>>   manually to determine if the license allows them to mirror it.
>>>> 
>>>>   For public mirrors this essentially means that these externally hosted
>>>>   packages *cannot* be reasonably mirrored. This is particularly troublesome
>>>>   in countries such as China where the bandwidth to outside of China is
>>>>   highly congested, making a mirror within China often a massively better
>>>>   experience.
>>>> 
>>>> * Installers have no method to determine if they should expect any particular
>>>>   URL to be available or not. It is not unusual for the simple API to reference
>>>>   old packages and URLs which have long since stopped working. This causes
>>>>   installers to have to assume that it is OK for any particular URL to not be
>>>>   accessible. This causes problems where a URL is temporarily down or
>>>>   otherwise unavailable (a common cause of this is using a copy of Python
>>>>   linked against a really ancient copy of OpenSSL which is unable to verify
>>>>   the SSL certificate on PyPI) but it *should* be expected to be up. In this
>>>>   case installers will typically silently ignore this URL and later the user
>>>>   will get a confusing error stating that the installer couldn't find any
>>>>   versions instead of getting the real error message indicating that the URL
>>>>   was unavailable.
>>>> 
>>>> * In the long run, global opt in flags like ``--allow-all-external`` will
>>>>   become little annoyances that developers cargo cult around in order to make
>>>>   their installer work. When they run into a project that requires it they
>>>>   will most likely simply add it to their configuration file for that installer
>>>>   and continue on with whatever they were actually trying to do. This will
>>>>   continue until they try to install their requirements on another computer
>>>>   or attempt to deploy to a server where their install will fail again until
>>>>   they add the "make it work" flag in their configuration file.
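
On the point above about URLs that are temporarily down: the confusing "no versions found" error comes from dropping fetch failures on the floor. A rough sketch of the alternative follows, where failures are remembered and reported. The names ``collect_candidates`` and ``NoVersionsFound`` are hypothetical, and ``fetch`` is any callable you supply; none of this is pip's actual code.

```python
class NoVersionsFound(Exception):
    """Raised when no candidate files could be found for a project."""


def collect_candidates(urls, fetch):
    """Gather files from every URL, remembering which URLs failed.

    `fetch` is any callable that returns a list of file names for a URL
    and raises OSError when the URL cannot be reached.
    """
    files, failures = [], {}
    for url in urls:
        try:
            files.extend(fetch(url))
        except OSError as exc:
            # Keep the real error instead of silently ignoring the URL.
            failures[url] = exc
    if not files:
        details = "; ".join(f"{url}: {exc}" for url, exc in failures.items())
        suffix = f" (unreachable: {details})" if details else ""
        raise NoVersionsFound("no versions found" + suffix)
    return files, failures
```

Even this only helps with reporting. The installer still cannot know whether a given external URL was ever expected to be up, which is the underlying problem.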
>>>> 
>>>> 
>>>> -----------------
>>>> Donald Stufft
>>>> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>>>> 


-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
