[Distutils] PEP470, backward compat is a ...

Donald Stufft donald at stufft.io
Fri May 16 14:20:52 CEST 2014


On May 16, 2014, at 8:06 AM, holger krekel <holger at merlinux.eu> wrote:

> On Fri, May 16, 2014 at 07:20 -0400, Donald Stufft wrote:
>> On May 16, 2014, at 6:16 AM, holger krekel <holger at merlinux.eu> wrote:
>> 
>>> Hi Donald, Nick, Richard, all,
>>> 
>>> finally got around to reading and thinking about the issues discussed in PEP470.  
>>> First of all thanks for going through the effort of trying to 
>>> advance the overall situation with a focus on making it easier 
>>> for our wonderful and beloved "end users" :)
>>> 
>>> However, I think PEP470 needs to achieve stronger backward compatibility for
>>> end-users because, as is typical for the 99%, they like to see change
>>> but hate to be forced to change themselves.
>>> 
>>> Allow me to remind you of how PEP438 worked in this regard: all
>>> end users always remained able to install all projects, including those
>>> with ancient tools and they all benefitted from the changes PEP438
>>> brought: 90% of the projects were automatically switched to
>>> "pypi-explicit" mode, speeding up and making more reliable installs for
>>> everyone across the board.  Let me thank specifically and once
>>> again our grand tooler Donald here who implemented most of it.
>>> 
>>> However, PEP470 does not achieve this level of backward compatibility yet.
>>> Let's look at its current procedure leading up to the final switch:
>>> 
>>>   "After that switch, an email will be sent to projects which rely on
>>>   hosting external to PyPI. This email will warn these projects that
>>>   externally hosted files have been deprecated on PyPI and that in 6
>>>   months from the time of that email that all external links will be
>>>   removed from the installer APIs. (...)
>>> 
>>>   Five months after the initial email, another email must be sent to
>>>   any projects still relying on external hosting. (...)
>>> 
>>>   Finally a month later all projects will be switched to the pypi-only
>>>   mode and PyPI will be modified to remove the externally linked files
>>>   functionality."
>>> 
>>> This process tries to trigger changes from the 2974 project maintainers 
>>> who are today operating in pypi-crawl* modes.  If we are left with 1000 
>>> stale project maintainers at final-switch time, and speculate about just 100 
>>> downloads for each of their projects, this final switch may give 
>>> us 100000 failing installation interactions the day after it happens.  
>>> The number might be higher or lower, but I hope we agree that we'll very 
>>> likely have a significant "stale project maintainer" problem affecting 
>>> many end-users, existing CI installations, etc.
>>> 
>>> Even for those maintainers who switch to use an external index
>>> as currently advertised by the PEP, and with their release files also
>>> being downloaded 100 times each, we'll have another 50000 interactions 
>>> from end users who need to re-configure their tool usage to switch to 
>>> the external index.  Granted, those using a new pip version would get 
>>> a useful hint about how to do that.  Others, using older versions, would 
>>> have to discover the project's pypi website to hopefully understand how 
>>> to make their stuff work again.
>>> 
>>> In any case, we'd likely get a ton of end-user-side installation issues, 
>>> and I think PEP470 needs to be modified to try to minimize this number.
>>> It could take the ball where PEP438 dropped it:
>>> 
>>>   "Thus the hope is that eventually all projects on PyPI can be migrated to
>>>   the pypi-explicit mode, while preserving the ability to install release
>>>   files hosted externally via installer tools. Deprecation of hosting
>>>   modes to eventually only allow the pypi-explicit mode is NOT REGULATED
>>>   by this PEP but is expected to become feasible some time after
>>>   successful implementation of the transition phases described in this
>>>   PEP. It is expected that deprecation requires a new process to deal with
>>>   abandoned packages because of unreachable maintainers for still popular
>>>   packages."
>>> 
>>> PEP470 could be this successor, cleaning up and simplifying the situation.
>>> But how do we maintain full backward compatibility and still get rid of 
>>> crawling?  Here is a sketch of a process for retiring the pypi-crawl* modes:
>>> 
>>> - send a warning note to maintainers a month before their pypi-crawl* 
>>> hosted projects are converted (informing them about the process, see the 
>>> next points).  Advertise a tool to convert pypi-crawl* hosting modes to 
>>> pypi-explicit.  This tool automates the crawling to register all found 
>>> release files either as explicit references with MD5 checksums, or to 
>>> upload them to become pypi-hosted files, at the option of the maintainer.  
>>> It will also switch the hosting mode on the pypi site automatically.
>>> 
>>> We'll also disallow, at warning time, creating new pypi-crawl* mode 
>>> projects on pypi or switching to those modes from other modes.
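The core step of such a conversion tool could be sketched roughly as below. This is purely illustrative: the page layout, the `fetch` callable, and all names are assumptions, not an existing implementation.

```python
import hashlib
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href targets from a simple-index-style page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def convert_to_explicit(page_html, fetch):
    """Turn crawled release-file links into explicit, checksummed references.

    `fetch` is a callable returning the file's bytes; each discovered link
    is re-registered with the MD5 of the file it pointed to at conversion
    time (the snapshot-in-time behavior discussed in this thread).
    """
    collector = LinkCollector()
    collector.feed(page_html)
    return ["%s#md5=%s" % (url, hashlib.md5(fetch(url)).hexdigest())
            for url in collector.links]
```

The resulting `#md5=` links have the same checksummed form that pypi-explicit pages already serve, so installers would need no new behavior to consume them.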
>>> 
>>> - a month later a pypi admin (guess who!) uses the same conversion tool,
>>> but with admin superpowers, to convert any remaining pypi-crawl* 
>>> hosting-mode projects automatically, with one addition: all those 
>>> admin-converted projects will get a "stale" flag because the maintainer 
>>> did not react and perform the conversion themselves.  This "stale" status 
>>> will be shown on the web page, and new tool releases could learn to read 
>>> the flag from the simple page so that they can warn end users that they 
>>> are installing a project with a known-stale maintainer.
>>> 
>>> The admin-driven conversion can be done incrementally in bunches,
>>> to make it even more unlikely that we are going to face storms 
>>> of unhappy end users at any one point and to iron out issues as we go.
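How a newer installer would read such a flag depends entirely on how pypi chose to expose it. As one hypothetical encoding (nothing here is defined by pypi or any PEP), a `<meta name="stale">` marker on the simple page could be detected like this:

```python
from html.parser import HTMLParser


class StaleFlagParser(HTMLParser):
    """Look for a hypothetical <meta name="stale" content="true"> marker."""

    def __init__(self):
        super().__init__()
        self.stale = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name") == "stale" and d.get("content") == "true":
                self.stale = True


def project_is_stale(page_html):
    """Return True if the simple page carries the stale marker."""
    parser = StaleFlagParser()
    parser.feed(page_html)
    return parser.stale
```

An installer could then print a warning before installing from such a page, while older tools would simply ignore the marker.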
>>> 
>>> The result of this process is that we have only one hosting mode: 
>>> pypi-explicit, which was already introduced and specified in PEP438. 
>>> And pypi's simple pages will continue to present two kinds of links:
>>> 
>>> - rel="internal": release files directly uploaded to pypi
>>> 
>>> - other external links will be direct URLs with hash checksums to external
>>> release files.  Tools can already recognize them and inform the user.
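The two link kinds above can be told apart mechanically. A sketch of how a tool might classify them (the class and attribute handling are illustrative; real installers such as pip implement this differently):

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkClassifier(HTMLParser):
    """Split simple-page links into pypi-hosted and checksummed-external."""

    def __init__(self):
        super().__init__()
        self.internal = []  # rel="internal": files uploaded to pypi
        self.external = []  # direct external URLs carrying a hash fragment

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        href = d.get("href")
        if not href:
            return
        if d.get("rel") == "internal":
            self.internal.append(href)
        else:
            # only external links carrying a checksum are verifiable
            _, fragment = urldefrag(href)
            if fragment.partition("=")[0] in ("md5", "sha1", "sha256"):
                self.external.append(href)
```

Links with neither `rel="internal"` nor a checksum fragment are simply dropped, which matches the goal of ending up with only verifiable references.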
>>> 
>>> sidenote: if people have a PIP_DOWNLOAD_CACHE they will
>>> only depend on the reachability of pypi after they first installed
>>> an external dependency.  So it's operationally a good situation given
>>> that using "--allow-external" provides exactly the same file
>>> installation integrity as pypi-hosted files themselves do.
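The integrity claim rests on the checksum embedded in the link fragment: verifying downloaded bytes against a `#md5=<hex>` (or `#sha256=<hex>`) fragment is the same check an installer applies to pypi-hosted files. A rough sketch, with illustrative names:

```python
import hashlib
from urllib.parse import urldefrag


def verify_download(link, data):
    """Check downloaded bytes against the checksum in the link's fragment.

    Returns False when no usable checksum is present, since the file
    cannot be verified at all in that case.
    """
    _, fragment = urldefrag(link)
    algo, _, expected = fragment.partition("=")
    if not expected or algo not in ("md5", "sha1", "sha224", "sha256",
                                    "sha384", "sha512"):
        return False
    return hashlib.new(algo, data).hexdigest() == expected
```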
>>> 
>>> After we complete the automated admin-driven transition there is no external
>>> scraping and no unverified links, and tools could drop support for them over
>>> time.  There remain two ways to release files: upload them
>>> to pypi or register a checksummed link.   In addition, we will have 
>>> a clear list of "stale"-marked projects and can work 
>>> with it further.
>>> 
>>> Note that with this proposed process 93% of maintainers, most toolers
>>> and all end-users can remain ignorant of this PEP and will not be
>>> bothered: everything just continues to work unmodified.  Some end users
>>> will experience a speed up because the client-side will not need
>>> to download/crawl additional external simple pages.  There are no new
>>> things people need to learn except for the "crawl" maintainers to whom
>>> we nicely and empathically send a message: "switch or be switched" :)
>>> 
>>> You'll note that the process proposed here does not require
>>> pypi.python.org to manage "external additional indexes" information or
>>> tools to learn to recognize them.  At this point, I am not sure it's 
>>> really needed for the cleanup and simplification issues PEP470 tries to 
>>> address.
>>> 
>>> backward-compat-is-a-thing'ly yours,
>>> holger
>> 
>> Backwards compatibility is a noble goal! It is not however the only goal.
>> 
>> I feel very strongly that PyPI should not make security sensitive claims about
>> a project it does not know to be true. Here's the thing, we do not know if
>> the files we discover are safe files and we have no way to verify them. We
>> don't even know that the original author still owns the domain and someone
>> hasn't bought it up and put malicious files on them. Your proposal will change
>> it so that PyPI will make security claims about a project without actually
>> being able to know that those claims are accurate.
> 
> That's indeed an issue with my proposal: when we convert pypi-crawl* 
> mode projects, dragging files from external sites, checksumming them
> and providing links on the simple index, we capture a snapshot at that
> point in time.
> 
> An updated pip should warn about such "stale project" files, given that pypi
> marks them on the simple page.
> 
> Older versions would continue to install it if they pass
> ``--allow-external``, which already makes them aware that something
> is coming from an external site that they should be careful about.
> 
> The current "safety" guarantees pypi can make about the millions of its own
> release files are weak anyway: the best we can hope for is that it is serving
> the same files that were uploaded (and that could well not be the case,
> given that many of them were uploaded in plain-HTTP times, or that someone
> could have broken into pypi in recent years and modified files).  It's really
> a weak integrity we are providing, with fingers crossed.  By the way,
> was pypi upload affected by Heartbleed?

Uploading was not vulnerable to Heartbleed, but only because uploading
doesn’t generally use HTTPS at all yet.

The likelihood that one of those many hosts is compromised, especially given
that many of them have expired domain names attached to them, is far
greater than for PyPI. If I were a malicious actor, the first thing I would
do upon hearing your proposal is go and look for any project which gets
any traffic and also lists an external domain that has since expired. Then I
would register that domain and put files up on it that claim to be the latest
version and are API compatible with the old latest version, except they
include malicious software.

It is a big deal that we have no idea who owns those external hosts or
whether they are still controlled by the original owner.

> 
>> On top of that, it still fails to address:
>> 
>> * The reliability of the externally hosted files, especially for projects which
>>  are now "stale". How likely is it that an unmaintained project ends up having
>>  its external file links bitrot?
> 
> I noted above the use of PIP_DOWNLOAD_CACHE to help with reliability.
> People can also use devpi-server to cache external files locally and
> become less dependent on availability of external sites if needed. 
> 
> Note that requiring external indices has a harder reliability problem:
> such an install will need to fetch the simple page even if it uses
> PIP_DOWNLOAD_CACHE.

pip 1.6 removes the download cache and replaces it with an on by default HTTP
cache that respects cache headers.
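For reference, the core freshness rule such a header-respecting cache applies can be sketched as follows. This is a heavy simplification of HTTP caching (RFC 7234); real caches also handle `Expires`, `Vary`, revalidation, and more, and the function name here is illustrative:

```python
import re
import time


def is_fresh(cache_control, stored_at, now=None):
    """Return True if a response stored at `stored_at` (epoch seconds)
    is still fresh under its Cache-Control header value."""
    now = time.time() if now is None else now
    header = cache_control or ""
    match = re.search(r"max-age=(\d+)", header)
    # no max-age, or an explicit opt-out, means the response is not
    # servable from cache without revalidation
    if match is None or "no-store" in header or "no-cache" in header:
        return False
    return (now - stored_at) < int(match.group(1))
```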

Requiring external indexes does not have a *harder* reliability problem; it
has the same problem, except it’s much more explicit and encourages projects
to host on PyPI unless they have a good reason not to. 

> 
>> * The legality of mirroring. End users trying to mirror are still responsible
>>  for determining if they are able to mirror this file. This is especially
>>  important in China or other bandwidth constrained environments where good
>>  access (or access at all) to the Fastly CDN cannot be achieved.
> 
> Not sure I understand this point - it's a general issue for all proposals
> under discussion, no?

No. The explicitness of the new index makes it trivial for a project to be able
to depend only on projects that can be easily mirrored.

> 
>> Breaking backwards compatibility is always a hard choice, however I think it
>> makes sense in this case. There is no way to actually move forward on this
>> issue without either breaking or making potentially false claims about the
>> validity of a file. Furthermore, the 7% of projects affected is the most
>> generous way of doing the tally. I did not want my own biases to influence
>> the statistics, so I tried to remove any editorializing from them.
>> That being said, a significant portion of that 7% has only a few
>> (sometimes only 1) old releases hosted externally. Often when I've
>> pointed this out to authors they didn't even realize it; they had just
>> forgotten to call ``setup.py upload``. Finally, of the projects left, a lot
>> are very old (I've found some that were last released in 2003). A lot of
>> them do not work with any modern version of Python, and some do not
>> even have a ``setup.py`` and thus are not installable at all. These
>> are all issues that my processing didn't attempt to classify because I wanted
>> to remove my personal bias from the numbers, but the simple fact is that while
>> the maximum amount may be 7%, the actual amount is going to be far far less
>> than that.
> 
> To get this a bit more scientific, do we have a way to measure the number
> of accesses to simple pages for pypi-crawl* hosted projects?
> Maybe also specifically for those projects who only have files externally?

We can measure the number of accesses, yes. However it’s not particularly
accurate. The last time I looked, about 25% of the requests to the PyPI simple
index pages came from known mirroring clients. I know of others who are using
pip as a fake mirroring client in order to get the spidered external links.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
