[Distutils] PEP470, backward compat is a ...

Donald Stufft donald at stufft.io
Fri May 16 15:01:40 CEST 2014


On May 16, 2014, at 8:45 AM, holger krekel <holger at merlinux.eu> wrote:

> On Fri, May 16, 2014 at 08:20 -0400, Donald Stufft wrote:
>> 
>> 
>> Uploading was not vulnerable to Heartbleed, but only because uploading
>> doesn’t generally use HTTPS at all yet.
> 
> Wait, uploading release files does not use https?  I use
> "https://pypi.python.org/pypi" as the upload endpoint.
> And it transfers basic auth.  Sounds to me like Heartbleed could very
> easily have gotten at this information and uploaded files in my name.

Sorry, let me be more specific: by default, uploading was not vulnerable to
Heartbleed.
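
For anyone following along: the way holger gets this (uploads going to the
HTTPS endpoint) is a couple of lines in ~/.pypirc. Roughly something like
this, with placeholder credentials:

    [distutils]
    index-servers =
        pypi

    [pypi]
    repository = https://pypi.python.org/pypi
    username = <your username>
    password = <your password>

With that in place ``setup.py upload`` at least sends the basic auth
credentials over TLS (modulo certificate verification in older Pythons)
rather than over the plain-HTTP default.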

> 
>> The likelihood that one of many hosts, especially given that many of them
>> have expired domain names attached to them, is compromised is far greater
>> than that of PyPI. If I were a malicious actor, the first thing I would do
>> upon hearing your proposal is go and look for any project which gets any
>> traffic and also lists an external domain that has since expired. Then I
>> would register that domain and put files up on it that claim to be the
>> latest version and are API compatible with the old latest version, except
>> that they include malicious software.
> 
> Sure, i am aware of this issue - we both discussed it at PEP438 time.
> But why would you have waited with your evil compromising activity until
> now, when you could have used and benefited from the same technique before?
> (The PEP itself says that people mindlessly enable crawling, IIRC.)

Perhaps you already *are* doing that. Perhaps you only just learned about it
because of this PEP. Perhaps something that people have to opt into with an
“unsafe” flag wasn’t motivation enough for you.

The point is, it doesn’t really matter when or why. The proposal turns that
situation from one where we make no claims about a file’s validity into one
where we state "yes, this is the valid file". I am completely against PyPI
making *any* claim like that which depends on us guessing that the remote
host is still owned by the original author.

> 
>> It is a big deal that we have no idea who owns those external hosts, or
>> whether they are still controlled by the original owner.
> 
> If it's sourceforge or google code or some other reasonably reliable
> hosting facility, we know about ownership.  Their integrity and reliability
> is not necessarily worse than our handmade pypi/CDN interactions.
> 
> I wonder how many other random sites we have that are hosting
> release files.

The problem is, even though these other sites have a *higher* chance of still
being under the control of the correct person, it’s completely reasonable to
assume that someone may have deleted their project on those hosts because
they were abandoning it but never got around to deleting it on PyPI, or
because they used to host on one of those sites, then moved to GitHub and
deleted the old stuff.

> 
>>>> On top of that, it still fails to address:
>>>> 
>>>> * The reliability of the externally hosted files, especially for projects
>>>> which are now "stale". How likely is it that an unmaintained project ends
>>>> up having its external file links bitrot?
>>> 
>>> I noted above the use of PIP_DOWNLOAD_CACHE to help with reliability.
>>> People can also use devpi-server to cache external files locally and
>>> become less dependent on the availability of external sites if needed.
>>> 
>>> Note that requiring external indices has a harder reliability problem:
>>> such an install will still need to fetch the simple page even if it uses
>>> PIP_DOWNLOAD_CACHE.
>> 
>> pip 1.6 removes the download cache and replaces it with an on-by-default
>> HTTP cache that respects cache headers.
> 
> ah, nice.
> 
>> Requiring external indexes does not have a *harder* reliability problem; it
>> has the same problem, except it’s much more explicit and encourages projects
>> to host on PyPI unless they have a good reason not to.
>>> 
>>>> * The legality of mirroring. End users trying to mirror are still
>>>> responsible for determining whether they are able to mirror a given file.
>>>> This is especially important in China and other bandwidth-constrained
>>>> environments where good access (or access at all) to the Fastly CDN
>>>> cannot be achieved.
>>> 
>>> Not sure i understand this point - it's a general issue for all proposals
>>> under discussion, no?
>> 
>> No. The explicitness of the new index makes it trivial for a project to
>> depend only on projects that can be easily mirrored.
> 
> Hum, still not sure i fully understand this point and its relevance
> but i'll leave that for now.
> 
>>>> Breaking backwards compatibility is always a hard choice, however I think
>>>> it makes sense in this case. There is no way to actually move forward on
>>>> this issue without either breaking compatibility or making potentially
>>>> false claims about the validity of a file. Furthermore, the 7% of projects
>>>> affected is the most inclusive way of doing the tally. I did not want my
>>>> own biases to influence the statistics, so I tried to remove any
>>>> editorializing from them. That being said, a significant portion of that
>>>> 7% has only a few (sometimes only 1) old releases hosted externally. Often
>>>> when I've pointed this out to authors they didn't even realize it; they
>>>> had simply forgotten to call ``setup.py upload``. Finally, of the projects
>>>> left, a lot are very old (I've found some that were last released in
>>>> 2003). A lot of them do not work with any modern version of Python, and
>>>> some do not even have a ``setup.py``, so they are not installable at all.
>>>> These are all issues that my processing didn't attempt to classify,
>>>> because I wanted to keep my personal bias out of the numbers, but the
>>>> simple fact is that while the maximum amount may be 7%, the actual amount
>>>> is going to be far, far less than that.
>>> 
>>> To get this a bit more scientific, do we have a way to measure the number
>>> of accesses to simple pages for pypi-crawl* hosted projects?
>>> Maybe also specifically for those projects that only have files externally?
>> 
>> We can count the number of accesses, yes. However, it’s not particularly
>> accurate. The last I looked, about 25% of the requests to the PyPI simple
>> index pages came from known mirroring clients, and I know of others who are
>> using pip as a fake mirroring client in order to get the spidered external
>> links.
> 
> Still, without a proper analysis we can only have gut feelings and use
> rough estimates.  When i said "a 1000 pypi-crawl* using project
> maintainers might not react", i was only using a third of the current
> 2974 ones.  And 100 downloads per day on average is not a very high
> number, yet it would result in 100K installation issues on the end
> user side.  I believe this is big enough to warrant attention and
> more serious attempts to reduce it.

Well, when I say it’s not accurate, I also mean you can really only get
visibility into what this is going to do for projects which are hosted 100%
externally. If a project has only a few files hosted externally and some or
most hosted on PyPI, then we have no way of differentiating between a request
that would have been served from PyPI and a request that would have been
served externally.
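
If someone wanted to attempt the kind of classification that measurement would
need, a rough sketch might look like the following. It assumes the current
/simple/ markup flags external homepage/download links with a rel attribute
(treat that as an assumption about the page format), and the parsing is
deliberately crude:

    # Rough sketch: classify a project's hosting mode from its /simple/ page.
    # Assumes external links carry rel="homepage" / rel="download" and that
    # PyPI-hosted files appear as plain <a href="..."> links.
    import re

    try:
        from urllib.request import urlopen  # Python 3
    except ImportError:
        from urllib2 import urlopen         # Python 2

    def hosting_mode(project):
        html = urlopen("https://pypi.python.org/simple/%s/" % project).read()
        if not isinstance(html, str):
            html = html.decode("utf-8", "replace")
        anchors = re.findall(r"<a [^>]*>", html)
        external = [a for a in anchors
                    if 'rel="homepage"' in a or 'rel="download"' in a]
        internal = [a for a in anchors if a not in external]
        if internal and external:
            return "mixed"
        if external:
            return "external only"
        return "pypi only"

Even then, a project classified as "mixed" tells you nothing about which of
its simple-page requests would actually have been served externally, which is
exactly the problem above.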

> 
> Even if we were to not do the automatic conversion, we can simply point
> maintainers to the proposed conversion tool -- a much easier one-time
> thing compared to advising them to set up an externally hosted index.
> But i maintain we should eventually do the automatic conversion and
> teach the tools to warn users at install time about externals coming
> from stale sources.

Maintaining an external index is not hard in the slightest. You can trivially
do it for free using pythonhosted.org, GitHub Pages, or any number of other
places like that.
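
To give a sense of how little is involved, here is a rough sketch that turns a
flat directory of release files into a static "simple" index you can publish
on any static host. The directory names and the ../../files/ layout are made
up, and the project-name guessing is deliberately naive:

    # Rough sketch: build a static "simple" index from a directory of
    # sdists/wheels, suitable for serving from a static host and consuming
    # with ``pip install --extra-index-url``. "dist", "simple" and the
    # ../../files/ layout are made-up examples.
    import os

    def build_simple_index(dist_dir, out_dir):
        projects = {}
        for fname in os.listdir(dist_dir):
            # Crude project-name guess; real names may contain dashes.
            name = fname.split("-")[0].lower()
            projects.setdefault(name, []).append(fname)

        for name, files in projects.items():
            project_dir = os.path.join(out_dir, name)
            if not os.path.isdir(project_dir):
                os.makedirs(project_dir)
            links = "\n".join('<a href="../../files/%s">%s</a><br/>' % (f, f)
                              for f in sorted(files))
            with open(os.path.join(project_dir, "index.html"), "w") as fp:
                fp.write("<html><body>\n%s\n</body></html>\n" % links)

    if __name__ == "__main__":
        build_simple_index("dist", "simple")

Push the generated ``simple/`` directory (plus the files it links to) to
GitHub Pages or similar, and point pip at it with something like
``pip install --extra-index-url https://<your account>.github.io/simple/
<project>``.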

> 
> As it stands, PEP470 does not discuss the stale maintainer issue,
> estimate how many projects it might affect, or weigh it against other
> considerations (part of which we are currently doing in this discussion ATM).

A project that is no longer maintained and which is hosted externally is more
likely to go missing at some point, or to have one of its scraped URLs expire
and be picked up by a malicious author. It is also more likely to just flat
out not work anymore, because it was developed for an ancient version of
Python or it was one of the ones that didn't include a proper setup.py.

> 
> I am mostly off for the weekend now,
> see you next week or so,
> 
> holger


-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
