[Distutils] PEP470, backward compat is a ...

Donald Stufft donald at stufft.io
Tue May 20 04:29:23 CEST 2014


Just an update: asyncmongo has now been released on PyPI, so I’ve removed
it from the gists as well. Still no word back from PIL.

On May 18, 2014, at 11:21 AM, Donald Stufft <donald at stufft.io> wrote:

> 
> On May 18, 2014, at 2:20 AM, holger krekel <holger at merlinux.eu> wrote:
> 
>> On Sat, May 17, 2014 at 20:20 -0400, Donald Stufft wrote:
>>> On May 17, 2014, at 1:51 PM, holger krekel <holger at merlinux.eu> wrote:
>>> 
>>>> On Sat, May 17, 2014 at 11:32 -0400, Donald Stufft wrote:
>>>>> More conclusions!
>>>>> 
>>>>> In that same time period PyPI received a total of ~16463209 hits to a page on
>>>>> the simple installer API. This means that in total these projects represent
>>>>> a combined 0.56% of the simple installer traffic on PyPI. However looking at
>>>>> the numbers you can see that PIL is an obvious outlier with the hits dropping
>>>>> drastically after that. PIL on its own represents 0.44% of the hits on PyPI
>>>>> during that time period leaving only 0.12% for anything not PIL.
>>>> 
>>>> So the current numbers roughly mean that around 92193 end-user sites per
>>>> day depend on crawling currently, right?  Do you know if these are also
>>>> unique IPs (they might indicate duplicates although companies also have NATting
>>>> firewalls)?
>>>> 
>>>> holger
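
As a quick sanity check of the figures quoted above, the following Python
sketch recomputes them. It takes the ~16463209 hit count and the 0.56% and
0.44% shares at face value; the variable names are illustrative and do not come
from the thread. The result counts hits, not distinct sites, which is why the
question about unique IPs matters.

```python
# Figures quoted in the thread; variable names are illustrative only.
total_simple_hits = 16463209    # approximate hits to the simple installer API
offsite_share = 0.56 / 100      # share attributed to the listed projects
pil_share = 0.44 / 100          # share attributed to PIL alone

# Convert the percentage shares back into absolute hit counts.
offsite_hits = total_simple_hits * offsite_share
pil_hits = total_simple_hits * pil_share

# Share left over for everything on the list that is not PIL.
non_pil_share_pct = round((offsite_share - pil_share) * 100, 2)

print(int(offsite_hits))
print(int(pil_hits))
print(non_pil_share_pct)
```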
>>> 
>>> Here’s the number of IP addresses that accessed each /simple/ page per day.
>>> 
>>> https://gist.github.com/dstufft/347112c3bcc91220e4b2
>>> 
>>> Unique IPs: 95541
>>> Unique IPs for Only Hosted off PyPI: 8248 (8.63%)
>>> Unique IPs for Only Hosted off PyPI w/o PIL: 2478 (2.59%)
>>> 
>>> It's important to remember when looking at these numbers that almost all of
>>> them represent something downloading a package unsafely which will generally
>>> contain Python code that will then be executed. Breaking the unsafe thing
>>> is, in my opinion, non-optional and the only thing needed to be discussed about
>>> it is how to go about doing it exactly. The safe thing I think *should* be
>>> removed for the various other reasons that have been outlined and it only
>>> represents a tiny fraction of uses.
>>> 
>>> The numbers to be specific are, 8248 of the above 8248 IPs downloaded something
>>> unsafely, while 214 of them also downloaded something safely. That means that
>>> 100% of the 8248 addresses could have been attacked through their use of PyPI
>>> and only 2.59% downloaded anything that was safely hosted off of PyPI.
>>> 
>>> Looking at the same numbers for projects which have *any* files hosted off of
>>> PyPI (the numbers thus far have been projects which have *only* files hosted
>>> off of PyPI) I see that 35046 IP addresses accessed a project that had any
>>> files hosted unsafely off of PyPI, while only 2852 IP addresses accessed a
>>> project that had any files hosted safely off of PyPI.
>>> 
>>> That means that, at a minimum, ~36% of the users of PyPI were
>>> vulnerable to a MITM attack on 2014-05-14 unless they were using pip 1.5
>>> without any --allow-unverified flags or they were using pip 1.4 with
>>> --allow-no-insecure and even in that case they could still be vulnerable if
>>> there is any use of setup_requires. I say that's a minimum because that only
>>> counts the projects where I happened to find a file hosted unsafely externally.
>>> It does not count at all any projects which I did not find a file like that but
>>> which still has locations on their simple page like that. This is especially
>>> troublesome for projects where they have old domain names in those links that
>>> point to domains that are no longer registered.
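
The percentages in this message can be reproduced from the raw counts. The
Python sketch below does so; the counts are the ones quoted above, while the
variable names and the pct() helper are illustrative and not part of the
original analysis.

```python
# Unique-IP counts quoted in the thread; names are illustrative only.
unique_ips = 95541
only_offsite = 8248           # accessed projects hosted only off PyPI
only_offsite_no_pil = 2478    # the same, excluding PIL
also_safe = 214               # of the 8248, also downloaded something safely
any_unsafe = 35046            # accessed a project with any unsafe external file


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


print(pct(only_offsite, unique_ips))         # 8.63
print(pct(only_offsite_no_pil, unique_ips))  # 2.59
print(pct(also_safe, only_offsite))          # 2.59
print(pct(any_unsafe, unique_ips))           # 36.68
```

The two 2.59% figures are a coincidence: one is 2478 out of 95541, the other
is 214 out of 8248.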
>>> 
>>> Also just FYI I've removed pyPDF from both lists as I've contacted the author
>>> and there are packages now hosted on PyPI for it. I've also contacted PIL and a
>>> few other authors (of which I've just heard back from cx_Oracle and they appear
>>> to be willing to upload as well).
>> 
>> Thanks Donald for both the numbers and contacting some key authors which
>> I think is a very good move!  I suggest we now wait a week or so to see
>> where we stand then, update the numbers and then try to settle on
>> crawl-deprecation paths.
>> 
>> Also, let's please just talk about "checksummed" packages or integrity.  
>> Even all pypi hosted packages are unsafe in the sense that they 
>> might contain bad code from malicious uploaders or http-interceptors 
>> that executes on end-user machines during installation.  Thus the term
>> "safe" is misleading and should not be used when communicating to
>> end-users.  Currently, we can only say or improve anything related to
>> integrity: what people download is what was uploaded by whoever happened
>> to have the credentials (*) or MITM access on http upload.  Speaking of the
>> latter, maybe we should also think about moving to https uploads and
>> certificate-pinning, and that also for installers.  And also, as Marius
>> pointed out, pypi is currently using the relatively weak MD5 hash.
> 
> The problem with upload is when people use setup.py upload they are oftentimes
> using the upload from distutils. Since that is in the standard library we can't
> really go backwards in time and make it safe. People who use my twine utility
> to upload instead of setup.py upload are not vulnerable to MITM on upload.
> 
> While I don't particularly like the MD5 hash, it's not true that the MD5 hash
> currently presents a problem against the threat model that we're worried about.
> It's relatively easy to generate a collision attack, which would mean that a
> malicious author could generate two packages, an unsafe and a safe one that
> hashed to the same thing. However MD5 is still resistant to 2nd preimage
> attacks so an attacker could not create a package that hashes to a given hash.
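
To make the distinction concrete, here is a minimal sketch of the kind of
check an installer can run on a downloaded file. It is not pip's actual
implementation; digest_matches() and the sample byte strings are hypothetical.
Swapping in different content under an already published digest would require
a second preimage, whereas a collision attack only helps someone who controls
both files from the start.

```python
import hashlib
import hmac


def digest_matches(data: bytes, expected_hex: str, algorithm: str = "md5") -> bool:
    """Return True if data hashes to expected_hex under the given algorithm."""
    actual = hashlib.new(algorithm, data).hexdigest()
    # compare_digest does a constant-time comparison of the two hex strings.
    return hmac.compare_digest(actual, expected_hex.lower())


# Hypothetical package contents and the digest an index would publish for them.
package = b"example package contents"
published = hashlib.md5(package).hexdigest()

# A file modified in transit no longer matches the published digest.
tampered = package + b" plus injected code"

print(digest_matches(package, published))
print(digest_matches(tampered, published))
```

Passing algorithm="sha256" together with a SHA-256 digest runs the same check
with a stronger hash.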
> 
>> 
>> Without resolving these issues we can not even truthfully declare
>> integrity as something that the pypi-hosted packages themselves are providing.
> 
> We cannot fix every problem at once. Right now the tools exist for authors to
> make it possible to do everything safely. The externally hosted files represent
> an easier to exploit attack than a MITM on author upload. The MITM requires a
> privileged network position on specific individuals who are also not using
> twine or the browser to upload their distributions.
> 
> Attacking people who are installing these packages is far easier. It would
> either require a privileged network position on one of ~90k IP addresses on any
> particular day (a much easier feat than for authors periodically) or, even
> easier, locate an expired domain registration and simply register the domain
> which wouldn't require a privileged network position at all.
> 
>> 
>> best,
>> holger
>> 
>> (*) did you happen to have run some password crackers against
>> the pypi database?  Might be a larger attack vector than hijacking
>> DNS entries.
> 
> No I have not. The database currently uses bcrypt with a work factor of 12
> which makes it computationally hard for me to brute force passwords for all
> ~30k users who have a password set. If there was a specific user I was
> interested in, a smart brute force attack might be able to locate something.
> Rate-limiting log in attempts is also on the list of things to add in
> Warehouse.
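
A back-of-the-envelope estimate shows why the work factor matters. In the
sketch below, only the work factor of 12 and the ~30k user count come from the
message above; the time per hash and the wordlist size are assumptions picked
for illustration, not measurements. Because bcrypt salts each hash, a guess has
to be recomputed for every user.

```python
# Rough cost of guessing bcrypt hashes; stdlib only, no hashing is performed.
work_factor = 12                 # from the thread
users_with_password = 30_000     # "~30k" from the thread
seconds_per_hash = 0.25          # ASSUMPTION: hypothetical single-core timing
guesses_per_user = 10_000        # ASSUMPTION: size of a small wordlist

# bcrypt runs 2**work_factor iterations of its key expansion per hash.
key_expansion_rounds = 2 ** work_factor

# Per-user salts mean every guess must be hashed separately for every user.
total_hashes = users_with_password * guesses_per_user
cpu_days_all_users = total_hashes * seconds_per_hash / 86_400
cpu_minutes_one_user = guesses_per_user * seconds_per_hash / 60

print(key_expansion_rounds)
print(round(cpu_days_all_users))
print(round(cpu_minutes_one_user))
```

Under these assumed numbers, trying the wordlist against every account takes
years of CPU time, while trying it against a single account takes well under an
hour.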
> 
> -----------------
> Donald Stufft
> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
> 


-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
