[Distutils] PEP 438, pip and --allow-external (was: "pip: cdecimal an externally hosted file and may be unreliable" from python-dev)

Donald Stufft donald at stufft.io
Mon May 12 04:34:04 CEST 2014


On May 11, 2014, at 10:27 PM, Donald Stufft <donald at stufft.io> wrote:

> 
> On May 11, 2014, at 7:35 PM, Donald Stufft <donald at stufft.io> wrote:
> 
>> However before I go further on that I want to dig more into the impact of these
>> things. It dawned on me earlier today that the way I was categorizing things
>> in my earlier number crunching was making it unreasonably hard to actually
>> divine any sort of meaning out of those numbers. I'm currently in the process
>> of crawling all of PyPI again*. After I have those new numbers I'll have a
>> better sense of things, and I think a better forward plan can be made.
> 
> 
> I've completed the crawl. I've made the scripts and the data available at
> https://github.com/dstufft/pypi-external-stats.
> 
> Here's the general statistics from that:
> 
> Hosted on PyPI: 37779
> Hosted Externally (<50%): 18
> Hosted Externally (>50%): 47
> Hosted Externally: 65
> Hosted Unsafely (<50%): 725
> Hosted Unsafely (>50%): 2249
> Hosted Unsafely: 2974
> 
> The new data more or less follows what the earlier data pointed to. However,
> I've changed my method of categorizing the projects. Previously I had split the
> projects into "only has files hosted using type X" and "has any files hosted
> using type X". This categorization made it hard to accurately determine impact.
> The problem is that a lot of projects have the same files uploaded to PyPI but
> also available unsafely. A project like this will not be impacted by a change
> in hosting; however, it wasn't possible to determine this using the previous
> data.
> 
> The new method sorts all of the files for a particular project into a set of
> categories: {PyPI, External, Unsafe}. Every file it finds goes into one of
> these categories. Once it has filled out the categories for all of them, it
> removes duplicate files (matched by exact filename). It prefers files hosted on
> PyPI over files hosted externally, and it prefers files hosted externally over
> those hosted unsafely. As a result, a project like the example above is
> represented by the *best* source for its files, not by every place that file
> can be found.
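
The deduplication step can be sketched roughly as follows. This is a minimal
illustration, not the code from the published scripts: the function name, the
category strings and the sample filenames are all made up for the example.

```python
# Hypothetical sketch of "keep the best source per filename".
# Lower number = preferred source (PyPI > external > unsafe).
PREFERENCE = {"pypi": 0, "external": 1, "unsafe": 2}


def best_sources(found):
    """Map each filename to the best category it was found under.

    `found` is an iterable of (filename, category) pairs, where the same
    filename may appear more than once with different categories.
    """
    best = {}
    for filename, category in found:
        current = best.get(filename)
        if current is None or PREFERENCE[category] < PREFERENCE[current]:
            best[filename] = category
    return best


# Made-up example: foo-1.0.tar.gz is on PyPI and also linked unsafely.
found = [
    ("foo-1.0.tar.gz", "unsafe"),
    ("foo-1.0.tar.gz", "pypi"),
    ("foo-0.9.tar.gz", "external"),
    ("foo-0.9.tar.gz", "unsafe"),
    ("foo-0.8.zip", "unsafe"),
]
result = best_sources(found)
```

In this example only foo-0.8.zip ends up counted as unsafely hosted, because
the other two files have a better source available.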
> 
> The statistics also split out projects which have > 50% of their files
> hosted externally or unsafely from projects which have < 50% of their files
> hosted externally or unsafely. The reasoning behind this is that there are
> projects which have only one or two files hosted externally or unsafely, and
> the impact of changes in this area is much greater for a project that hosts
> all of its files externally or unsafely than for one that has just one or two
> old releases hosted in that fashion. For completeness' sake I've also included
> the total numbers for each of the split options for easier comparison.
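
As a rough illustration of the 50% split, here is a small sketch. The function
name and the input shape (a filename -> category mapping) are assumptions made
for the example, and the text above does not say how a project at exactly 50%
is handled, so putting it in the lower bucket is also an assumption.

```python
def bucket(best, category):
    """Return the bucket for `category`, or None if no file is in it.

    `best` maps each filename of one project to its category after
    deduplication. The result is category + "1" for the lower bucket and
    category + "2" for the > 50% bucket.
    """
    count = sum(1 for c in best.values() if c == category)
    if count == 0:
        return None
    # Assumption: a project at exactly 50% goes into the lower bucket.
    return category + ("2" if count * 2 > len(best) else "1")


# Made-up projects: one old unsafe release vs. mostly unsafe releases.
mostly_pypi = {"a-1.0.zip": "pypi", "a-1.1.zip": "pypi", "a-0.1.zip": "unsafe"}
mostly_unsafe = {"b-1.0.zip": "unsafe", "b-1.1.zip": "unsafe", "b-1.2.zip": "pypi"}
```

Here bucket(mostly_pypi, "unsafe") gives "unsafe1" and
bucket(mostly_unsafe, "unsafe") gives "unsafe2".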
> 
> Finally, it's important to note that defining what exactly is an installable
> file is difficult to do. In this script I've tried to take a maximal stance and
> err on the side of assuming something is an installable file. Specifically, I
> do not have any detection of:
> 
> * Filenames that do not match the project name (e.g. bar-1.0.tar.gz linked
>   from foo's page).
> 
> * Whether the file that is being linked to still exists at all (e.g. 404 or
>   NXDOMAIN).
> 
> * Whether the file that is being linked to unpacks successfully and has a
>   setup.py and/or whatever else is required to be a successfully installed
>   package.
> 
> * (pip specific) Whether the file has a sane version number that follows
>   PEP 440 and/or is not a pre-release.
> 
> It is also unlikely that these numbers are accurate for any one particular
> installer. In particular, pip does not support .egg files but this detection
> does; pip and this detection both support .whl files while setuptools does
> not.
> 
> The rules for detection are essentially:
> 
> 1. Look at /simple/<foo>/ for that project.
> 2. Look for any URL with rel=internal and count it as a PyPI hosted file.
> 3. Look for any URL that "looks" installable, meaning that the path in the
>    URL ends with one of {.tar, .tar.gz, .tar.bz2, .zip, .tgz, .egg, .whl}, and
>    that also has a #<hashname>=<hashvalue> fragment, and count it as an
>    externally hosted file.
> 4. Look for any URL that "looks" installable which does not have a hash URL
>    fragment and count it as an unsafely hosted file.
> 5. Look for any URL that does not "look" installable which has a rel of
>    {download, homepage} and fetch the page it points to.
> 6. Look at the HTML from #5 for URLs that look installable, with or
>    without a hash fragment, and count each one as an unsafely hosted file.
> 7. Deduplicate the found filenames by ensuring that each filename exists for
>    a project only once, with the preference of PyPI > external > unsafe.
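
For a link found directly on a /simple/<foo>/ page, rules 2 through 5 can be
sketched along these lines. This is only an illustration under assumptions,
not the actual script: the function name, the "crawl" return value and the
example URLs are invented, and matching is done on the URL path exactly as
written (no case folding).

```python
from urllib.parse import urlparse

# Suffixes that make a URL "look" installable, per rule 3.
INSTALLABLE_SUFFIXES = (
    ".tar", ".tar.gz", ".tar.bz2", ".zip", ".tgz", ".egg", ".whl",
)


def classify(url, rel=None):
    """Classify one link from a /simple/<foo>/ page.

    Returns "pypi", "external", "unsafe", "crawl" (a page to fetch and
    scan, rule 5), or None if the link is ignored.
    """
    if rel == "internal":
        return "pypi"  # rule 2
    parts = urlparse(url)
    if parts.path.endswith(INSTALLABLE_SUFFIXES):
        # Rule 3 needs a "#<hashname>=<hashvalue>" fragment.
        name, sep, value = parts.fragment.partition("=")
        if sep and name and value:
            return "external"
        return "unsafe"  # rule 4
    if rel in ("download", "homepage"):
        return "crawl"  # rule 5
    return None


print(classify("https://example.com/dl/foo-1.0.tar.gz#md5=0123abcd"))
print(classify("https://example.com/dl/foo-1.0.zip"))
print(classify("https://example.com/foo/", rel="homepage"))
```

Note that this only covers links on the index page itself. Under rule 6, any
installable-looking link found on a crawled page counts as unsafe whether or
not it carries a hash fragment.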
> 
> 
> * In all places I've used PyPI to mean hosted on PyPI, external to mean hosted
>   externally and safely, and unsafely to mean hosted externally and unsafely.

Oh, and Paul had asked before. Here’s the list of externally hosted projects:

https://github.com/dstufft/pypi-external-stats/blob/master/2014-05-11/processed.json#L2-L69

And here’s the list of unsafely hosted projects:

https://github.com/dstufft/pypi-external-stats/blob/master/2014-05-11/processed.json#L37852-L40829

The external1 and unsafe1 entries represent the <50% set, and external2 and
unsafe2 represent the >50% set.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
