Re: [Distutils] Indexing modules in Python distributions
Thanks for cc-ing me Steve. I may be able to help jump-start this a bit and provide a platform for this to run on. I deployed a small service that scans PyPI to figure out statistics on Python 2 vs Python 3 support using PyPI Classifiers. The source is on GitHub: https://github.com/crwilcox/PyPI-Gatherer. It watches the PyPI updates feed and refreshes entries for packages as they show up as modified. It should be possible to add your lib, query, and add an additional row or two to the result. I am happy to work together on this.

Also, the data is stored in Azure Table Storage, which has REST endpoints (and a Python SDK) that make getting the published data straightforward. Here is an example of using the data provided by the service, a Jupyter Notebook analysing Python 3 adoption: https://notebooks.azure.com/chris/libraries/pypidataanalysis

Thanks,
Chris

From: Steve Dower [mailto:steve.dower@python.org]
Sent: Tuesday, 7 February, 2017 6:39
To: Thomas Kluyver <thomas@kluyver.me.uk>; distutils-sig@python.org
Cc: Chris Wilcox <Christopher.Wilcox@microsoft.com>
Subject: RE: [Distutils] Indexing modules in Python distributions

I'm interested, and potentially in a position to provide funded infrastructure for this (though perhaps not as soon as you'd like, since things can move slowly at my end).

My personal preference would be to download a full list. This is slow-moving data that will gzip nicely, and my uses (in the IDE) will require many tentative queries. I can also see value in a single-query API, but keep it simple - the value here is in the data, not the lookup.

As far as updates go, most packaging systems should have some sort of release notification or update feed, so the work is likely going to be in hooking up to those and turning it into a scan task.

Cheers,
Steve

Top-posted from my Windows Phone

________________________________

From: Thomas Kluyver <thomas@kluyver.me.uk>
Sent: 2/7/2017 3:30
To: distutils-sig@python.org
Subject: [Distutils] Indexing modules in Python distributions

For a variety of reasons, I would like to build an index of what modules/packages are contained in which distributions ('packages') on PyPI. For instance:

- Identifying requirements by static analysis of code: 'import zmq' -> requires pyzmq
- Finding corresponding packages from different packaging systems: pyzmq on PyPI corresponds to pyzmq in conda, and python[3]-zmq in Debian repositories. This is an oversimplification, but importable module names provide a common basis to compare packages. I'd like a tool that could pick between different ways of installing a given module.

People often assume that the import name is the same as the name on PyPI. This is true in the vast majority of cases, but there's no requirement that they are the same, and there are cases where they're not - pyzmq is one example. The metadata field 'Provides' is, according to PEP 314, intended for this purpose, but the standard packaging tools don't make it easy to use, and consequently very few packages specify it.

I have started putting together a tool to index wheels. It reads a .whl file, finds modules inside it, and tries to identify namespace packages. It's still quite rough, but it worked with the wheels I tried. https://github.com/takluyver/wheeldex

Is this something that other people are interested in?
One thing I'm trying to work out at the moment is how the data would be accessed: as a web service that tools can query online, or more like Linux packaging, where tools download and cache a list to do lookups locally. Or both? There's also, of course, the question of how the index would be built and updated.

Thanks,
Thomas
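To make the wheel-scanning idea above concrete, here is a minimal sketch of the kind of inspection a tool like wheeldex performs. It is not wheeldex's actual implementation, just an illustration that assumes a locally downloaded .whl file; namespace packages (which wheeldex also tries to detect) need extra care, since they may have no __init__.py at all.

    import zipfile

    def top_level_names(wheel_path):
        """Collect the importable top-level names shipped in one wheel."""
        names = set()
        with zipfile.ZipFile(wheel_path) as whl:
            for entry in whl.namelist():
                first, _, rest = entry.partition('/')
                if first.endswith('.dist-info') or first.endswith('.data'):
                    # Prefer the names the build tool recorded, if present.
                    if rest == 'top_level.txt':
                        names.update(whl.read(entry).decode().split())
                    continue
                if not rest and first.endswith('.py'):
                    names.add(first[:-3])   # top-level module, e.g. six.py -> six
                elif rest == '__init__.py':
                    names.add(first)        # regular top-level package
        return sorted(names)

    if __name__ == '__main__':
        import sys
        print(top_level_names(sys.argv[1]))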
Thanks Steve, Chris.

On Tue, Feb 7, 2017, at 04:49 PM, Chris Wilcox wrote:
I may be able to help jump-start this a bit and provide a platform for this to run on. I deployed a small service that scans PyPI to figure out statistics on Python 2 vs Python 3 support using PyPI Classifiers. The source is on GitHub: https://github.com/crwilcox/PyPI-Gatherer. It watches the PyPI updates feed and refreshes entries for packages as they show up as modified. It should be possible to add your lib, query, and add an additional row or two to the result. I am happy to work together on this. Also, the data is stored in Azure Table Storage, which has REST endpoints (and a Python SDK) that make getting the published data straightforward.
I had a quick look through this, and it does look like it should provide a useful framework for scanning PyPI and updating the results. :-)

What I'm proposing differs in that it would need to download files from PyPI - basically all of them, if we're thorough about it. I imagine that's going to involve a lot of data transfer. Do we know what order of magnitude we're talking about? Is it so large that we should be thinking of running the scanner in the same data centre as the file storage?

Thomas
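As a rough way to gauge the transfer involved, per-file sizes are available from PyPI's JSON API; the sketch below sums them for a single project. The endpoint and field names are taken from the current API as I remember it, so treat them as assumptions, and it uses the third-party requests library.

    import requests  # third-party: pip install requests

    def project_download_size(name):
        """Total size in bytes of every file ever uploaded for one project."""
        resp = requests.get('https://pypi.org/pypi/{}/json'.format(name), timeout=30)
        resp.raise_for_status()
        releases = resp.json()['releases']
        return sum(f['size'] for files in releases.values() for f in files)

    if __name__ == '__main__':
        for project in ('pyzmq', 'numpy'):
            print(project, project_download_size(project) / 1e9, 'GB')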
On Wednesday, February 8, 2017, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
Thanks Steve, Chris,
On Tue, Feb 7, 2017, at 04:49 PM, Chris Wilcox wrote:
I may be able to help jump-start this a bit and provide a platform for this to run on. I deployed a small service that scans PyPI to figure out statistics on Python 2 vs Python 3 support using PyPI Classifiers. The source is on GitHub: https://github.com/crwilcox/PyPI-Gatherer. It watches the PyPI updates feed and refreshes entries for packages as they show up as modified. It should be possible to add your lib, query, and add an additional row or two to the result. I am happy to work together on this. Also, the data is stored in Azure Table Storage, which has REST endpoints (and a Python SDK) that make getting the published data straightforward.
I had a quick look through this, and it does look like it should provide a useful framework for scanning PyPI and updating the results. :-)
What I'm proposing differs in that it would need to download files from PyPI - basically all of them, if we're thorough about it. I imagine that's going to involve a lot of data transfer. Do we know what order of magnitude we're talking about? Is it so large that we should be thinking of running the scanner in the same data centre as the file storage?
So, IIUC, you're looking to emit ((URL, release, platform), namespaces_odict) for each new and all existing packages; by uncompressing every package and running every setup.py (hopefully in a container)?

https://github.com/python/pypi-salt/blob/master/provisioning/salt/roots/pill...
https://github.com/python/pypi-salt/blob/master/provisioning/salt/roots/pill...
https://github.com/python/pypi-salt/blob/master/provisioning/salt/roots/salt...

- https://github.com/pypa/warehouse/blob/master/warehouse/packaging/search.py - elasticsearch_dsl
- https://github.com/pypa/warehouse/blob/master/warehouse/packaging/models.py - SQLAlchemy
- https://github.com/pypa/warehouse/blob/master/warehouse/celery.py - celery
- https://github.com/pypa/warehouse/blob/master/warehouse/legacy/api/json.py
- namespaces are useful metadata (worth adding to the spec)
- https://github.com/pypa/interoperability-peps/issues/31 - JSONLD
- https://github.com/python/psf-salt/blob/master/pillar/prod/top.sls
- https://github.com/python/psf-salt/blob/master/pillar/prod/roles.sls
- One CI project (container FROM python: (debian)) per Python package with additional metadata per project?
- conda-forge solves for this case
- and then how to post the extra metadata (build artifact) back from the CI build and mark the task as done

Could this (namespace extraction) be added to 'setup.py build' for the future?
Thomas
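For what it's worth, one possible Python shape for the per-file record described above could look like the following. All field names and the example values are purely illustrative, not an agreed format.

    from collections import OrderedDict, namedtuple

    # Illustrative only: ((URL, release, platform), namespaces_odict)
    FileKey = namedtuple('FileKey', ['url', 'release', 'platform'])

    def example_record():
        key = FileKey(
            url='https://files.pythonhosted.org/packages/example/pyzmq-16.0.2-cp36-cp36m-manylinux1_x86_64.whl',
            release='16.0.2',
            platform='manylinux1_x86_64',
        )
        # import name -> kind of thing it is (module / package / namespace package)
        namespaces = OrderedDict([('zmq', 'package')])
        return key, namespaces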
On Wed, Feb 8, 2017, at 11:06 PM, Wes Turner wrote:
So, IIUC,
you're looking to emit
((URL, release, platform), namespaces_odict)
for each new and all existing packages;
by uncompressing every package and running every setup.py (hopefully in a container)?
Something like that, yes. For packages that publish wheels, we can analyse those directly without needing to run setup.py. Of course there are many packages with only sdists published.
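For the sdist case, a rough (and admittedly fragile) heuristic that avoids running setup.py is to look for top-level packages and modules directly in the archive. The sketch below assumes a plain '<project>-<version>/' layout and would miss src/ layouts and generated code:

    import tarfile

    def guess_names_from_sdist(sdist_path):
        """Guess importable top-level names from an sdist without running setup.py."""
        names = set()
        with tarfile.open(sdist_path) as sdist:
            for member in sdist.getnames():
                parts = member.split('/')
                # sdist members are usually rooted at '<project>-<version>/'
                if len(parts) == 3 and parts[2] == '__init__.py':
                    names.add(parts[1])        # top-level package
                elif (len(parts) == 2 and parts[1].endswith('.py')
                      and parts[1] != 'setup.py'):
                    names.add(parts[1][:-3])   # top-level module
        return sorted(names)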
Could this (namespace extraction) be added to 'setup.py build' for the future?
Potentially. As I mentioned, there is a place in the metadata to put this information - the 'Provides' field. However, relying on package uploaders would take a long time to build up decent coverage of the available packages, so I'm inclined to focus on scanning PyPI, similar to the tool Chris already showed.

Thomas
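For reference, declaring this today would look roughly like the sketch below - setup() accepts a 'provides' argument that should end up as the PEP 314 'Provides' field in PKG-INFO, though as noted above almost nothing consumes it. The version and packages values here are placeholders.

    from setuptools import setup, find_packages

    setup(
        name='pyzmq',                 # name used on PyPI
        version='0.0.0',              # placeholder for illustration
        packages=find_packages(),
        provides=['zmq'],             # importable name(s) this distribution supplies
    )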
On 8 February 2017 at 19:14, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
What I'm proposing differs in that it would need to download files from PyPI - basically all of them, if we're thorough about it. I imagine that's going to involve a lot of data transfer. Do we know what order of magnitude we're talking about? Is it so large that we should be thinking of running the scanner in the same data centre as the file storage?
Last time I asked Donald about doing things like this, he noted that a full mirror is ~215 GiB. That was a year or two ago so I assume the number has gone up since then, but it should still be in the same order of magnitude.
From an ecosystem resilience point of view, there's also a lot to be said for having copies of the full PyPI bulk artifact store in both AWS S3 (which is where the production PyPI data lives) and in Azure :)
Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 2017-02-08 18:14:38 +0000 (+0000), Thomas Kluyver wrote: [...]
What I'm proposing differs in that it would need to download files from PyPI - basically all of them, if we're thorough about it. I imagine that's going to involve a lot of data transfer. Do we know what order of magnitude we're talking about? [...]
The crowd I run with uses https://pypi.org/project/bandersnatch/ to maintain a full PyPI mirror for our project's distributed CI system, and du says the current aggregate size is 488 GiB. Also, if you want to initialize a full mirror this way, plan for it to take several days to populate.

--
Jeremy Stanley
Thanks. So the current size is about 0.5 TB, and presumably, if people are maintaining full mirrors, PyPI itself can cope with that much outgoing bandwidth being used.

Steve & Chris: does downloading & scanning that volume of data sound like something you'd want to do on Azure? Does anyone there have some time to put in to move this forwards?

Thomas

On Thu, Feb 9, 2017, at 10:18 PM, Jeremy Stanley wrote:
On 2017-02-08 18:14:38 +0000 (+0000), Thomas Kluyver wrote: [...]
What I'm proposing differs in that it would need to download files from PyPI - basically all of them, if we're thorough about it. I imagine that's going to involve a lot of data transfer. Do we know what order of magnitude we're talking about? [...]
The crowd I run with uses https://pypi.org/project/bandersnatch/ to maintain a full PyPI mirror for our project's distributed CI system, and du says the current aggregate size is 488 GiB. Also, if you want to initialize a full mirror this way, plan for it to take several days to populate. -- Jeremy Stanley
On Feb 13, 2017, at 12:25 PM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
Thanks. So the current size is about 0.5 TB, and presumably if people are maintaining full mirrors, PyPI itself can cope with that much outgoing bandwidth being used.
Yea, PyPI does something like 16 TB a day of bandwidth :)

— Donald Stufft
participants (6)

- Chris Wilcox
- Donald Stufft
- Jeremy Stanley
- Nick Coghlan
- Thomas Kluyver
- Wes Turner