On Wednesday, February 8, 2017, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
Thanks Steve, Chris,
On Tue, Feb 7, 2017, at 04:49 PM, Chris Wilcox wrote:
I may be able to help jump-start this a bit and provide a platform for this to run on. I deployed a small service that scans PyPI to gather statistics on Python 2 vs Python 3 support using PyPI classifiers. The source is on GitHub: https://github.com/crwilcox/PyPI-Gatherer. It watches the PyPI updates feed and refreshes entries for packages as they show up as modified. It should be possible to add your lib, query, and add an additional row or two to the result. I am happy to work together on this. Also, the data is stored in Azure Table Storage, which has REST endpoints (and a Python SDK) that make getting the published data straightforward.
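To illustrate the classifier-based approach described above, here is a minimal sketch of how Python 2 vs Python 3 support might be read from a package's Trove classifiers. It is not taken from PyPI-Gatherer; the function name python_major_versions is hypothetical, and the only assumption is that classifiers arrive as a list of strings.

```python
def python_major_versions(classifiers):
    """Return the set of major Python versions named in Trove classifiers.

    Hypothetical helper for illustration; not part of PyPI-Gatherer.
    """
    prefix = "Programming Language :: Python :: "
    majors = set()
    for classifier in classifiers:
        if not classifier.startswith(prefix):
            continue
        version = classifier[len(prefix):].strip()
        # Handles "2", "2.7" and "3 :: Only"; skips "Implementation :: CPython".
        first = version.split(" ")[0].split(".")[0]
        if first.isdigit():
            majors.add(int(first))
    return majors


example = [
    "Programming Language :: Python :: 2.7",
    "Programming Language :: Python :: 3 :: Only",
    "Programming Language :: Python :: Implementation :: CPython",
    "License :: OSI Approved :: MIT License",
]
supported = python_major_versions(example)
```

A package with no version classifiers yields an empty set, which a scanner would probably want to report as "unknown" rather than "unsupported".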
I had a quick look through this, and it does look like it should provide a useful framework for scanning PyPI and updating the results. :-)
What I'm proposing differs in that it would need to download files from PyPI - basically all of them, if we're thorough about it. I imagine that's going to involve a lot of data transfer. Do we know what order of magnitude we're talking about? Is it so large that we should be thinking of running the scanner in the same data centre as the file storage?
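One rough way to approach the order-of-magnitude question is to sum the file sizes reported in each project's JSON metadata instead of downloading anything. The sketch below assumes the metadata has a "releases" mapping of version to a list of file entries, each with a "size" in bytes, as the PyPI JSON API returns; the function names and the default base URL are illustrative assumptions.

```python
import json
from urllib.request import urlopen


def total_release_bytes(metadata):
    """Sum the reported size of every file across all releases of one project."""
    return sum(
        file_info.get("size") or 0
        for files in metadata.get("releases", {}).values()
        for file_info in files
    )


def fetch_metadata(name, base="https://pypi.org/pypi"):
    """Fetch one project's JSON metadata (network access; base URL is an assumption)."""
    with urlopen("{}/{}/json".format(base, name)) as response:
        return json.load(response)


# Hand-written sample in the same shape, so the sketch runs offline.
sample = {
    "releases": {
        "1.0": [{"filename": "demo-1.0.tar.gz", "size": 1200}],
        "1.1": [
            {"filename": "demo-1.1.tar.gz", "size": 1300},
            {"filename": "demo-1.1-py3-none-any.whl", "size": 900},
        ],
        "2.0": [],
    }
}
sample_total = total_release_bytes(sample)
```

Summing this over every project name would give an upper bound on the transfer for a full scan; restricting it to the latest release of each project would give a smaller figure.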
So, IIUC, you're looking to emit ((URL, release, platform), namespaces_odict) for each new and all existing packages; by uncompressing every package and running every setup.py (hopefully in a container)?

https://github.com/python/pypi-salt/blob/master/provisioning/salt/roots/pill...
https://github.com/python/pypi-salt/blob/master/provisioning/salt/roots/pill...
https://github.com/python/pypi-salt/blob/master/provisioning/salt/roots/salt...

- https://github.com/pypa/warehouse/blob/master/warehouse/packaging/search.py
  - elasticsearch_dsl
- https://github.com/pypa/warehouse/blob/master/warehouse/packaging/models.py
  - SQLAlchemy
- https://github.com/pypa/warehouse/blob/master/warehouse/celery.py
  - celery
- https://github.com/pypa/warehouse/blob/master/warehouse/legacy/api/json.py
  - namespaces are useful metadata (worth adding to the spec)
  - https://github.com/pypa/interoperability-peps/issues/31
    - JSONLD
- https://github.com/python/psf-salt/blob/master/pillar/prod/top.sls
- https://github.com/python/psf-salt/blob/master/pillar/prod/roles.sls
- One CI project (container FROM python: (debian)) per python package with additional metadata per project?
  - conda-forge solves for this case
  - and then how to post the extra metadata (build artifact) back from the CI build and mark the task as done

Could this (namespace extraction) be added to 'setup.py build' for the future?
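For wheels at least, the ((URL, release, platform), namespaces) record could perhaps be produced without running any setup.py, by listing the archive contents. The following is only a sketch under that assumption: the helper names top_level_names and make_record are hypothetical, compiled extension modules are ignored, and sdists would still need different handling.

```python
import io
import zipfile
from collections import OrderedDict


def top_level_names(wheel_file):
    """Map each top-level importable name in a wheel to 'package' or 'module'.

    Only .py files are considered; extension modules (.so/.pyd) are skipped.
    """
    names = OrderedDict()
    with zipfile.ZipFile(wheel_file) as archive:
        for path in sorted(archive.namelist()):
            first, sep, rest = path.partition("/")
            if first.endswith((".dist-info", ".data")):
                continue  # wheel metadata and data directories, not importable
            if sep:
                if rest.endswith(".py"):
                    names.setdefault(first, "package")
            elif first.endswith(".py"):
                names.setdefault(first[:-3], "module")
    return names


def make_record(url, release, platform, wheel_file):
    """Build one ((url, release, platform), namespaces) pair."""
    return (url, release, platform), top_level_names(wheel_file)


# Build a tiny in-memory wheel-like archive so the sketch runs offline.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("foo/__init__.py", "")
    archive.writestr("foo/bar.py", "")
    archive.writestr("single.py", "")
    archive.writestr("foo-1.0.dist-info/METADATA", "Name: foo\n")
buffer.seek(0)

key, namespaces = make_record(
    "https://example.invalid/foo-1.0-py3-none-any.whl", "1.0", "any", buffer
)
```

Reading the file listing avoids executing untrusted code, which is the main reason a container would otherwise be needed.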
Thomas