Mirroring PyPI JSON Locally
Hi all,

First time emailer, so please be kind. Also, if this is not the right mailing list for PyPA talk, I apologize; please point me in the right direction if so (Brett Cannon pointed me here). The main reason I have emailed here is I believe it may be PEP time to standardize the JSON metadata that PyPI makes available, like what was done for the 'simple API' described in PEP 503.

I've been doing a bit of work on `bandersnatch` (I didn't name it), which is a PEP 381 mirroring package and wanted to enhance it to also mirror the handy JSON metadata PyPI generates and makes available @ https://pypi.python.org/pypi/PKG_NAME/json.

I've done a PR on bandersnatch as a POC that mirrors both the PyPI directory structure (URL/pypi/PKG_NAME/json) and creates a standardizable URL/json/PKG_NAME that the former symlinks to (to be served by NGINX / some other proxy). I'm also contemplating naming the directory 'metadata' rather than 'json', so that if some new hotness comes along / we want to change the format down the line, we're not stuck with 'json' as the dirname. This PR can be found here: https://bitbucket.org/pypa/bandersnatch/pull-requests/33/save-json-metadata-to-mirror

My main use case is to write a very simple async 'verifier' tool that will crawl all the JSON files and then ensure the packages directory on each of my internal mirrors (I have a mirror per region / datacenter) has all the files it should. I sync centrally (to save resources on the PyPI infrastructure) and then rsync out all the diffs to each region / datacenter, and under some failure scenarios I could miss a file or many. So I feel using JSON pulled down from the authoritative source will allow an async job to verify the MD5 of all the package files on each mirror.

What are people's thoughts here? Is it worth a PEP similar to PEP 503 going forward? Can people enhance / share some thoughts on this idea?

Thanks,
Cooper Ry Lees
me@cooperlees.com
https://cooperlees.com/
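A rough sketch of the verifier idea above (not the actual bandersnatch PR): it assumes the `requests` library, a flat local layout of release files (a real mirror nests them under a packages/ tree), and relies on the per-file md5_digest entries in the JSON response:

    import hashlib
    import os

    import requests  # assumed installed: pip install requests


    def fetch_metadata(package, index="https://pypi.python.org"):
        """Fetch the JSON metadata PyPI exposes for a package."""
        response = requests.get("{}/pypi/{}/json".format(index, package))
        response.raise_for_status()
        return response.json()


    def verify_release_files(package, mirror_root):
        """Check each release file on a local mirror against its MD5."""
        for version, files in fetch_metadata(package)["releases"].items():
            for file_info in files:
                local_path = os.path.join(mirror_root, file_info["filename"])
                if not os.path.exists(local_path):
                    print("MISSING", file_info["filename"])
                    continue
                md5 = hashlib.md5()
                with open(local_path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        md5.update(chunk)
                ok = md5.hexdigest() == file_info["md5_digest"]
                print("OK" if ok else "BAD", file_info["filename"])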
If this were to be done, then IMO yes, a PEP would be the right way to standardise the JSON API.

But tools like pip don't use the JSON API much, and tools like devpi that expose the index API don't bother with the JSON API (making it less likely that consumers who want to work with indexes other than PyPI will use it). So you may not get much interest.

On the other hand, a PEP that simply documents the API and says "Index providers that choose to support the JSON API must do so with this interface" would probably be useful, and unlikely to get a lot of pushback (assuming you document what Warehouse and PyPI provide, and allow other providers to simply not provide anything).

Paul

On 13 August 2017 at 07:53, Cooper Ry Lees <lists@cooperlees.com> wrote:
Hi all,
First time emailer, so please be kind. Also, if this is not the right mailing list for PyPA talk, I apologize; please point me in the right direction if so (Brett Cannon pointed me here). The main reason I have emailed here is I believe it may be PEP time to standardize the JSON metadata that PyPI makes available, like what was done for the 'simple API' described in PEP 503.
I've been doing a bit of work on `bandersnatch` (I didn't name it), which is a PEP 381 mirroring package and wanted to enhance it to also mirror the handy JSON metadata PyPI generates and makes available @ https://pypi.python.org/pypi/PKG_NAME/json.
I've done a PR on bandersnatch as a POC that mirrors both the PyPI directory structure (URL/pypi/PKG_NAME/json) and creates a standardizable URL/json/PKG_NAME that the former symlinks to (to be served by NGINX / some other proxy). I'm also contemplating naming the directory 'metadata' rather than 'json', so that if some new hotness comes along / we want to change the format down the line, we're not stuck with 'json' as the dirname. This PR can be found here: https://bitbucket.org/pypa/bandersnatch/pull-requests/33/save-json-metadata-to-mirror
My main use case is to write a very simple async 'verifier' tool that will crawl all the JSON files and then ensure the packages directory on each of my internal mirrors (I have a mirror per region / datacenter) has all the files it should. I sync centrally (to save resources on the PyPI infrastructure) and then rsync out all the diffs to each region / datacenter, and under some failure scenarios I could miss a file or many. So I feel using JSON pulled down from the authoritative source will allow an async job to verify the MD5 of all the package files on each mirror.
What are people's thoughts here? Is it worth a PEP similar to PEP 503 going forward? Can people enhance / share some thoughts on this idea?
Thanks,
Cooper Ry Lees
me@cooperlees.com
https://cooperlees.com/
(cc'ing here from python-ideas)

Here are some notes re: changing metadata:

https://github.com/pypa/interoperability-peps/issues/31 (closed, but still very relevant)
https://www.google.com/search?q=pep426jsonld

Moving towards JSONLD is the best approach, I think. So that means, if you need to add additional metadata (?) and must key things, it would be best to also copy the key into an object:

{"thing1": {"@id": "thing1", "url": "..."}}

Instead of just:

{"thing1": {"url": "..."}}

https://github.com/pypa/interoperability-peps/issues/31#issuecomment-2331955...

On Sunday, August 13, 2017, Cooper Ry Lees <lists@cooperlees.com> wrote:
Hi all,
First time emailer, so please be kind. Also, if this is not the right mailing list for PyPA talk, I apologize; please point me in the right direction if so (Brett Cannon pointed me here). The main reason I have emailed here is I believe it may be PEP time to standardize the JSON metadata that PyPI makes available, like what was done for the 'simple API' described in PEP 503.
I've been doing a bit of work on `bandersnatch` (I didn't name it), which is a PEP 381 mirroring package and wanted to enhance it to also mirror the handy JSON metadata PyPI generates and makes available @ https://pypi.python.org/pypi/PKG_NAME/json.
I've done a PR on bandersnatch as a POC that mirrors both the PyPI directory structure (URL/pypi/PKG_NAME/json) and creates a standardizable URL/json/PKG_NAME that the former symlinks to (to be served by NGINX / some other proxy). I'm also contemplating naming the directory 'metadata' rather than 'json', so that if some new hotness comes along / we want to change the format down the line, we're not stuck with 'json' as the dirname. This PR can be found here: https://bitbucket.org/pypa/bandersnatch/pull-requests/33/save-json-metadata-to-mirror
My main use case is to write a very simple async 'verifier' tool that will crawl all the JSON files and then ensure the packages directory on each of my internal mirrors (I have a mirror per region / datacenter) has all the files it should. I sync centrally (to save resources on the PyPI infrastructure) and then rsync out all the diffs to each region / datacenter, and under some failure scenarios I could miss a file or many. So I feel using JSON pulled down from the authoritative source will allow an async job to verify the MD5 of all the package files on each mirror.
What are people's thoughts here? Is it worth a PEP similar to PEP 503 going forward? Can people enhance / share some thoughts on this idea?
Thanks,
Cooper Ry Lees
me@cooperlees.com
https://cooperlees.com/
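As a tiny sketch of the keying transformation Wes suggests above (plain Python; the function name is hypothetical):

    def key_into_object(metadata):
        """Copy each top-level key into its value object as an @id,
        per the JSON-LD-friendly keying suggested above."""
        return {
            key: dict(value, **{"@id": key}) if isinstance(value, dict) else value
            for key, value in metadata.items()
        }

    # key_into_object({"thing1": {"url": "..."}})
    # -> {"thing1": {"@id": "thing1", "url": "..."}}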
Here are some notes on making this more efficient:

"Add API endpoint to get latest version of all projects"
https://github.com/pypa/warehouse/issues/347

...

http://markmail.org/search/?q=list:org.python.distutils-sig + { metadata , pep426jsonld }
On 19 August 2017 at 23:03, Wes Turner <wes.turner@gmail.com> wrote:
Here are some notes re: changing metadata:
I thought the proposal was to document the current state of affairs. Proposing a *change* to the JSON API would be a much bigger and more controversial proposal (and one I see little need for, personally). Paul
On Sunday, August 20, 2017, Paul Moore <p.f.moore@gmail.com> wrote:
On 19 August 2017 at 23:03, Wes Turner <wes.turner@gmail.com> wrote:
Here are some notes re: changing metadata:
I thought the proposal was to document the current state of affairs.
There are links to the current PEPs and source codes in there, too.
Proposing a *change* to the JSON API would be a much bigger and more controversial proposal (and one I see little need for, personally).
We'd probably want to review how the dependency resolver work is going.

https://github.com/pypa/pip/issues/988#issuecomment-322255801
https://pradyunsg.github.io/gsoc-2017/08/14/final-lap/

IIUC, the task is still to: Download transitive portions of a [linked data] graph as JSON[LD] (optimally without iteratively downloading and decompressing package archives in order to retrieve their platform-dependent dependency edge metadata from a setup.py that is executed with filesystem privileges).

https://en.wikipedia.org/wiki/Pip_(Great_Expectations)
On 21 August 2017 at 00:51, Wes Turner <wes.turner@gmail.com> wrote:
IIUC, the task is still to: Download transitive portions of a [linked data] graph as JSON[LD] (optimally without iteratively downloading and decompressing package archives in order to retrieve their platform-dependent dependency edge metadata from a setup.py that is executed with filesystem privileges).
While I'm still generally negative on the idea of native reliance on JSON-LD, I'll note one thing that has changed since I last looked at it: I now see some potential concrete practical benefits to adopting it, rather than purely theoretical ones. In particular, https://github.com/scienceai/jsonld-vis now exists, and there wasn't anything like that around at the time of previous discussions.

However, that's still only of potential interest for PEP 426, which in turn still isn't needed for any of our practical near-term objectives (not even the "dependencies without downloads" one - if we were to prioritise that, we'd likely go for something closer to the way client-side dependency resolution already works, such as extracting the METADATA file from uploaded wheels and making it available for download in addition to the full wheel archives).

So for this thread, Paul's right: a PEP-503-style document describing the current PyPI JSON API would likely be a reasonable thing to write, as it would allow for more complete emulations of the current production PyPI service, including checking that Warehouse replicates that part of the API correctly. Anything more than that is still on the "No, not until some point after Legacy PyPI has been shut down" list.

Cheers,
Nick.

P.S. Some of the tools mentioned at http://www.seoskeptic.com/structured-data-markup-validation-testing-tools/ may also prove useful if we go down the JSON-LD path. However, that is the only reason I'd ever support us going down that path: useful functionality that we get for free by virtue of adopting an established convention. I've wrangled volunteer contributors to open source projects for long enough now to know that "because it's the right thing to do" simply doesn't cut it as a motivational tool - there needs to be some kind of actual benefit to the folks doing the work :)

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
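For what it's worth, the "extract METADATA from wheels" approach Nick mentions is straightforward to prototype, since wheels are plain zip archives (the wheel filename below is only an example):

    import zipfile


    def read_wheel_metadata(wheel_path):
        """Pull the METADATA file out of a wheel's .dist-info directory."""
        with zipfile.ZipFile(wheel_path) as wheel:
            for name in wheel.namelist():
                if name.endswith(".dist-info/METADATA"):
                    return wheel.read(name).decode("utf-8")
        raise ValueError("no METADATA found in {}".format(wheel_path))


    # Requires-Dist lines carry the PEP 508 dependency specifiers:
    for line in read_wheel_metadata("example-1.0-py2.py3-none-any.whl").splitlines():
        if line.startswith("Requires-Dist:"):
            print(line)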
On 21 August 2017 at 09:54, Nick Coghlan <ncoghlan@gmail.com> wrote:
While I'm still generally negative on the idea of native reliance on JSON-LD, I'll note one thing that has changed since I last looked at it: I now see some potential concrete practical benefits to adopting it, rather than purely theoretical ones. In particular, https://github.com/scienceai/jsonld-vis now exists, and there wasn't anything like that around at the time of previous discussions.
Personally, I fairly often write ad hoc scripts that use the JSON API, and as it stands it's very convenient for that. From what I can see of JSON-LD (which basically equates to "it adds some extra metadata keys that don't change the data content but do change the list of keys and maybe the nesting levels"), it would be somewhat inconvenient for my scripts, and add no extra capability that I would ever use.

Before we consider anything like JSON-LD, I think we need a much clearer picture of who uses the JSON API. If it's production-type applications, then maybe it would be useful, but if it's mostly ad hoc scripts (as I suspect) it's additional complexity for little or no benefit.

But this remains off-topic for now, so that's all I'll say.
Paul
On 21 August 2017 at 19:38, Paul Moore <p.f.moore@gmail.com> wrote:
On 21 August 2017 at 09:54, Nick Coghlan <ncoghlan@gmail.com> wrote:
While I'm still generally negative on the idea of native reliance on JSON-LD, I'll note one thing that has changed since I last looked at it: I now see some potential concrete practical benefits to adopting it, rather than purely theoretical ones. In particular, https://github.com/scienceai/jsonld-vis now exists, and there wasn't anything like that around at the time of previous discussions.
Personally, I fairly often write ad hoc scripts that use the JSON API, and as it stands it's very convenient for that. From what I can see of JSON-LD (which basically equates to "it adds some extra metadata keys that don't change the data content but do change the list of keys and maybe the nesting levels"), it would be somewhat inconvenient for my scripts, and add no extra capability that I would ever use.
Right, and this is still my main concern with the idea as well: I'd never be OK with a JSON-LD-only API, because it adds too much irrelevant cognitive overhead for the vast majority of Python packaging specific use cases. (I would see it as being akin to Python itself deciding to require type annotations, rather than merely allowing them.)

However, where I'm starting to see a potential niche for it is as an opt-in capability, whereby we explicitly define how our metadata can be translated *to* JSON-LD, for folks that want to apply general purpose tools that know how to manipulate arbitrary JSON-LD data (like the graph visualiser I linked earlier).

That way, everybody wins - folks that have never heard of schema.org or linked data in general won't need to learn any concepts that are completely irrelevant to them, while folks that are aware of those things and the related tools will be free to use them without first having to figure out their own mapping from the Python specific metadata formats to a JSON-LD compatible format.

That approach then doesn't even need to wait for PEP 426: it could be done using the wheel METADATA file as a basis instead.

It will probably still be up to Wes to actually define that transformation though - I don't think anybody else is anywhere near keen enough to make use of the available JSON-LD tooling to spend any time working on enabling it :)

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
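Purely as an illustration of what such an opt-in translation might look like - the @context vocabulary below is invented for the example, not an agreed mapping:

    def metadata_to_jsonld(name, version, requires_dist):
        """Wrap a few core metadata fields in a JSON-LD document.
        The context mapping here is hypothetical."""
        return {
            "@context": {
                "name": "http://schema.org/name",
                "version": "http://schema.org/softwareVersion",
                "requires_dist": "http://schema.org/softwareRequirements",
            },
            "@id": "https://pypi.org/project/{}/".format(name),
            "name": name,
            "version": version,
            "requires_dist": requires_dist,
        }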
On Monday, August 21, 2017, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 21 August 2017 at 19:38, Paul Moore <p.f.moore@gmail.com> wrote:
On 21 August 2017 at 09:54, Nick Coghlan <ncoghlan@gmail.com> wrote:
While I'm still generally negative on the idea of native reliance on JSON-LD, I'll note one thing that has changed since I last looked at it: I now see some potential concrete practical benefits to adopting it, rather than purely theoretical ones. In particular, https://github.com/scienceai/jsonld-vis now exists, and there wasn't anything like that around at the time of previous discussions.
Personally, I fairly often write ad hoc scripts that use the JSON API, and as it stands it's very convenient for that. From what I can see of JSON-LD (which basically equates to "it adds some extra metadata keys that don't change the data content but do change the list of keys and maybe the nesting levels"), it would be somewhat inconvenient for my scripts, and add no extra capability that I would ever use.
Right, and this is still my main concern with the idea as well: I'd never be OK with a JSON-LD-only API, because it adds too much irrelevant cognitive overhead for the vast majority of Python packaging specific use cases. (I would see it as being akin to Python itself deciding to require type annotations, rather than merely allowing them).
However, where I'm starting to see a potential niche for it is as an opt-in capability, whereby we explicitly define how our metadata can be translated *to* JSON-LD, for folks that want to apply general purpose tools that know how to manipulate arbitrary JSON-LD data (like the graph visualiser I linked earlier).
That way, everybody wins - folks that have never heard of schema.org or linked data in general won't need to learn any concepts that are completely irrelevant to them, while folks that are aware of those things and the related tools will be free to use them without first having to figure out their own mapping from the Python specific metadata formats to a JSON-LD compatible format.
That approach then doesn't even need to wait for PEP 426: it could be done using the wheel METADATA file as a basis instead.
It will probably still be up to Wes to actually define that transformation though - I don't think anybody else is anywhere near keen enough to make use of the available JSON-LD tooling to spend any time working on enabling it :)
So,

## Justify JSONLD

- This is a graph. If we use an existing spec for graphs as JSON (i.e. JSONLD), we win:
  - all of the tools that already exist for working with said graphs in that format
  - easy indexability (as RDF quads)
  - compatibility with compatible specs like ld-signatures

## Implement JSONLD

- [ ] decide which URI(s) a project on {pypi,} is identified by
  - some projects should not have an implicit pypi.org URI prefix
- [ ] write a new JSONLD view for pypi and warehouse
- [ ] write a JSONLD metadata spec for eggs and wheels

## Support metadata retrieval without exec'ing setup.py

- develop a declarative format for expressing {sys.platform[...],}-dependent dependency edges

Signed,
Wes T.

P.P.S. This is just a hard week for me.
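On the "easy indexability" point, general-purpose JSON-LD tooling already covers the RDF-quads serialisation - a minimal sketch with the PyLD library (document contents invented for the example; the exact format string may vary between PyLD versions):

    from pyld import jsonld  # assumed installed: pip install PyLD

    doc = {
        "@context": {"name": "http://schema.org/name"},
        "@id": "https://pypi.org/project/example/",
        "name": "example",
    }

    # Serialise the JSON-LD graph as N-Quads, suitable for a triple store.
    print(jsonld.to_rdf(doc, {"format": "application/n-quads"}))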
On 22 August 2017 at 01:46, Wes Turner <wes.turner@gmail.com> wrote:
## Justify JSONLD

- This is a graph. If we use an existing spec for graphs as JSON (i.e. JSONLD), we win:
  - all of the tools that already exist for working with said graphs in that format
  - easy indexability (as RDF quads)
  - compatibility with compatible specs like ld-signatures
No, that's the argument I've already said doesn't work, since it doesn't address the readability problem Paul mentioned: yes, JSON-LD *can* represent arbitrary graphs, but to *read* a JSON-LD data structure as a human, you need to know how *JSON-LD* represents graphs. I don't consider that an acceptable limitation: the raw metadata needs to be readable by someone that's *only* familiar with the specifics of dependency management in Python, and couldn't care less about the representation of graphs as a general concept.
## Implement JSONLD

- [ ] decide which URI(s) a project on {pypi,} is identified by
  - some projects should not have an implicit pypi.org URI prefix
- [ ] write a new JSONLD view for pypi and warehouse
- [ ] write a JSONLD metadata spec for eggs and wheels
None of which are dependent on JSON-LD being the raw format for the metadata - this can instead be done as a postprocessing step that accepts any of the existing metadata formats as input. Defining such a transformation is going to be critical for your goals of making the JSON-LD representation useful anyway, as even if we defined a new metadata format tomorrow, that would still mean there were more than 700 thousand existing releases on pypi.org that didn't natively provide their metadata in that format.

The added bonus of doing things that way is that it means that you *don't* need anyone else's agreement or consensus to start design work - you can do an initial proof of concept using a domain you control, similar to the way Donald started out by building the PyPI that he wished existed as an independent service before we thanked him for his efforts by lumbering him with the spectacularly difficult task of figuring out how to upgrade or replace pypi.python.org itself :)
## Support metadata retrieval without exec'ing setup.py
- develop a declarative format for expressing {sys.platform[...],}-dependent dependency edges
This is already part of PEP 508: https://www.python.org/dev/peps/pep-0508/#environment-markers

This is why, given a wheel file, you can *already* extract declarative dependency metadata, using the METADATA file + PEP 508.

Given just an sdist, you can also do something similar by looking at PKG-INFO, but that's less reliable (since that file may not even be present, and even if it is, the sdist -> wheel build step may still inject additional dependencies).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
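For reference, those environment markers are machine-evaluable today via the `packaging` library - a small usage sketch:

    from packaging.markers import Marker  # assumed installed: pip install packaging

    marker = Marker('sys_platform == "win32" and python_version < "3.6"')

    # Evaluate against the running interpreter's environment...
    print(marker.evaluate())

    # ...or against an explicit environment, e.g. when resolving for a
    # platform other than the one the resolver is running on:
    print(marker.evaluate({"sys_platform": "win32", "python_version": "2.7"}))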
On Aug 20, 2017, at 7:09 AM, Paul Moore <p.f.moore@gmail.com> wrote:
On 19 August 2017 at 23:03, Wes Turner <wes.turner@gmail.com> wrote:
Here are some notes re: changing metadata:
I thought the proposal was to document the current state of affairs. Proposing a *change* to the JSON API would be a much bigger and more controversial proposal (and one I see little need for, personally).
I can answer this a bit more forcefully — at this point in time, Warehouse is not interested in any drastic changes to its API until after legacy PyPI is gone. Documenting the existing APIs, and small tweaks that may make them easier/better to use, are fine, but any wholesale new API or large-scale refactors are not going to be happening in the short term.

I don’t have a major opinion on a PEP for the JSON API or not. It depends, I guess, on whether tools like bandersnatch/devpi/etc want to offer it. Given that this is all brought on by a PR to bandersnatch, it appears that there is a reasonable argument that it is something that those tools want, and standardizing it is a good idea.

— Donald Stufft
On 24 August 2017 at 04:59, Donald Stufft <donald@stufft.io> wrote:
I don’t have a major opinion on a PEP for the JSON API or not. It depends, I guess, on whether tools like bandersnatch/devpi/etc want to offer it. Given that this is all brought on by a PR to bandersnatch, it appears that there is a reasonable argument that it is something that those tools want, and standardizing it is a good idea.
+1, especially as it will help clarify the required test cases for Warehouse as well (I'm not sure how much of the JSON API has been implemented at this point). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Thanks all for your points.

Is it fine if I get an Informational PEP going to discuss a 'Metadata Repository API'? I will structure it similarly to PEP 503, but also talk about:

- The data exposed today (Specification)
-- And possibly call on some PyPI people to correct me where I guess wrong
- How mirrors should mirror it
- Possible future enhancements (JSONLD etc.)

What else should we have in this PEP?

Cooper
On Aug 23, 2017, at 7:04 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 24 August 2017 at 04:59, Donald Stufft <donald@stufft.io> wrote:
I don’t have a major opinion on a PEP for the JSON API or not. It depends, I guess, on whether tools like bandersnatch/devpi/etc want to offer it. Given that this is all brought on by a PR to bandersnatch, it appears that there is a reasonable argument that it is something that those tools want, and standardizing it is a good idea.
+1, especially as it will help clarify the required test cases for Warehouse as well (I'm not sure how much of the JSON API has been implemented at this point).
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 24 August 2017 at 13:59, Cooper Ry Lees <me@cooperlees.com> wrote:
Thanks all for your points.
Is it fine if I get an Informational PEP going to discuss a 'Metadata Repository API'? I will structure it similarly to PEP 503, but also talk about:
- The data exposed today (Specification)
+1
-- And possibly call on some PyPI people to correct me where I guess wrong
Yep, that's part of the PEP review process - to indicate this, use the same BDFL-Delegate and Discussions-To values as are in PEP 503.
- How mirrors should mirror it
In particular, it would be ideal if they could generate it themselves from the already mirrored PEP 503 data, rather than having to make additional API calls to the main PyPI server.
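A rough sketch of that: PEP 503 allows each file link to carry a hash in its URL fragment, which is enough to derive a minimal per-project JSON file from an already-mirrored simple page (the output shape below is illustrative, not the full PyPI schema):

    import json
    from html.parser import HTMLParser
    from urllib.parse import urlparse


    class SimplePageParser(HTMLParser):
        """Collect (filename, hash fragment) pairs from a PEP 503 page."""

        def __init__(self):
            super().__init__()
            self.files = []

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            parsed = urlparse(dict(attrs).get("href", ""))
            self.files.append({
                "filename": parsed.path.rsplit("/", 1)[-1],
                "hash": parsed.fragment,  # e.g. "md5=abc123..."
            })


    def simple_page_to_json(html):
        """Render the collected file list as a minimal JSON document."""
        parser = SimplePageParser()
        parser.feed(html)
        return json.dumps({"files": parser.files}, indent=2)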
- Possible future enhancements (JSONLD etc.)
Discussion of future enhancements doesn't really belong in an Informational PEP. Instead, I've filed https://github.com/pypa/packaging-problems/issues/102 as a common place for folks to put notes about what they may want to see in a possible future API.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia