On Thursday, December 15, 2016, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 16 December 2016 at 05:50, Paul Moore <p.f.moore@gmail.com> wrote:
On 15 December 2016 at 19:13, Wes Turner <wes.turner@gmail.com> wrote:
Just to add my POV, I also find your posts unhelpful, Wes. There's not enough information for me to evaluate what you say, and you offer no actual solutions to what's being discussed.
I could quote myself suggesting solutions in this thread, if you like?
You offer lots of pointers to information. But that's different.
Exactly. There are *lots* of information processing standards out there, and lots of things we *could* provide natively that simply aren't worth the hassle, since folks that care can provide them as "after-market add-ons" for the audiences that consider them relevant.
For example, a few things that can matter to different audiences are:
- SPDX (Software Package Data Exchange) identifiers for licenses
- CPE (Common Platform Enumeration) and SWID (Software Identification) tags for published software
- DOI (Digital Object Identifier) tags for citation purposes
- Common Criteria certification for software supply chains
These are called properties in RDFS, and it takes very little effort to add additional ones. If an unqualified attribute is not listed in a JSON-LD @context, it can still be added by specifying a full URI as the property name.
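For instance, here's a rough sketch with a plain Python dict standing in for the JSON document (the URIs and the extra property are illustrative, not an agreed vocabulary):

import json

pkg = {
    "@context": {
        "name": "http://schema.org/name",
        "version": "http://schema.org/version",
    },
    "@id": "https://pypi.org/project/example-package/",
    "name": "example-package",
    "version": "1.0.0",
    # An attribute that isn't in the @context, added by using a full
    # URI as the property name (the vocabulary URI is made up here):
    "http://example.org/vocab#buildReproducible": True,
}

print(json.dumps(pkg, indent=2))

Consumers that don't know the extra property can ignore it; consumers that do can follow the URI.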
I don't push for these upstream in distutils-sig not because I don't think they're important in general, but because I *don't think they're a priority for distutils-sig*. If you're teaching Python to school students, or teaching engineers and scientists how to better analyse their own data, or building a web service for yourself or your employer, these kinds of things simply don't matter.
Issue #31 lists a number of advantages. Off the top of my head, CVE security reports could be linked to the project/package URI (and thus displayed along with the project detail page).
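Roughly, the advisory record would just reference the package URI; the identifier and the "affects" vocabulary term below are placeholders:

# Sketch: a security advisory published elsewhere that points at the
# package URI, so a project detail page (or any other consumer) can
# discover it by following that edge.
advisory = {
    "@context": {
        "affects": {"@id": "http://example.org/vocab#affects", "@type": "@id"},
    },
    "@id": "https://cve.example.org/CVE-XXXX-NNNNN",
    "affects": "https://pypi.org/project/example-package/",
}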
The end users that care about them are well-positioned to tackle them on their own (or pay other organisations to do it for them), and because they span arbitrary publishing communities anyway, it doesn't really matter all that much if any given publishing community participates directly in the process (the only real beneficiaries are the intermediaries that actively blur the distinctions between the cooperative communities and the recalcitrant ones).
Linked Data minimizes exactly that kind of re-work: when datasets share URIs, they can be integrated without every intermediary writing its own parsers and JOIN logic.
Anyway, let's just agree to differ - I can skip your mails if they aren't helpful to me, and you don't need to bother about the fact that you're not getting your points across to me.
I consider it fairly important that we have a reasonably common understanding of the target userbase for direct consumption of PyPI data, and what we expect to be supplied as third party services. It's also important that we have a shared understanding of how to constructively frame proposals for change.
When I can afford the time, I'll again take a look at fixing the metadata specification once and for all by (1) defining an @context for the existing metadata, and (2) producing an additional pydist.jsonld TODO metadata document (because the releases are currently keyed by version), and (3) adding the model attribute and view to Warehouse.
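As a rough illustration of step (1), a possible @context for a few of the existing metadata fields (the schema.org term choices here are illustrative, not a settled mapping):

# Sketch: map a few fields of the existing PyPI JSON metadata onto
# schema.org terms via an @context (illustrative mapping only).
PYDIST_CONTEXT = {
    "name": "http://schema.org/name",
    "version": "http://schema.org/softwareVersion",
    "summary": "http://schema.org/description",
    "home_page": {"@id": "http://schema.org/url", "@type": "@id"},
    "requires_dist": "http://schema.org/softwareRequirements",
}

# Warehouse could then serve the existing metadata dict unchanged,
# with {"@context": PYDIST_CONTEXT} merged in.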
For the former, the Semantic Web, and folks that care about Semantic Web concepts like "Linked Data" in the abstract sense are not part of our primary audience. We don't go out of our way to make their lives difficult, but "it makes semantic analysis easier" also isn't a compelling rationale for change.
Unfortunately, you are not well-versed in the problems that Linked Data solves, so the working assumption is that it's all your data, in your schema, in your database, and that URIs are far less useful than RAM-local references (pointers). See: BP-LD (the W3C "Best Practices for Publishing Linked Data" note).
For the latter, some variants of constructive proposals look like:
- "this kind of user has this kind of problem and this proposed solution will help mitigate it this way (and, by the way, here's an existing standard we can use)" - "this feature exists in <third party tool or service>, it's really valuable to users for <these reasons>, how about we offer it by default?" - "I wrote <thing> for myself, and I think it would also help others for <these reasons>, can you help me make it more widely known and available?"
One could stuff additional metadata into # comments in a requirements.txt, but that would be an ad-hoc parsing scheme with a single-point-of-failure (SPOF) tool dependency.
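For illustration only (the identifiers are placeholders), this is exactly the kind of one-off convention, and the one parser that would have to exist for it, that I mean:

import re

# Ad-hoc scheme: structured metadata smuggled into requirements.txt
# comments. Only a tool that knows this exact convention can read it
# back out, hence the single-point-of-failure tool dependency.
REQUIREMENTS = """\
requests==2.12.0  # license=Apache-2.0 spdx=Apache-2.0
example-package==1.0.0  # doi=10.0000/example
"""

for line in REQUIREMENTS.splitlines():
    spec, _, comment = line.partition("#")
    tags = dict(re.findall(r"([\w-]+)=(\S+)", comment))
    print(spec.strip(), tags)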
They don't look like "Here's a bunch of technologies and organisations that exist on the internet that may in some way potentially be relevant to the management of a software distribution network", and they don't look like "This data modeling standard exists, so we should use it, even though it doesn't actually simplify our lives or our users' lives in any way, and in fact makes them more complicated".
Those badges we all (!) add to our README.rst long_descriptions point to third-party services holding lots of potentially structured linked data that is very relevant to curating a collection of resources: test coverage, build stats, discoverable documentation that could be searched en masse, security vulnerability reports, downstream packages. Unfortunately, they're just <a href> links, when they could be <a href property="URI"> edges that other tools could make use of.
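A rough sketch of what a consumer could do with such edges (the property URIs are placeholders, not an agreed vocabulary):

from html.parser import HTMLParser

# If badge links carried an RDFa-style "property" attribute, a tool
# could harvest typed (property, href) edges instead of bare hyperlinks.
SNIPPET = """
<a property="http://example.org/vocab#buildStatus"
   href="https://travis-ci.org/example/example-package">build</a>
<a property="http://example.org/vocab#coverage"
   href="https://coveralls.io/r/example/example-package">coverage</a>
"""

class BadgeEdges(HTMLParser):
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "property" in attrs and "href" in attrs:
            print(attrs["property"], "->", attrs["href"])

BadgeEdges().feed(SNIPPET)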
Who knows, one day I might find the time to look into JSON-LD, at which point I may or may not understand why you think it's such a useful tool for solving all these problems (in spite of the fact that no-one else seems to think the same...)
It would be logically fallacious of me to suggest, without an understanding of a web-standard graph representation format, that it's not sufficient (or ideally suited) for these very use cases. Issue #31 (TODO) somewhat laboriously lists the ROI, though I haven't yet had the time for an impact study.
I *have* looked at JSON-LD (based primarily on Wes's original suggestions), both from the perspective of the Python packaging ecosystem specifically, as well as my day job working on software supply chain management.
I recognize your expertise and your preference for given Linux distributions. I can tell you that, while many of the Linked Data examples describe social-graph applications about Bob and Alice, there are very many domains where Linked Data is worth learning: medicine (research, clinical) and open government data (where tool dependence is a no-no and a lost opportunity), for example. When you have data spread across lots of different datasets, it really starts to make sense to:
- use URIs as keys
- use URIs as column names
- recognize that you're just reimplementing graph semantics which are already well solved (RDF, RDFS, OWL, and now JSON-LD, thanks to JavaScript/JSON ubiquity)
My verdict was that, for managing a dependency graph implementation, it ends up in the category of technologies that qualify as "interesting, but not helpful". In many ways, it's the urllib2 of data linking: just as urllib2 gives you a URL handling framework which you can configure to handle HTTP, rather than just providing an HTTP-specific interface the way requests does [1], JSON-LD gives you a data linking framework, which you can then use to define links between your data, rather than just linking the data directly in a domain-appropriate fashion. Using a framework for the sake of using a framework, rather than out of a genuine engineering need, doesn't tend to lead to good software systems.
Interesting analogy: urllib, urlparse, urllib2, urllib3, requests; and now we have certificate hostname checking. SemWeb standards are also layered. There were other standards for expressing triples in JSON (which still exist), but none could map an existing JSON document to RDF with such flexibility. Your last sentence is a truism.
Wes seems to think that my perspective on this is born of ignorance, and that repeatedly bringing it up may make me change my point of view. However, our problems haven't changed, and the nature and purpose of JSON-LD haven't changed, so it really won't - the one thing that would change my mind is the demonstrated popularity and utility of a service that integrates raw PyPI data with JSON-LD and schema.org.
Hence my suggestions (made with varying degrees of politeness) to go build a dependency analysis service that extracts the dependency trees from libraries.io, maps them to schema.org concepts in a way that makes sense, and then demonstrate what that makes possible that can't be done with the libraries.io data directly.
A chicken-and-egg problem, ironically. There are many proprietary solutions for aggregating software quality information, all of which must write parsers and JOIN logic for each packaging ecosystem's ad-hoc partial graph implementation. When the key of a thing is a URI, other datasets which reference the same URI just magically join together: the justifying metadata for a package in a curated collection could simply join with the actual package metadata (and the aforementioned data sources).
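A minimal sketch of that join (the records and the review field are invented for illustration):

from collections import defaultdict

# Two independently published datasets that key their records on the same
# package URI can be merged without any per-ecosystem parsing or mapping.
pypi_metadata = [
    {"@id": "https://pypi.org/project/example-package/", "version": "1.0.0"},
]
curation_notes = [
    {"@id": "https://pypi.org/project/example-package/",
     "reviewNotes": "meets criteria X and Y"},
]

merged = defaultdict(dict)
for dataset in (pypi_metadata, curation_notes):
    for record in dataset:
        merged[record["@id"]].update(record)

print(dict(merged))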
Neither Wes nor anyone else needs anyone's permission to go do that, and it will be far more enjoyable for all concerned than the status quo where Wes is refusing to take "No, we've already looked at JSON-LD, and we believe it adds needless complexity for no benefit that we care about" for an answer by continuing to post about it *here* rather than either venting his frustrations about our collective lack of interest somewhere else, or else channeling that frustration into building the system he wishes existed.
If you've never written an @context for an existing JSON schema, I question both your assessment of the complexity and your experience with sharing graph data across myriad applications. But that's irrelevant, because here all I apparently need is a table of dataset-local autoincrement IDs, some columns, and ALTER TABLE migrations; then someone else can write a parser for the schema I expose with my JSON REST API, so that I can JOIN this data with other useful datasets, in order to share a versioned Collection of CreativeWorks which already have URIs.

Because I'm unsatisfied with requirements.txt: it's line-based, I can't just add additional attributes, and there's no real key, because of indexes and editable URLs stuffed with checksums and egg/wheel names. And the JSON metadata specification is fixed and doesn't support URI attribute names, so I can't just add additional attributes or values from a controlled vocabulary as needed - unless it's reStructuredText-turned-HTML (now with pypi:readme). Is that in PEP form? Someone should really put together an industry group to produce some UML here, so our tools can talk and JOIN on URIs, with versions, platforms, and custom URI schemes, in order to curate a collection of packages, as a team, with group permissions, according to structured criteria (and comments!), because "ld-signatures".

A #LinkedMetaAnalyses (#LinkedReproducibility) application would similarly support curation of resources with URIs and defined criteria, in order to elicit redundant expert evaluations of CreativeWorks (likely with JSON-LD). In light of the reception here, that may be a better use of resources.

I've said my piece; I'll leave you to find a solution to this use case which minimizes re-work and maximizes data integration potential.

[Pip checksum docs, JSONLD] https://github.com/pypa/interoperability-peps/issues/31#issuecomment-1609701...
Cheers, Nick.
[1] http://www.curiousefficiency.org/posts/2016/08/what-problem-does-it-solve.html
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia