On Thursday, December 15, 2016, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 16 December 2016 at 05:50, Paul Moore <p.f.moore@gmail.com> wrote:
> On 15 December 2016 at 19:13, Wes Turner <wes.turner@gmail.com> wrote:
>>> Just to add my POV, I also find your posts unhelpful, Wes. There's not
>>> enough information for me to evaluate what you say, and you offer no
>>> actual solutions to what's being discussed.
>>
>>
>> I could quote myself suggesting solutions in this thread, if you like?
>
> You offer lots of pointers to information. But that's different.

Exactly. There are *lots* of information processing standards out
there, and lots of things we *could* provide natively that simply
aren't worth the hassle, since folks that care can provide them as
"after-market add-ons" for the audiences that consider them relevant.

For example, a few things that can matter to different audiences are:

- SPDX (Software Package Data Exchange) identifiers for licenses
- CPE (Common Platform Enumeration) and SWID (Software Identification)
tags for published software
- DOI (Digital Object Identifier) tags for citation purposes
- Common Criteria certification for software supply chains

In RDFS terms, these are called properties.

It takes very little effort to add additional properties: if an unqualified attribute is not listed in a JSON-LD @context, it can still be added by specifying a full URI.
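For example, a minimal sketch, assuming the third-party pyld package; the term mappings and the example.org property are illustrative, not part of any metadata spec:

    # Minimal sketch of adding extra properties to package metadata with
    # JSON-LD. Assumes the third-party "pyld" package; the term mappings
    # and the example.org property below are illustrative only.
    import json
    from pyld import jsonld

    doc = {
        "@context": {
            "name": "http://schema.org/name",
            "license": "http://spdx.org/rdf/terms#licenseId",
        },
        "name": "example-package",
        "license": "MIT",
        # Not listed in the @context, so qualify it with a full URI:
        "http://example.org/vocab#cpe": "cpe:2.3:a:example:example-package:1.0",
    }

    # Expansion resolves every term to a full URI, so consumers don't
    # need to know our local attribute names.
    print(json.dumps(jsonld.expand(doc), indent=2))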
 

I don't push for these upstream in distutils-sig not because I don't
think they're important in general, but because I *don't think they're
a priority for distutils-sig*. If you're teaching Python to school
students, or teaching engineers and scientists how to better analyse
their own data, or building a web service for yourself or your
employer, these kinds of things simply don't matter.

Issue #31 lists a number of advantages.
Off the top of my head: CVE security reports could be linked to the project/package URI (and thus displayed alongside the project detail page).
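An illustrative-only sketch (the property choice and the CVE identifier here are made up, not an agreed vocabulary):

    # Illustrative only: linking a security advisory to a PyPI project URI.
    # rdfs:seeAlso is a deliberately generic property choice, and the CVE
    # identifier below is fictional.
    advisory = {
        "@id": "https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-99999",
        "http://www.w3.org/2000/01/rdf-schema#seeAlso": {
            "@id": "https://pypi.org/project/example-package/"
        },
    }

    # Any tool that already knows the project URI can discover the advisory
    # by following (or indexing) that edge.
    print(advisory)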
 

The end users that care about them are well-positioned to tackle them
on their own (or pay other organisations to do it for them), and
because they span arbitrary publishing communities anyway, it doesn't
really matter all that much if any given publishing community
participates directly in the process (the only real beneficiaries are
the intermediaries that actively blur the distinctions between the
cooperative communities and the recalcitrant ones).

Linked Data minimizes exactly that kind of intermediary re-work.
 

> Anyway, let's just agree to differ - I can skip your mails if they
> aren't helpful to me, and you don't need to bother about the fact that
> you're not getting your points across to me.

I consider it fairly important that we have a reasonably common
understanding of the target userbase for direct consumption of PyPI
data, and what we expect to be supplied as third party services. It's
also important that we have a shared understanding of how to
constructively frame proposals for change.

When I can afford the time, I'll take another look at fixing the metadata specification once and for all by (1) defining an @context for the existing metadata, (2) producing an additional pydist.jsonld metadata document (because releases are currently keyed by version), and (3) adding the model attribute and view to Warehouse.
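To make (1) and (2) concrete, a rough, hypothetical sketch of a pydist.jsonld document; the @context mappings to schema.org are my guesses, not a finished vocabulary:

    # Hypothetical sketch of a pydist.jsonld wrapping today's metadata keys.
    # The schema.org mappings are guesses for illustration, not a spec.
    pydist = {
        "@context": {
            "name": "http://schema.org/name",
            "version": "http://schema.org/softwareVersion",
            "summary": "http://schema.org/description",
            "home_page": {"@id": "http://schema.org/url", "@type": "@id"},
        },
        # One node per release, so releases are no longer just dict keys:
        "@id": "https://pypi.org/project/example-package/1.0/",
        "@type": "http://schema.org/SoftwareApplication",
        "name": "example-package",
        "version": "1.0",
        "summary": "An example package",
        "home_page": "https://example.org/",
    }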
 

For the former, the Semantic Web, and folks that care about Semantic
Web concepts like "Linked Data" in the abstract sense are not part of
our primary audience. We don't go out of our way to make their lives
difficult, but "it makes semantic analysis easier" also isn't a
compelling rationale for change.

Unfortunately, folks here tend not to be well-versed in the problems that Linked Data solves: as long as it's all your data, in your schema, in your database, URIs look far less useful than RAM-local references (pointers).

See: the W3C's "Best Practices for Publishing Linked Data" (BP-LD).
 

For the latter, some variants of constructive proposals look like:

- "this kind of user has this kind of problem and this proposed
solution will help mitigate it this way (and, by the way, here's an
existing standard we can use)"
- "this feature exists in <third party tool or service>, it's really
valuable to users for <these reasons>, how about we offer it by
default?"
- "I wrote <thing> for myself, and I think it would also help others
for <these reasons>, can you help me make it more widely known and
available?"
 
One could stuff additional metadata into # comments in a requirements.txt, but that would be an ad-hoc parsing scheme with a single-point-of-failure (SPOF) tool dependency.
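For example, a sketch of what such an (entirely hypothetical) comment convention and its one-off parser might look like:

    # Hypothetical: stuffing metadata into requirements.txt comments.
    # Only a tool that knows this exact ad-hoc convention can read it,
    # which is the single-point-of-failure problem.
    REQUIREMENTS = """\
    requests==2.12.0   # license=Apache-2.0 audited=2016-12-01
    six==1.10.0        # license=MIT
    """

    def parse_comment_metadata(text):
        for line in text.splitlines():
            req, _, comment = line.partition("#")
            extras = dict(
                field.split("=", 1) for field in comment.split() if "=" in field
            )
            if req.strip():
                yield req.strip(), extras

    for req, extras in parse_comment_metadata(REQUIREMENTS):
        print(req, extras)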


They don't look like "Here's a bunch of technologies and organisations
that exist on the internet that may in some way potentially be
relevant to the management of a software distribution network", and
nor do they look like "This data modeling standard exists, so we
should use it, even though it doesn't actually simplify our lives or
our users' lives in any way, and in fact makes them more complicated".

Those badges we all (!) add to our README.rst long_descriptions point to third-party services holding lots of potentially structured linked data that is very relevant to curating a collection of resources: test coverage, build status, discoverable documentation which could be searched en masse, security vulnerability reports, downstream packages.
Unfortunately, they're just <a href> links,
whereas they could be <a href property="URI"> edges that other tools could make use of.
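A rough sketch of what that would buy a consuming tool, using only the standard library; the property values are placeholders, not an agreed vocabulary:

    # Sketch: once badge links carry a `property` attribute, any tool can
    # pull typed edges out of a rendered long_description. The property
    # values below are placeholders.
    from html.parser import HTMLParser

    HTML = """
    <a href="https://travis-ci.org/example/example" property="ex:buildStatus">build</a>
    <a href="https://coveralls.io/r/example/example" property="ex:coverage">coverage</a>
    <a href="https://example.readthedocs.io/">docs</a>  <!-- untyped: just a link -->
    """

    class EdgeCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.edges = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and "property" in attrs:
                self.edges.append((attrs["property"], attrs.get("href")))

    collector = EdgeCollector()
    collector.feed(HTML)
    print(collector.edges)  # [('ex:buildStatus', 'https://...'), ('ex:coverage', ...)]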
 

> Who knows, one day I
> might find the time to look into JSON-LD, at which point I may or may
> not understand why you think it's such a useful tool for solving all
> these problems (in spite of the fact that no-one else seems to think
> the same...)

It would be logically fallacious of me to suggest, without an understanding of a web-standard graph representation format, that it isn't sufficient (or ideally suited) for these very use cases.

Issue #31 somewhat laboriously lists the ROI, though I haven't yet had the time for an impact study.
 

I *have* looked at JSON-LD (based primarily on Wes's original
suggestions), both from the perspective of the Python packaging
ecosystem specifically, as well as my day job working on software
supply chain management.

I recognize your expertise, and your preference for particular Linux distributions.

I can tell you that, while many of the Linked Data examples out there describe social-graph applications about Alice and Bob, there are many domains where Linked Data is worth learning: medicine (research and clinical), and open government data (where tool dependence is a no-no and a lost opportunity).

When you have data in lots of different datasets, it really starts to make sense to:
- use URIs as keys
- use URIs as column names
- recognize that you're just reimplementing graph semantics that are already well-solved (RDF, RDFS, OWL, and now JSON-LD, because JavaScript); see the sketch below
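Here is that sketch, assuming the third-party pyld package; the term-to-URI mappings are illustrative:

    # Sketch: two datasets use different local column names, but their
    # @contexts map those names to the same URI, so JSON-LD expansion makes
    # the records directly comparable without a per-dataset parser.
    # Assumes the third-party "pyld" package.
    from pyld import jsonld

    pypi_row = {
        "@context": {"pkg_name": "http://schema.org/name"},
        "@id": "https://pypi.org/project/example-package/",
        "pkg_name": "example-package",
    }
    audit_row = {
        "@context": {"project": "http://schema.org/name"},
        "@id": "https://pypi.org/project/example-package/",
        "project": "example-package",
    }

    # Both expand to the same @id and the same property URI.
    print(jsonld.expand(pypi_row))
    print(jsonld.expand(audit_row))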
 

My verdict was that for managing a dependency graph implementation, it
ends up in the category of technologies that qualify as "interesting,
but not helpful". In many ways, it's the urllib2 of data linking -
just as urllib2 gives you a URL handling framework which you can
configure to handle HTTP rather than just providing a HTTP-specific
interface the way requests does [1], JSON-LD gives you a data linking
framework, which you can then use to define links between your data,
rather than just linking the data directly in a domain-appropriate
fashion. Using a framework for the sake of using a framework rather
than out of a genuine engineering need doesn't tend to lead to good
software systems.

Interesting analogy: urllib, urlparse, urllib2, urllib3, requests; and now we have certificate hostname checking.

SemWeb standards are also layered. There were earlier standards for triples in JSON, and they do still exist, but none could map an existing JSON document to RDF with such flexibility.

And the closing point about using a framework for its own sake is a truism.

 

Wes seems to think that my perspective on this is born out of
ignorance, so repeatedly bringing it up may make me change my point of
view. However, our problems haven't changed, and the nature and
purpose of JSON-LD haven't changed, so it really won't - the one thing
that will change my mind is demonstrated popularity and utility of a
service that integrates raw PyPI data with JSON-LD and schema.org.

Hence my suggestions (made with varying degrees of politeness) to go
build a dependency analysis service that extracts the dependency trees
from libraries.io, maps them to schema.org concepts in a way that
makes sense, and then demonstrate what that makes possible that can't
be done with the libraries.io data directly.

A chicken-and-egg problem, ironically.

There are many proprietary solutions for aggregating software quality information, all of which must write parsers and JOIN logic for each packaging ecosystem's ad-hoc, partial graph implementation.

When the key of a thing is a URI, other datasets which reference the same URI just magically join together:
the justifying metadata for a package in a curated collection could simply join with the actual package metadata (and the aforementioned data sources).
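A rough sketch, with made-up data, of that "magical" join:

    # Rough sketch with made-up data: when every dataset keys records by the
    # same URI, joining them is a dictionary merge, not a parser project.
    curation_notes = {
        "https://pypi.org/project/example-package/": {
            "reason": "pinned for the 2017 curriculum",
            "reviewed_by": "someone@example.org",
        },
    }
    package_metadata = {
        "https://pypi.org/project/example-package/": {
            "name": "example-package",
            "version": "1.0",
        },
    }
    advisories = {
        "https://pypi.org/project/example-package/": ["CVE-2016-99999 (fictional)"],
    }

    merged = {}
    for dataset in (curation_notes, package_metadata, advisories):
        for uri, record in dataset.items():
            merged.setdefault(uri, []).append(record)

    print(merged)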
 

Neither Wes nor anyone else needs anyone's permission to go do that,
and it will be far more enjoyable for all concerned than the status
quo where Wes is refusing to take "No, we've already looked at
JSON-LD, and we believe it adds needless complexity for no benefit
that we care about" for an answer by continuing to post about it
*here* rather than either venting his frustrations about our
collective lack of interest somewhere else, or else channeling that
frustration into building the system he wishes existed.

If you've never written an @context for an existing JSON schema, I question both your assessment of the complexity and your experience with sharing graph data with myriad applications.
But that's irrelevant,
Because here all I think I need is a table of dataset-local autoincrement IDs and some columns,
And ALTER TABLE migrations,
And then someone else can write a parser for the schema I expose with my JSON REST API,
So that I can JOIN this data with other useful datasets
(In order to share a versioned Collection of CreativeWorks which already have URIs),
Because I'm unsatisfied with requirements.txt,
Because it's line-based,
And I can't just add additional attributes,
And there's no real key, because of index URLs and editable URLs stuffed with checksums and egg/wheel names,
Oh, and the JSON metadata specification is fixed and doesn't support URI attribute names,
So I can't just add additional attributes or values from a controlled vocabulary as needed,
Unless it's reStructuredText-turned-HTML
(now with pypi:readme).

Is that in PEP form?

Someone should really put together an industry group to produce some UML here,
So our tools can talk,
And JOIN on URIs
With versions, platforms, and custom URI schemes,
In order to curate a collection of packages,
As a team,
With group permissions,
According to structured criteria (and comments!).

Because "ld-signatures".


A #LinkedMetaAnalyses (#LinkedReproducibility) application would similarly support curation of resources with URIs and defined criteria, in order to elicit redundant expert evaluations of CreativeWorks (likely with JSON-LD). In light of the reception here, that may be a better use of resources.


I've said my piece;
I'll leave you all to find a solution to this use case that minimizes re-work and maximizes data integration potential.

[Pip checksum docs, JSONLD] https://github.com/pypa/interoperability-peps/issues/31#issuecomment-160970112


Cheers,
Nick.

[1] http://www.curiousefficiency.org/posts/2016/08/what-problem-does-it-solve.html

--
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia