
Steve, Nick:

On Sun, Feb 23, 2020 at 10:04 AM Steve Jorgensen <stevej@stevej.name> wrote:
Nick Coghlan wrote:
On Sun, 23 Feb 2020 at 08:40, Ian Stapleton Cordasco graffatcolmingov@gmail.com wrote:
Forgive me if I'm missing something, but doesn't license-file provide this functionality? (See https://stackoverflow.com/a/48691876 for an example.) I surmise not enough people use it, although it's readily available?

This is likely to be the case, as license-file[s] is a setuptools feature aimed at ensuring the license file ends up in the sdist/wheel archive, rather than a published metadata field aimed at allowing other tools to find that license file within the sdist/wheel archive.

There's a pre-draft PEP in discussion at https://github.com/pombredanne/spdx-pypi-pep/pull/2 and https://discuss.python.org/t/improving-license-clarity-with-better-package-m... that looks at clarifying licensing metadata through the use of SPDX classifiers. That draft PEP also formalises the "License-File" field.

The approach I'm currently taking to this problem is to combine https://github.com/nexB/scancode-toolkit/blob/develop/README.rst for finding component licenses with https://github.com/nexB/aboutcode-toolkit to generate an open source attribution bundle for those components. The one key caveat on that approach is that the initial scancode output requires some non-trivial cleanup before you can feed it into the aboutcode ABOUT file generator when first applying it to a project: https://github.com/nexB/aboutcode-toolkit/issues/416

Cheers,
Nick.

P.S. As with a lot of distribution-related issues, the key challenge with making improvements in this space is that developers really need tools that work today to meet their open source attribution obligations (such as nexB's scancode & aboutcode toolkits), while metadata-level improvements (like Philippe's draft PEP) will take years to cover a significant proportion of published packages (and there's a long tail of rarely updated projects that may never catch up).
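To make the distinction above concrete (a license file being present in the archive versus a tool being able to find it), here is a minimal sketch of how a tool might look for license files inside a built wheel. It relies only on the fact that a wheel is a zip archive with a top-level *.dist-info directory. The function name find_license_files and the list of file-name prefixes are assumptions made for illustration; they are not part of setuptools, the draft PEP, or scancode.

```python
import zipfile

# Assumption: file-name prefixes that commonly indicate license text.
# This list is illustrative, not taken from any specification.
LICENSE_PREFIXES = ("license", "licence", "copying", "notice")


def find_license_files(wheel):
    """Return paths of likely license files under a wheel's *.dist-info directory.

    `wheel` may be a filesystem path or a binary file-like object.
    """
    matches = []
    with zipfile.ZipFile(wheel) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            # Only consider entries below a top-level *.dist-info directory.
            if len(parts) < 2 or not parts[0].endswith(".dist-info"):
                continue
            # Match on the file name only, ignoring case.
            if parts[-1].lower().startswith(LICENSE_PREFIXES):
                matches.append(name)
    return sorted(matches)
```

A name-based guess like this is exactly the kind of heuristic that a published metadata field would make unnecessary, since the field would name the file directly.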
Thanks for your very informative & useful reply. :)
Nick: What you are doing with the scancode and aboutcode toolkits seems super yummy and would likely be super useful elsewhere! If you think there is something we could extract to make it part of the tools, I am game to help. And I need to submit that draft PEP BTW :]

Steve: that PEP eventually documents the de facto, undocumented behaviour that includes license_file(s) in built wheels. The field already exists and is already supported, so it can be used.

To Nick's point, it is going to take a long while to fix it all in the actual packages. That said, I am also involved in an initiative to help along the way, which will hopefully bring this down to only 100 years instead of the original thousand years needed to fix the problem (see https://clearlydefined.io ). There we are:

1. scanning with scancode ALL the packages (Python + everything else, if there is such a thing ;) )
2. "scoring" licensing data quality with this approach: https://github.com/clearlydefined/license-score/blob/master/ClearlyLicensedM... The license scoring includes whether the full license text is present in the package (which is your original concern).
3. having volunteers review that data for accuracy and correctness, and fix it if needed.
4. eventually pushing fixes back upstream.

There is also a Google Summer of Code project, https://github.com/nexB/aboutcode/wiki/Project-Ideas-Improve-License-Detecti... , to do some large-scale analysis of the 10M scans we have on hand.

Do not hesitate to reach out on or off list.

--
Cordially
Philippe Ombredanne
pom@nexb.com
(scancode-toolkit maintainer)
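Since the message above notes that the License-File field can already be used, here is a rough sketch of how a tool might read the license-related fields from a package's METADATA file. Core metadata uses an email-style header format, so the standard library's email parser can read it. The helper name license_fields and the keys of the returned dictionary are hypothetical, chosen only for this example.

```python
from email.parser import Parser


def license_fields(metadata_text):
    """Extract license-related fields from the text of a METADATA file."""
    msg = Parser().parsestr(metadata_text)
    classifiers = msg.get_all("Classifier") or []
    return {
        # Free-text License field, or None if absent.
        "license": msg.get("License"),
        # License-File may appear several times; collect every value.
        "license_files": msg.get_all("License-File") or [],
        # Keep only the trove classifiers that describe licensing.
        "license_classifiers": [
            c for c in classifiers if c.startswith("License ::")
        ],
    }
```

An empty "license_files" list is the case discussed in this thread: the metadata gives a tool no pointer to the license text, even if a license file is present in the archive.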