Provide a way to bundle and extract license files

The project I am working on now needs to include license files for all of the third-party code that it includes. Since license files are not included in distribution packages, the process for doing this is exceedingly complex, error-prone, and in some cases, impossible. The build process builds URLs to license files in package source code repositories based on version number, but this depends on several things:
1. The source code repository is accessible via the Internet and still exists.
2. There is a tag for each version (not always true).
3. Tags follow a consistent naming pattern with respect to version numbers (not always true).
I believe that it makes sense to provide a standard means for distribution packages to include license files and then encourage package authors to use it.

Steve Jorgensen wrote:
This would be similar to `{ "license" : "SEE LICENSE IN <filename>" }` in an npm package. See also https://www.npmjs.com/package/license-extractor .

Forgive me if I'm missing something, but doesn't license-file provide this functionality? See https://stackoverflow.com/a/48691876 for an example. I surmise not enough people use it, although it's readily available?
On Sat, Feb 22, 2020, 13:35 Steve Jorgensen <stevej@stevej.name> wrote:
This would be similar to `{ "license" : "SEE LICENSE IN <filename>" }` in an npm package. See also https://www.npmjs.com/package/license-extractor .
--
Distutils-SIG mailing list -- distutils-sig@python.org
To unsubscribe send an email to distutils-sig-leave@python.org
https://mail.python.org/mailman3/lists/distutils-sig.python.org/
Message archived at https://mail.python.org/archives/list/distutils-sig@python.org/message/CSS5L...

Ian Stapleton Cordasco wrote:
Forgive me if I'm missing something, but doesn't license-file provide this functionality? See https://stackoverflow.com/a/48691876 for an example. I surmise not enough people use it, although it's readily available?
Apparently, it was me who was missing something. It looks like many or possibly most of the packages we are using actually do contain license files. I guess I originally thought that none did because the first couple of packages that I inspected when originally setting up our process did not.

Steve Jorgensen wrote:
Looking at the files installed for Django 2.2.10 as an example, I see that there is a LICENSE.txt file in the Django-2.2.10.dist-info directory, but there is no metadata in the installation indicating that it is the license file. The setup.cfg file in the Django source code repo does not include a `license_files =` entry.
Looking at psycopg2 2.8.4 as another example, there is a LICENSE file in the psycopg2-2.8.4.dist-info directory, but again there is no metadata indicating that this is the license file. In the source code repository for psycopg2, there is a `license_file = LICENSE` entry in setup.cfg (should that be `license_files =`?), but there is nothing in the distribution to reflect that.
I suppose for packages like these that do include license files, I can look for files named things like LICENSE.*, COPYING.*, etc., but it would be nice if there were something in the metadata for the installed package that specifically points to the license file(s).
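
The file-name heuristic described above could be sketched roughly as follows, using the standard-library `importlib.metadata` module (Python 3.8+). This is only an illustration under stated assumptions: the pattern list and the helper name `guess_license_files` are made up for this example, and only files recorded under a `.dist-info` directory are considered.

```python
import fnmatch
from importlib import metadata

# Assumed naming patterns (hypothetical, not a standard): they mirror the
# "LICENSE.*, COPYING.*, etc." heuristic and will miss unusual file names.
LICENSE_PATTERNS = ("LICEN[CS]E*", "COPYING*")


def guess_license_files(dist):
    """Return paths of files under dist's .dist-info directory that look like licenses.

    `dist` is an importlib.metadata.Distribution, for example the result of
    metadata.distribution("Django").
    """
    matches = []
    for entry in dist.files or []:  # dist.files is None when no file list was recorded
        # Only consider files that live below a *.dist-info directory.
        in_dist_info = any(part.endswith(".dist-info") for part in entry.parts[:-1])
        name = entry.name.upper()  # compare file names case-insensitively
        if in_dist_info and any(fnmatch.fnmatchcase(name, p) for p in LICENSE_PATTERNS):
            matches.append(dist.locate_file(entry))
    return matches
```

For an installed package, `guess_license_files(metadata.distribution("psycopg2"))` would return whatever matching paths were recorded at install time; it remains a guess based on file names, not a statement from the package's metadata.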

On Sun, 23 Feb 2020 at 08:40, Ian Stapleton Cordasco <graffatcolmingov@gmail.com> wrote:
Forgive me if I'm missing something, but doesn't license-file provide this functionality? See https://stackoverflow.com/a/48691876 for an example.
I surmise not enough people use it, although it's readily available?
This is likely to be the case, as license-file[s] is a setuptools feature aimed at ensuring the license file ends up in the sdist/wheel archive, rather than a published metadata field aimed at allowing other tools to *find* that license file within the sdist/wheel archive.
There's a pre-draft PEP in discussion at https://github.com/pombredanne/spdx-pypi-pep/pull/2 and https://discuss.python.org/t/improving-license-clarity-with-better-package-m... that looks at clarifying licensing metadata through the use of SPDX classifiers. That draft PEP also formalises the "License-File" field.
The approach I'm currently taking to this problem is to combine https://github.com/nexB/scancode-toolkit/blob/develop/README.rst for finding component licenses with https://github.com/nexB/aboutcode-toolkit to generate an open source attribution bundle for those components. The one key caveat on that approach is that the initial scancode output requires some non-trivial cleanup before you can feed it into the aboutcode ABOUT file generator when first applying it to a project: https://github.com/nexB/aboutcode-toolkit/issues/416
Cheers,
Nick.
P.S. As with a lot of distribution related issues, the key challenge with making improvements in this space is that developers really need tools that work *today* to meet their open source attribution obligations (such as nexB's scancode & aboutcode toolkits), while metadata level improvements (like Philippe's draft PEP) will take years to cover a significant proportion of published packages (and there's a long tail of rarely updated projects that may never catch up).
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
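
As an illustration of what a published metadata field would allow, the sketch below reads any `License-File` headers from a distribution's metadata with the standard-library `importlib.metadata` module. It assumes the field is actually present in the package's METADATA file, which many published packages may not provide. The helper name `declared_license_files` is hypothetical, and the values are returned uninterpreted because this thread does not say how they map to paths on disk.

```python
from importlib import metadata


def declared_license_files(dist):
    """Return the License-File header values declared in a distribution's metadata.

    `dist` is an importlib.metadata.Distribution, for example the result of
    metadata.distribution("Django"). An empty list means the metadata does not
    declare any license files.
    """
    # dist.metadata behaves like an email.message.Message; get_all returns
    # None when the header is absent, so fall back to an empty list.
    return dist.metadata.get_all("License-File") or []
```

A tool could try this first and fall back to a file-name heuristic only when the list comes back empty.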

Steve Jorgensen wrote, in reply to Nick Coghlan:
Thanks for your very informative & useful reply. :)

Steve, Nick:
On Sun, Feb 23, 2020 at 10:04 AM Steve Jorgensen <stevej@stevej.name> wrote:
Thanks for your very informative & useful reply. :)
Nick: What you are doing with the scancode and aboutcode toolkits seems super yummy and would likely be super useful elsewhere! If you think there is something we could extract to make it part of the tools, I am game to help. And I need to submit that draft PEP BTW :]
Steve: that PEP will eventually document the de facto, so far undocumented, behavior that includes license_file(s) in built wheels. The field already exists and is already supported, so it can be used. To Nick's point, it is going to take a long while to fix it all in the actual packages.
That said, I am also involved in an initiative to help along the way and hopefully will help take only 100 years instead of the original thousand years needed to fix the problem (See https://clearlydefined.io )
There:
1. We are scanning ALL the packages with scancode (Python + everything else if there is such a thing ;) ).
2. Licensing data quality is "scored" with this approach: https://github.com/clearlydefined/license-score/blob/master/ClearlyLicensedM... The license scoring includes whether the full license text is present in the package (which is your original concern).
3. Volunteers are reviewing that data for accuracy and correctness and fixing it if needed.
4. Eventually, fixes are pushed back upstream.
There is also a Google Summer of Code project (https://github.com/nexB/aboutcode/wiki/Project-Ideas-Improve-License-Detecti...) to do some large scale analysis of the 10M scans we have on hand.
Do not hesitate to reach out on or off list.
--
Cordially
Philippe Ombredanne
pom@nexb.com (scancode-toolkit maintainer)

On Sat, 7 Mar 2020 at 08:57, Philippe Ombredanne <pombredanne@nexb.com> wrote:
Nick: What you are doing with the scancode and aboutcode toolkits seems super yummy and would likely be super useful elsewhere! If you think there is something we could extract to make it part of the tools, I am game to help.
Alas, what I'm doing (/planning to do) isn't automated, it's just a matter of "Run scancode to find out what you have, run aboutcode to generate an attribution bundle, and do whatever data mangling you need to do in the middle to actually produce the ABOUT files". So the first pass is likely to be pretty tedious and time consuming for an already well-established project, but incremental maintenance afterwards shouldn't be too bad (add a new ABOUT file each time you ship a new dependency).
That said, I am also involved in an initiative to help along the way and hopefully will help take only 100 years instead of the original thousand years needed to fix the problem (See https://clearlydefined.io )
Oh, very cool!
Cheers,
Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (4)
- Ian Stapleton Cordasco
- Nick Coghlan
- Philippe Ombredanne
- Steve Jorgensen