[Distutils] Maintaining a curated set of Python packages

Nick Coghlan ncoghlan at gmail.com
Fri Dec 2 23:39:01 EST 2016

On 3 December 2016 at 03:34, Freddy Rietdijk <freddyrietdijk at fridh.nl> wrote:
> On Fri, Dec 2, 2016 at 4:33 PM, Robert T. McGibbon <rmcgibbo at gmail.com>
> wrote:
>> Isn't this issue already solved by (and the raison d'être of) the multiple
>> third-party Python redistributors, like the various OS package maintainers,
>> Continuum's Anaconda, Enthought Canopy, ActiveState Python, WinPython, etc?
> My intention is not creating yet another distribution. Instead, I want to
> see if there is interest in the different distributions on sharing some of
> the burden of curating by bringing up this discussion and seeing what is
> needed. These distributions have their recipes that allow them to build
> their packages using their tooling. What I propose is having some of that
> data community managed so the distributions can use that along with their
> tooling to build the eventual packages.

There's definitely interest in more automated curation such that
publishing through PyPI means you get pre-built binary artifacts and
compatibility testing for popular platforms automatically, but the
hard part of that isn't really the technical aspects, it's developing
a robust funding and governance model for the related sustaining
engineering activities.

That upstream component-level "yes, it builds" and "yes, it passes its
self-tests" data would then be useful to redistributors, since it would
make it straightforward to filter out releases that don't even build or
pass their own tests before they ever reach a downstream review.
> These are interesting issues you bring up here. What I seek is having a set
> that has per package a version, source, Python dependencies and build
> system. Other dependencies would be for now left out, unless someone has a
> good idea how to include those. Distributions can take this curated set and
> extend the data with their distribution specific things. For example, in Nix
> we could load such a set, map a function that builds the packages in the
> set, and override what is passed to the function when necessary (e.g. to add
> system dependencies, our patches, or how tests are invoked, and so on).
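As a rough sketch of that quoted idea (all field names, versions and URLs
below are invented for illustration, not a proposed schema), the shared
per-package data plus a distribution-specific override layer might look
something like:

```python
# Hypothetical community-curated set: per package, a version, source,
# Python dependencies and build system, as described above.
CURATED = {
    "requests": {
        "version": "2.12.1",
        "source": "https://files.example/requests-2.12.1.tar.gz",  # placeholder URL
        "depends": ["urllib3", "chardet", "idna", "certifi"],
        "build-system": "setuptools",
    },
}


def apply_overrides(curated, overrides):
    """Return a new set with distribution-specific fields merged in,
    leaving the shared curated data untouched."""
    merged = {}
    for name, meta in curated.items():
        entry = dict(meta)
        entry.update(overrides.get(name, {}))
        merged[name] = entry
    return merged


# A Nix-style override: add system dependencies and local patches on top
# of the shared data, without forking the curated set itself.
nix_overrides = {
    "requests": {
        "system-depends": ["openssl"],
        "patches": ["requests-no-bundled-certs.patch"],
    },
}

merged = apply_overrides(CURATED, nix_overrides)
```

The point of the two layers is that only the second one is
distribution-specific; the first can be maintained in common.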

Something that could be useful on that front is to mine the stdlib
documentation for "seealso" references to third party libraries and
collect them into an automation-friendly reference API. The benefit of
that approach is that it:

- would be immediately useful in its own right as a "stdlib++" definition
- solves the scope problem (the problem tackled has to be common
enough to have a default solution in the standard library, but complex
enough that there are recommended alternatives)
- solves the governance problem (the approval process for new entries
is to get them referenced from the relevant stdlib module's documentation)
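Mining those references could be quite mechanical. As a minimal sketch
(the sample text below is invented; a real run would walk CPython's
Doc/library/*.rst files, and real seealso entries vary in form):

```python
import re

# Fabricated stand-in for a stdlib doc page; illustrative only.
SAMPLE_RST = """\
.. seealso::

   The `requests package <https://pypi.python.org/pypi/requests>`_
   is recommended for a higher-level HTTP client interface.
"""


def find_seealso_projects(rst_text):
    """Return (link text, url) pairs for external links found inside
    reStructuredText seealso directive bodies."""
    pairs = []
    # Capture each seealso body: the run of indented or blank lines
    # immediately following the directive.
    for block in re.findall(r"\.\. seealso::\n((?:[ \t]+.*\n|\n)*)", rst_text):
        # Match reST external links of the form `text <url>`_
        pairs.extend(re.findall(r"`([^`<]+?)\s*<(https?://[^>]+)>`_", block))
    return pairs
```

Aggregating those pairs across the whole Doc/ tree would give the
"stdlib++" listing in an automation-friendly form.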

> Responsiveness is indeed an interesting issue. If there's enough backing,
> then I imagine security issues will be resolved as fast as they are nowadays
> by the distributions backing the initiative.

Not necessarily, as many of those responsiveness guarantees rely on
the ability of redistributors to carry downstream patches, even before
there's a corresponding upstream security release. This is especially
so for upstream projects that follow an as-needed release model,
without much (if any) automation of their publication process.

>> If a curation community *isn't* doing any of those things, then it isn't
>> adding a lot of value beyond folks just doing DIY integration in their CI
>> system by pinning their dependencies to particular versions.
> I would imagine that distributions that would support this idea would have a
> CI tracking packages built using the curated set and the
> distribution-specific changes. When there's an issue they could fix it at
> their side, or if it is something that might belong in the curated set, they
> would report the issue. At some point, when they would freeze, they would
> pin to a certain YYYY.MM and API breakage should not occur.

Yeah, this is effectively what happens already; it's just not
particularly visible outside the individual redistributor pipelines.

> libraries.io is a very interesting initiative. It seems they scan the
> contents of the archives and extract dependencies based on what is in the
> requirements files, which is often more than is actually needed for building
> and running the package. They would benefit from having a declarative style
> for the dependencies and build system, but that is another issue (PEP 517
> e.g.) than what I bring up here. We also have a tool that runs pip in a
> sandbox to determine the dependencies, and then provide us with an
> expression. It works, but it shouldn't be necessary.

Alas, with 94k+ setup.py-based packages already in the wild, arbitrary
code execution for dependency metadata generation is going to be with
us for a while. That said, centralised services like libraries.io
should let more folks simply reuse the already-collected dependency
data (even if it isn't the minimal dependency set) rather than having
to generate it themselves.
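For wheels, at least, the dependency data is already declarative: the
Requires-Dist fields can be read straight out of the archive's METADATA
file without executing any packaging code. A minimal sketch, using a
fabricated in-memory archive in place of a real wheel:

```python
import io
import zipfile
from email.parser import Parser

# Invented METADATA content for illustration; real wheels ship this
# file under <name>-<version>.dist-info/METADATA.
METADATA = """\
Metadata-Version: 2.0
Name: example
Version: 1.0
Requires-Dist: urllib3
Requires-Dist: idna
"""

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("example-1.0.dist-info/METADATA", METADATA)


def wheel_requires(wheel_bytes):
    """Extract the declared runtime dependencies from a wheel archive."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        meta_name = next(n for n in zf.namelist()
                         if n.endswith(".dist-info/METADATA"))
        # METADATA uses RFC 822-style headers, so the email parser works.
        msg = Parser().parsestr(zf.read(meta_name).decode("utf-8"))
    return msg.get_all("Requires-Dist") or []

print(wheel_requires(buf.getvalue()))  # ['urllib3', 'idna']
```

That is exactly the kind of already-collected, statically-readable data
a service can serve without re-running anyone's setup.py.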


Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
