[Distutils] Maintaining a curated set of Python packages

Freddy Rietdijk freddyrietdijk at fridh.nl
Thu Dec 15 07:13:28 EST 2016


It's interesting to read about how other distributions upgrade their
package sets. In Nixpkgs most packages are updated manually. Some
frameworks/languages provide their dependencies declaratively, in which
case it becomes 'straightforward' to include whole package sets, as in the
case of Haskell. Some expressions still need to be overridden manually,
e.g. because they require certain system libraries. It's manual work, but
not that much. This is what I would like to see for Python as well.

We still update the Python packages manually, although we have tools to
automate most of it. The reason for not using those tools is that a) it
means evaluating or building parts of the packages to get the dependencies,
and b) too often upstream pins versions of dependencies that turn out to be
entirely unnecessary, which would prevent an upgrade. Even so, we have over
1500 Python packages per interpreter version that, according to our CI,
seem to work together. We only build on 3 architectures (i386, amd64 and
darwin/osx). Compatibility with the latter is sometimes an issue, because
it involves guessing what Apple has changed when releasing a new version.
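
A minimal sketch of point a), using only the standard library: for a wheel
that is already on disk the dependencies can be read from its METADATA
file, but for an sdist there is no such static file and setup.py has to be
evaluated or the package built first. The wheel filename below is only an
example.

    import zipfile
    from email.parser import Parser

    def wheel_requirements(wheel_path):
        """Return the Requires-Dist entries from a wheel's METADATA."""
        with zipfile.ZipFile(wheel_path) as wheel:
            metadata_name = next(
                name for name in wheel.namelist()
                if name.endswith(".dist-info/METADATA")
            )
            metadata = Parser().parsestr(
                wheel.read(metadata_name).decode("utf-8"))
        return metadata.get_all("Requires-Dist") or []

    print(wheel_requirements("requests-2.12.4-py2.py3-none-any.whl"))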

> I'm not sure how useful it would be higher up the food chain, since those
> contexts will be different enough to cause both false positives and false
> negatives.  And it does often take quite a bit of focused engineering
> effort to monitor packages which don't promote (something we want to
> automate),

In my experience the manual work that typically needs to be done is a)
making Python dependencies available, b) unpinning versions when the pins
are unnecessary and reporting this upstream, and c) making sure the package
finds the system libraries. I have not yet encountered packages that could
not be upgraded because of the version of a system dependency. In my
proposal, a) and b) would be solved by the curated package set.
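
A minimal sketch of b), assuming a plain requirements-style list of pins;
in Nixpkgs the unpinning is actually done by patching the package
expression or its setup.py, but the idea is the same: relax exact pins so
that the curated set can substitute its own versions.

    import re

    PIN = re.compile(r"^(?P<name>[A-Za-z0-9_.-]+)==(?P<version>\S+)$")

    def relax_pins(requirements_text):
        """Turn 'pkg==1.2.3' lines into 'pkg>=1.2.3' lines."""
        relaxed = []
        for line in requirements_text.splitlines():
            match = PIN.match(line.strip())
            if match:
                line = "%(name)s>=%(version)s" % match.groupdict()
            relaxed.append(line)
        return "\n".join(relaxed)

    print(relax_pins("requests==2.12.4\nsix==1.10.0"))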

> https://github.com/nvie/pip-tools :
> - requirements.in -> pip-compile -> requirements.txt (~pipfile.lock)

Yep, there are tools, and they get the job done when developing and using a
set of packages. But now you want to deploy your app. You can't use your
requirements.txt on a Linux distribution, because the distribution has a
curated set of packages that is typically different from your set (although
it may provide tools to package the versions you need). You can choose to
use a second package manager instead, pip or conda. System dependencies?
That's something conda can somewhat take care of. Problem solved, you would
say, except that now you need multiple package managers.

> Practically, a developer would want a subset of the given known-good-set
> (and then additional packages), so:
>
> - fork/copy requirements-YYYY-MM-REV--<OSNAMEVER>.txt
> - #comment out unused deps
> - add '-r addl-requirements.txt'

See the link I shared earlier on how this is already done with Haskell and
stack.yaml, and how it could be used with `pipfile`:
https://github.com/pypa/pipfile/issues/10#issuecomment-262229620
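
To make the analogy concrete, here is a minimal sketch (all names are
hypothetical, and this is not the pipfile design itself) of resolving an
application's direct dependencies against a named, curated snapshot, in the
spirit of stack.yaml's "resolver: lts-x.y":

    # A curated snapshot: every package in it is known to work with the others.
    SNAPSHOTS = {
        "pyset-2016-12": {
            "requests": "2.12.4",
            "numpy": "1.11.3",
            "flask": "0.11.1",
        },
    }

    def resolve(requirements, snapshot_name):
        """Map requested package names to the versions pinned in the snapshot."""
        snapshot = SNAPSHOTS[snapshot_name]
        resolved = {}
        for name in requirements:
            if name not in snapshot:
                raise KeyError("%s is not part of snapshot %s"
                               % (name, snapshot_name))
            resolved[name] = snapshot[name]
        return resolved

    # The application only names its direct dependencies; the snapshot
    # decides which exact versions end up being installed.
    print(resolve(["requests", "flask"], "pyset-2016-12"))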

> Putting the conclusion first, I do see value in better publicising
> "Recommended libraries" based on some automated criteria like:

Yes, we should recommend third-party libraries in a trusted place like the
documentation of CPython. The number of packages that are available can be
overwhelming. Yet defining a set of packages that are recommended, and that
perhaps work together, is still far from defining an exact set of packages
that are known to work together, which is what I proposed here.

> As pointed out by others, there are external groups doing "curating".
conda-forge is one such project, so I'll comment from that perspective

I haven't used conda in a long time, and conda-forge didn't exist back
then. I see that package versions are pinned, and sometimes the versions of
their dependencies as well. If I choose to install *all* the packages
available via conda-forge, will I get a fixed package set, or will the SAT
solver try to find a working set (and possibly fail at it)? I hope it is
the former; if it is the latter, then it is not curated in the sense I
meant.


On Thu, Dec 15, 2016 at 6:22 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 15 December 2016 at 03:41, Chris Barker <chris.barker at noaa.gov> wrote:
> [Barry wrote]
> >> Ubuntu has an elaborate automated system for testing some dimension of
> >> compatibility issues between packages, not just Python packages.  Debian
> >> has
> >> the same system but isn't gated on the results.
> >
> > This brings up the larger issue -- PyPi is inherently different than
> these
> > efforts -- PyPi has always been about each package author maintaining
> their
> > own package -- Linux distros and conda-forge, and ??? all have a small
> set
> > of core contributions that do the package maintenance.
>
> Fedora at least has just shy of 1900 people in the "packager" group,
> so I don't know that "small" is the right word in absolute terms :)
>
> However, relatively speaking, even a packager group that size is still
> an order of magnitude smaller than the 30k+ publishers on PyPI (which
> is in turn an order of magnitude smaller than the 180k+ registered
> PyPI accounts)
>
> > This is a large
> > effort, and would be insanely hard with the massive amount of stuff on
> > PyPi....
> >
> > In fact, I think the kinda-sort curation that comes from individual
> > communities is working remarkably well:
> >
> > the scipy community
> > the django community
> > ...
>
> Exactly. Armin Ronacher and a few others have also started a new
> umbrella group on GitHub, Pallets, collecting together some of the key
> infrastructure projects in the Flask ecosystem:
> https://www.palletsprojects.com/blog/hello/
>
> Dell/EMC's John Mark Walker has a recent article about this
> "downstream distribution" formation process on opensource.com, where
> it's an emergent phenomenon arising from the needs of people that are
> consuming open source components to achieve some particular purpose
> rather than working on them for their own sake:
> https://opensource.com/article/16/12/open-source-software-supply-chain
>
> It's a fairly different activity from pure upstream development -
> where upstream is a matter of "design new kinds and versions of Lego
> bricks" (e.g. the Linux kernel, gcc, CPython, PyPI projects),
> downstream integration is more "define new Lego kits using the already
> available bricks" (e.g. Debian, Fedora, conda-forge), while commercial
> product and service development is "We already put the Lego kit
> together for you, so you can just use it" (e.g. Ubuntu, RHEL, Amazon
> Linux, ActivePython, Enthought Canopy, Wakari.io).
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
> _______________________________________________
> Distutils-SIG maillist  -  Distutils-SIG at python.org
> https://mail.python.org/mailman/listinfo/distutils-sig
>