[Distutils] [proposal] shared distribution installations

Nick Coghlan ncoghlan at gmail.com
Tue Oct 31 03:22:03 EDT 2017


On 31 October 2017 at 05:16, RonnyPfannschmidt <
opensource at ronnypfannschmidt.de> wrote:

> Hi everyone,
>
> since a while now various details of installing python packages in
> virtualenvs caused me grief
>
> a) typically each tox folder in a project is massive, and has a lot of
> duplicate files, recreating them, managing and iterating them takes
> quite a while
> b) for nicely separated deployments, each virtualenv for an application
> takes a few hundred megabytes - that quickly can saturate disk space
> even if a reasonable amount was reserved
> c) installation and recreation of virtualenvs with the same set of
> packages takes quite a while (even with pip caches this is slow, and
> there is no good reason to avoid making it completely instantaneous)
>
> in order to alleviate those issues i would like to propose a new
> installation layout,
> where instead of storing each distribution in every python all
> distributions would share a storage, and each individual environment
> would only have references to the packages that were
> "installed/activated" for them
>

I've spent a fair bit of time pondering this problem (since distros care
about it in relation to ease of security updates), and the combination of
Python's import semantics with the PEP 376 installation database semantics
makes it fairly tricky to improve. Fortunately, the pth-file mechanism
provides an escape hatch that makes it possible to transparently experiment
with different approaches.

At the venv management layer, pew already supports a model similar to that
offered by the Flatpak application container format [1]: instead of
attempting to share everything, pew permits a limited form of "virtual
environment inheritance", via "pew add $(pew dir
<named-venv-to-depend-on>)" (which injects a *.pth file that appends the
other venv's site-packages directory to sys.path). Those inherited runtimes
then become the equivalent of the runtime layer in Flatpak: applications
will automatically pick up new versions of the runtime, so the runtime
maintainers are expected to strictly preserve backwards compatibility, and
when that isn't possible, provide a new parallel-installable version, so
apps using both the old and the new runtime can happily run side-by-side.
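
As a rough illustration of the *.pth mechanism this relies on (a minimal
sketch, not pew's actual implementation; the helper name
inherit_site_packages and the file name "inherited.pth" are made up for
the example), the following writes a .pth file into one site-packages
directory and lets the standard site module process it:

```python
import site
import sys
import tempfile
from pathlib import Path


def inherit_site_packages(child_site: Path, parent_site: Path,
                          name: str = "inherited") -> Path:
    """Write a .pth file into child_site that points at parent_site.

    Hypothetical helper: the function and file names are illustrative only.
    """
    pth_file = child_site / f"{name}.pth"
    # Each plain line in a .pth file names a directory to append to sys.path.
    pth_file.write_text(f"{parent_site}\n", encoding="utf-8")
    return pth_file


def demo() -> bool:
    """Return True if the parent directory ends up on sys.path."""
    with tempfile.TemporaryDirectory() as tmp:
        parent = Path(tmp, "runtime", "site-packages")
        child = Path(tmp, "app", "site-packages")
        parent.mkdir(parents=True)
        child.mkdir(parents=True)
        inherit_site_packages(child, parent)

        saved = list(sys.path)
        try:
            # addsitedir() adds the directory and processes its .pth files,
            # which is what happens for site-packages at interpreter startup.
            site.addsitedir(str(child))
            return str(parent) in sys.path
        finally:
            sys.path[:] = saved  # leave the running interpreter untouched
```

The inherited directory is appended after the child's own site-packages,
so anything installed directly in the application venv takes precedence
over the shared runtime.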

The idea behind that approach is to trade off a bit of inflexibility in the
exact versions of some of your dependencies for the benefit of a reduction
in data duplication on systems running multiple applications or
environments: instead of specifying your full dependency set, you'd instead
only specify that you depended on a particular common computational
environment being available, plus whatever you needed that isn't part of
the assumed platform.

As semi-isolated-applications-with-a-shared-runtime mechanisms like Flatpak
gain popularity (vs fully isolated application & service silos), I'd expect
this model to start making more of an appearance in the Linux distro world,
as it's a natural way of mapping per-application venvs to the shared
runtime model, and it doesn't require any changes to installers or
applications to support it.

However, there's another approach that specifically tackles the content
duplication problem, which would require a new installation layout as you
suggest, but could still rely on *.pth files to make it implicitly
compatible with existing packages and applications and existing Python
runtime versions.

That approach is to create an install tree somewhere that looks like this:

    _shared-packages/
        <normalised-package-name>/
            <release-version>/
                <version-details>.dist-info/
                <installed-files>

Instead of installing full packages directly into a venv the way pip does,
an installer that worked this way would instead manage a
<normalised-package-name>.pth file that indicated
"_shared-packages/<normalised-package-name>/<release-version>" should be
added to sys.path. Each shared package directory could include references
back to all of the venvs where it has been installed, allowing it to be
removed once every one of those venvs has either been updated to a new
version or been removed entirely. This is actually a *lot* like the way
pkg_resources.requires() and self-contained egg directories work, but with
the version selection shifted to the venv's site-packages directory, rather
than happening implicitly in Python code on application startup.
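
To make the bookkeeping concrete, here is a minimal sketch of what such an
installer's activate/deactivate steps might look like. Everything in it is
an assumption made for illustration: the function names, the
"_venv-users.json" back-reference file, and the use of PEP 503 style name
normalisation are not part of any existing tool.

```python
import json
import re
from pathlib import Path

USERS_FILE = "_venv-users.json"  # hypothetical back-reference file name


def normalise(name: str) -> str:
    # PEP 503 style normalisation; assumed here, other schemes would work.
    return re.sub(r"[-_.]+", "-", name).lower()


def _read_users(release_dir: Path) -> set:
    users_file = release_dir / USERS_FILE
    if users_file.exists():
        return set(json.loads(users_file.read_text(encoding="utf-8")))
    return set()


def _write_users(release_dir: Path, users: set) -> None:
    (release_dir / USERS_FILE).write_text(
        json.dumps(sorted(users)), encoding="utf-8")


def activate(shared_root: Path, venv_site: Path,
             name: str, version: str) -> Path:
    """Point venv_site at an already-unpacked shared release directory."""
    release_dir = shared_root / normalise(name) / version
    if not release_dir.is_dir():
        raise FileNotFoundError(release_dir)
    # The per-package .pth file is what selects the version for this venv.
    pth_file = venv_site / f"{normalise(name)}.pth"
    pth_file.write_text(f"{release_dir}\n", encoding="utf-8")
    # Record the back-reference so unused releases can be found later.
    _write_users(release_dir, _read_users(release_dir) | {str(venv_site)})
    return release_dir


def deactivate(shared_root: Path, venv_site: Path,
               name: str, version: str) -> bool:
    """Drop the venv's reference; return True if the release is now unused."""
    release_dir = shared_root / normalise(name) / version
    (venv_site / f"{normalise(name)}.pth").unlink()
    users = _read_users(release_dir) - {str(venv_site)}
    _write_users(release_dir, users)
    return not users
```

A True result from deactivate() only signals that no venv references the
release any more; whether to delete the directory at that point or keep it
as a cache is left to the caller.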

An interesting point about this layout is that it would be amenable to a
future enhancement that allowed for more relaxed MAJOR and MAJOR.MINOR
qualifiers on the install directory references, permitting transparently
shared maintenance and security updates.
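
One way such relaxed qualifiers could be resolved is sketched below. It is
only an illustration under simplifying assumptions: release directories
are assumed to be named with purely numeric dotted versions such as
"1.4.2", and the function names parse_version and resolve are
hypothetical.

```python
from pathlib import Path
from typing import Optional


def parse_version(text: str) -> tuple:
    # Assumes purely numeric dotted versions; raises ValueError otherwise.
    return tuple(int(part) for part in text.split("."))


def resolve(package_dir: Path, qualifier: str) -> Optional[Path]:
    """Return the newest release directory matching a MAJOR or MAJOR.MINOR
    qualifier, or None if nothing matches."""
    wanted = parse_version(qualifier)
    best_version = None
    best_dir = None
    for entry in package_dir.iterdir():
        if not entry.is_dir():
            continue
        try:
            version = parse_version(entry.name)
        except ValueError:
            continue  # skip directories that are not numeric versions
        # A release matches when the qualifier is a prefix of its version.
        if version[:len(wanted)] != wanted:
            continue
        if best_version is None or version > best_version:
            best_version, best_dir = version, entry
    return best_dir
```

With releases 1.4.0, 1.4.2 and 1.5.1 installed, a venv that referenced "1"
would pick up 1.5.1, while one that referenced "1.4" would stay on the
1.4.x series and only receive 1.4.2.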

The big downside of this layout is that it means you lose the ability to
just bundle up an entire directory and unpack it on a different machine to
get a probably-mostly-working environment. This means that while it's
likely better for managing lots of environments on a single workstation
(due to the reduced file duplication), it's likely to be worse for folks
that work on only a handful of different projects at any given point in
time (and I say that as someone with ~140 different local repository clones
across my ~/devel, ~/fedoradevel and ~/rhdevel directories).

Cheers,
Nick.

[1] http://docs.flatpak.org/en/latest/introduction.html#how-it-works

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia