[proposal] shared distribution installations

Hi everyone,
For a while now, various details of installing Python packages into virtualenvs have been causing me grief:
a) typically each tox folder in a project is massive and has a lot of duplicate files; recreating, managing, and iterating them takes quite a while
b) for nicely separated deployments, each virtualenv for an application takes a few hundred megabytes - that can quickly saturate disk space even if a reasonable amount was reserved
c) installation and recreation of virtualenvs with the same set of packages takes quite a while (even with pip caches this is slow, and there is no good reason it couldn't be practically instantaneous)
In order to alleviate those issues, I would like to propose a new installation layout where, instead of storing each distribution separately in every Python environment, all distributions would share a common storage, and each individual environment would only hold references to the packages that were "installed/activated" for it.
This would massively reduce both the time required to create the contents of an environment and the space those contents take up.
Since blindly expanding sys.path would lead to performance issues similar to those seen with setuptools/buildout multi-version installs, this mechanism would also need an element on sys.meta_path that handles inexpensive dispatch to the top-level modules and metadata files of each package (offhand, I assume a linear walk over hundreds of entries simply isn't that effective).
However, some experimentation would be needed to see what trade-off is sensible there.
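To make the idea concrete, here is a very rough sketch of the kind of sys.meta_path hook meant above; the SHARED_TOPLEVELS manifest and the shared-storage paths are made up for illustration, not part of any existing tool:

    import importlib.abc
    import importlib.machinery
    import sys

    # Hypothetical per-environment manifest, built at "activation" time:
    # top-level importable name -> shared-storage directory containing it.
    SHARED_TOPLEVELS = {
        "requests": "/opt/python/_shared-packages/requests/2.18.4",
        "urllib3": "/opt/python/_shared-packages/urllib3/1.22",
    }

    class SharedStorageFinder(importlib.abc.MetaPathFinder):
        def find_spec(self, fullname, path=None, target=None):
            if path is not None:
                # Submodule imports are resolved through the parent package's
                # __path__ by the regular machinery, so they are not ours.
                return None
            location = SHARED_TOPLEVELS.get(fullname.partition(".")[0])
            if location is None:
                return None
            # Dict lookup instead of a linear walk over sys.path: only the one
            # directory that actually holds this distribution gets searched.
            return importlib.machinery.PathFinder.find_spec(fullname, [location])

    sys.meta_path.insert(0, SharedStorageFinder())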
I hope this mail will spark enough discussion to enable the creation of a PEP and a prototype.
Best, Ronny

Hi Ronny,
What you describe here is, as you know, basically what the Nix package manager [1] does. You could create something similar specifically for Python, like e.g. `ied` [2] is for Node, or Spack [3], which is written in Python. But then how are you going to deal with other system libraries, and impurities? And you will have to deal with them, because depending on how you configure the Python packages that depend on them (say `numpy`), their output will be different. Or would you choose to ignore this?
Freddy
[1] https://nixos.org/nix/
[2] https://github.com/alexanderGugel/ied
[3] https://spack.io/

Hi Freddy,
I'm well aware of what Nix currently does for Python packages, and have suffered my fair share from it.
What I want to do is simply take the wheels that pip currently generates and unpacks into each environment, and unpack them instead into a location where each environment shares the unpacked files more directly.
I'm not going to expand upon my perceived shortcomings of Nix as I know it, since they are irrelevant to this discussion and not something I have the time and motivation to fix.
As far as impurities go, the behaviour I aim for would be mostly like virtualenv, but without the file duplication.
I believe Nix could also benefit from parts of such a mechanism.
-- Ronny

the behaviour I aim for would be mostly like virtualenv, but without the file duplication.
For what it's worth, conda environments use hard links where possible, which limits duplication...
Maybe conda would solve your problem...
-CHB

On Mon, Oct 30, 2017, at 07:16 PM, RonnyPfannschmidt wrote:
In order to alleviate those issues, I would like to propose a new installation layout where, instead of storing each distribution separately in every Python environment, all distributions would share a common storage, and each individual environment would only hold references to the packages that were "installed/activated" for it.
This is also essentially what conda does - the references being in the form of hard links. The mechanism has some drawbacks of its own - e.g. if a file somehow gets modified, it's harder to fix, because removing the environment no longer removes the files.
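A tiny illustration of that drawback (paths and file names invented for the example) - with hard links, "editing" a file inside one environment silently edits the shared copy as well:

    import os
    import tempfile

    store = tempfile.mkdtemp()
    shared = os.path.join(store, "module.py")      # the shared/cache copy
    in_env = os.path.join(store, "env_module.py")  # the "environment" copy

    with open(shared, "w") as f:
        f.write("VALUE = 1\n")
    os.link(shared, in_env)                        # conda-style hard link

    with open(in_env, "a") as f:                   # "fix" the file in the env
        f.write("VALUE = 2\n")

    with open(shared) as f:
        print(f.read())               # the shared copy changed too
    print(os.stat(shared).st_nlink)   # 2 - deleting the env copy frees nothing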
Thomas

I would like to explicitly avoid hard-link farms, because those still have "logical" duplication. I'd like to bring the new paths in without having each virtualenv look like 400-1000 MB of distinct data.
-- Ronny

On 31 October 2017 at 05:16, RonnyPfannschmidt <opensource@ronnypfannschmidt.de> wrote:
In order to alleviate those issues, I would like to propose a new installation layout where, instead of storing each distribution separately in every Python environment, all distributions would share a common storage, and each individual environment would only hold references to the packages that were "installed/activated" for it.
I've spent a fair bit of time pondering this problem (since distros care about it in relation to ease of security updates), and the combination of Python's import semantics with the PEP 376 installation database semantics makes it fairly tricky to improve. Fortunately, the pth-file mechanism provides an escape hatch that makes it possible to transparently experiment with different approaches.
At the venv management layer, pew already supports a model similar to that offered by the Flatpak application container format [1]: instead of attempting to share everything, pew permits a limited form of "virtual environment inheritance", via "pew add $(pew dir <named-venv-to-depend-on>)" (which injects a *.pth file that appends the other venv's site-packages directory to sys.path). Those inherited runtimes then become the equivalent of the runtime layer in Flatpak: applications will automatically pick up new versions of the runtime, so the runtime maintainers are expected to strictly preserve backwards compatibility, and when that isn't possible, provide a new parallel-installable version, so apps using both the old and the new runtime can happily run side-by-side.
The idea behind that approach is to accept a bit of inflexibility in the exact versions of some of your dependencies in exchange for a reduction in data duplication on systems running multiple applications or environments: instead of specifying your full dependency set, you'd only specify that you depend on a particular common computational environment being available, plus whatever you need that isn't part of the assumed platform.
As semi-isolated-applications-with-a-shared-runtime mechanisms like Flatpak gain popularity (vs fully isolated application & service silos), I'd expect this model to start making more of an appearance in the Linux distro world, as it's a natural way of mapping per-application venvs to the shared runtime model, and it doesn't require any changes to installers or applications to support it.
However, there's another approach that specifically tackles the content duplication problem, which would require a new installation layout as you suggest, but could still rely on *.pth files to make it implicitly compatible with existing packages and applications and existing Python runtime versions.
That approach is to create an install tree somewhere that looks like this:
    _shared-packages/
        <normalised-package-name>/
            <release-version>/
                <version-details>.dist-info/
                <installed-files>
Instead of installing full packages directly into a venv the way pip does, an installer that worked this way would instead manage a <normalised-package-name>.pth file that indicated "_shared-packages/<normalised-package-name>/<release-version>" should be added to sys.path. Each shared package directory could include references back to all of the venvs where it has been installed, allowing it to be removed when either all of those have been updated to a new version, or else removed entirely. This is actually a *lot* like the way pkg_resources.requires() and self-contained egg directories work, but with the version selection shifted to the venv's site-packages directory, rather than happening implicitly in Python code on application startup.
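As a strawman, the "activation" step described above might look something like this (the SHARED_ROOT location, file names, and the back-reference format are purely illustrative, not an existing installer API):

    import os

    SHARED_ROOT = "/opt/python/_shared-packages"

    def activate(venv_site_packages, name, version):
        """Expose an already-unpacked shared release inside one venv."""
        release_dir = os.path.join(SHARED_ROOT, name, version)

        # 1. Point the venv at the shared release via a *.pth file.
        with open(os.path.join(venv_site_packages, name + ".pth"), "w") as f:
            f.write(release_dir + "\n")

        # 2. Record a back-reference so the shared release knows which venvs
        #    still rely on it (enabling safe removal later on).
        with open(os.path.join(release_dir, ".referenced-by"), "a") as f:
            f.write(venv_site_packages + "\n")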
An interesting point about this layout is that it would be amenable to a future enhancement that allowed for more relaxed MAJOR and MAJOR.MINOR qualifiers on the install directory references, permitting transparently shared maintenance and security updates.
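For example (purely illustrative; the symlinked MAJOR.MINOR directory is just one way the relaxed qualifier could be realised), a venv that opted into transparent maintenance updates might reference a MAJOR.MINOR path that the installer repoints as new patch releases land:

    _shared-packages/
        requests/
            2.18 -> 2.18.5/   # repointed when a maintenance release is installed
            2.18.4/
            2.18.5/

    # <venv>/site-packages/requests.pth then simply contains:
    /opt/python/_shared-packages/requests/2.18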
The big downside of this layout is that it means you lose the ability to just bundle up an entire directory and unpack it on a different machine to get a probably-mostly-working environment. This means that while it's likely better for managing lots of environments on a single workstation (due to the reduced file duplication), it's likely to be worse for folks that work on only a handful of different projects at any given point in time (and I say that as someone with ~140 different local repository clones across my ~/devel, ~/fedoradevel and ~/rhdevel directories).
Cheers, Nick.
[1] http://docs.flatpak.org/en/latest/introduction.html#how-it-works

Hi,
On 31 October 2017 at 05:22, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 31 October 2017 at 05:16, RonnyPfannschmidt < opensource@ronnypfannschmidt.de> wrote:
For a while now, various details of installing Python packages into virtualenvs have been causing me grief:
a) typically each tox folder in a project is massive and has a lot of duplicate files; recreating, managing, and iterating them takes quite a while
b) for nicely separated deployments, each virtualenv for an application takes a few hundred megabytes - that can quickly saturate disk space even if a reasonable amount was reserved
c) installation and recreation of virtualenvs with the same set of packages takes quite a while (even with pip caches this is slow, and there is no good reason it couldn't be practically instantaneous)
Those are issues that buildout solved long before pip was even around, but its solution relies on the sys.path expansion that Ronny found objectionable due to performance issues.
I don't think the performance issues are that problematic (and wasn't there some work on Python 3 that made imports faster even with a long sys.path?).
[...]
However, there's another approach that specifically tackles the content duplication problem, which would require a new installation layout as you suggest, but could still rely on *.pth files to make it implicitly compatible with existing packages and applications and existing Python runtime versions.
That approach is to create an install tree somewhere that looks like this:
    _shared-packages/
        <normalised-package-name>/
            <release-version>/
                <version-details>.dist-info/
                <installed-files>
Instead of installing full packages directly into a venv the way pip does, an installer that worked this way would instead manage a <normalised-package-name>.pth file that indicated "_shared-packages/<normalised-package-name>/<release-version>" should be added to sys.path.
This solution is nice, but preserves the long sys.path that Ronny wanted to avoid in the first place.
Another detail that needs mentioning: for .pth based sys.path manipulation to work, the <installed-files> would need to be all the files from the purelib and platlib directories of the wheel mashed together, rather than a simple unpacking of the wheel (though I guess the .pth file could add both the purelib and platlib subfolders to sys.path...).
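Concretely, that second option might look like this (a made-up example of one shared release that keeps the two file types apart, exposed through a two-line .pth):

    _shared-packages/
        sample-dist/
            1.0/
                sample_dist-1.0.dist-info/
                purelib/
                    sample/ ...
                platlib/
                    _sample_ext.cpython-36m-x86_64-linux-gnu.so

    # <venv>/site-packages/sample-dist.pth:
    /opt/python/_shared-packages/sample-dist/1.0/purelib
    /opt/python/_shared-packages/sample-dist/1.0/platlib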
Another possibility that avoids the issue of a long sys.path is to use this layout, but with symlink farms instead of either sys.path manipulation or conda-like hard linking.
Symlinks would preserve the better filesystem size visibility that Ronny wanted, while allowing the layout above to contain wheels that were simply unzipped.
In Windows, where symlinks require admin privileges (though this is changing https://blogs.windows.com/buildingapps/2016/12/02/symlinks-windows-10/), an option could be provided for using hard links instead (which never require elevated privileges).
Using symlinks into the above layout preserves all the advantages and drawbacks Nick mentioned, other than the sys.path expansion.
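A minimal sketch of that symlink-farm variant (the function name and layout are hypothetical); per the note above, os.link() on individual files could serve as the Windows fallback where symlinks are unavailable:

    import os

    def link_release(release_dir, venv_site_packages):
        """Symlink each top-level entry of a shared, unpacked release into a venv."""
        for entry in os.listdir(release_dir):
            os.symlink(os.path.join(release_dir, entry),
                       os.path.join(venv_site_packages, entry))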
Regards,
Leo

On 31 October 2017 at 22:13, Leonardo Rochael Almeida <leorochael@gmail.com> wrote:
Those are issues that buildout solved long before pip was even around, but its solution relies on the sys.path expansion that Ronny found objectionable due to performance issues.
The combination of network drives and lots of sys.path entries could lead to *awful* startup times with the old stat-based import model (which Python 2.7 still uses by default).
The import system in Python 3.3+ relies on cached os.listdir() results instead, and after we switched to that, we received at least one report from an HPC operator whose batch jobs, which had been taking 100+ seconds to start when importing modules from NFS, dropped down to startup times measured in hundreds of milliseconds - most of the time was previously being lost to network round trips for failed stat calls that just reported that the file didn't exist. Even on spinning disks, the new import system gained back most of the speed that was lost in the switch from low-level C to more maintainable and portable Python code.
An org that runs large rendering farms also reported significantly improving their batch job startup times in 2.7 by switching to importlib2 (which backports the Py3 import implementation).
I don't think the performance issues are that problematic (and wasn't there some work on Python 3 that made imports faster even with a long sys.path?).
As soon as you combined the old import model with network drives, your startup times could quickly become intolerable, even with a short sys.path - failing imports, and imports that get satisfied later in the path, just end up taking too long.
I wouldn't call it a *completely* solved problem in Py3 (there are still some application startup related activities that scale linearly with the length of sys.path), but the worst offender (X stat calls by Y sys.path entries, taking Z milliseconds per call) is gone.
On 31 October 2017 at 05:22, Nick Coghlan <ncoghlan@gmail.com> wrote:
[...]
Instead of installing full packages directly into a venv the way pip does, an installer that worked this way would instead manage a <normalised-package-name>.pth file that indicated "_shared-packages/<normalised-package-name>/<release-version>" should be added to sys.path.
This solution is nice, but preserves the long sys.path that Ronny wanted to avoid in the first place.
Another detail that needs mentioning: for .pth based sys.path manipulation to work, the <installed-files> would need to be all the files from the purelib and platlib directories of the wheel mashed together, rather than a simple unpacking of the wheel (though I guess the .pth file could add both the purelib and platlib subfolders to sys.path...).
Virtual environments already tend to mash those file types together anyway - it's mainly Linux system packages that separate them out.
Another possibility that avoids the issue of a long sys.path is to use this layout, but with symlink farms instead of either sys.path manipulation or conda-like hard linking.
Symlinks would preserve the better filesystem size visibility that Ronny wanted, while allowing the layout above to contain wheels that were simply unzipped.
Yeah, one thing I really like about that install layout is that it separates the question of "the installed package layout" from how that package gets linked into a virtual environment. If you're only doing exact version matches, then you can use symlinks quite happily, since you don't need to cope with the name of the "dist-info" directory changing. However, if you're going to allow for transparent maintenance updates (and hence version number changes in the dist-info directory name), then you need a *.pth file.
In Windows, where symlinks require admin privileges (though this is changing https://blogs.windows.com/buildingapps/2016/12/02/symlinks-windows-10/), an option could be provided for using hard links instead (which never require elevated privileges).
Huh, interesting - I never knew that Windows offered unprivileged hard link support. I wonder if the venv module could be updated to offer that as an alternative to copying when symlinks aren't available.
Cheers, Nick.
participants (8)
- Chris Barker - NOAA Federal
- Freddy Rietdijk
- Leonardo Rochael Almeida
- Nick Coghlan
- Ronny Pfannschmidt
- Ronny Pfannschmidt
- RonnyPfannschmidt
- Thomas Kluyver