[Distutils] [proposal] shared distribution installations

Nick Coghlan ncoghlan at gmail.com
Tue Oct 31 09:47:52 EDT 2017

On 31 October 2017 at 22:13, Leonardo Rochael Almeida <leorochael at gmail.com>
wrote:

> Those are issues that buildout has solved long before pip was even around,
> but they rely on sys.path expansion that Ronny found objectionable due to
> performance issues.

The combination of network drives and lots of sys.path entries could lead
to *awful* startup times with the old stat-based import model (which Python
2.7 still uses by default).

The import system in Python 3.3+ relies on cached os.listdir() results
instead. After we switched to that, we received at least one report from an
HPC operator whose batch jobs, which import modules from NFS, went from
taking 100+ seconds to start down to startup times measured in hundreds of
milliseconds. Most of the time was previously being lost to network round
trips for failed stat calls that just reported that the file didn't exist.
Even on spinning disks, the new import system gained back most of the speed
that was lost in the switch from low level C to more maintainable and
portable Python code.
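
To make the difference concrete, here is a rough sketch of the two lookup
strategies. It is not the actual import machinery: the function names and
the SUFFIXES tuple are made up for illustration, and the real finder in
importlib also checks each directory's mtime so it can refresh a stale
cache. The point is only to show where the calls go.

```python
import os

# Illustrative subset of file suffixes; the real import system checks more.
SUFFIXES = (".py", ".pyc", ".so")


def find_with_stat(name, path_entries):
    """Old-style probing: one stat call per candidate file per path entry.

    Returns (found_path_or_None, number_of_stat_calls).
    """
    calls = 0
    for entry in path_entries:
        for suffix in SUFFIXES:
            calls += 1
            candidate = os.path.join(entry, name + suffix)
            try:
                os.stat(candidate)
            except OSError:
                continue  # a failed stat is a wasted round trip on NFS
            return candidate, calls
    return None, calls


def find_with_listdir_cache(name, path_entries, cache):
    """Listing-based probing: one os.listdir() per path entry, then cached.

    Returns (found_path_or_None, number_of_listdir_calls).
    """
    calls = 0
    for entry in path_entries:
        if entry not in cache:
            calls += 1
            try:
                cache[entry] = set(os.listdir(entry))
            except OSError:
                cache[entry] = set()
        for suffix in SUFFIXES:
            if name + suffix in cache[entry]:
                return os.path.join(entry, name + suffix), calls
    return None, calls
```

With the stat-based version, every failed lookup costs one call per suffix
per directory. With the cached version, a directory is listed once and later
lookups against it are plain set membership tests.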

An org that runs large rendering farms also reported significantly
improving their batch job startup times in 2.7 by switching to importlib2
(which backports the Py3 import implementation).

> I don't think the performance issues are that problematic (and wasn't
> there some work on Python 3 that made import faster even with long
> sys.paths?).

As soon as you combined the old import model with network drives, your
startup times could quickly become intolerable, even with a short sys.path:
failing imports, and imports that only get satisfied later in the path,
just end up taking too long.

I wouldn't call it a *completely* solved problem in Py3 (there are still
some application startup related activities that scale linearly with the
length of sys.path), but the worst offender (X stat calls by Y sys.path
entries, taking Z milliseconds per call) is gone.
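
As a back-of-the-envelope illustration of that X * Y * Z cost, the helper
below just multiplies the three factors. The function name and the sample
numbers are invented for the example; they are not measurements from any of
the reports mentioned above.

```python
def worst_case_stat_seconds(stat_calls_per_entry, path_entries, ms_per_call):
    """Estimate time spent on stat calls: X calls * Y entries * Z ms each."""
    return stat_calls_per_entry * path_entries * ms_per_call / 1000.0


# Hypothetical numbers: 1000 stat calls per entry, 50 sys.path entries,
# 2 ms per call over the network.
estimate = worst_case_stat_seconds(1000, 50, 2.0)
print(f"{estimate:.0f} seconds spent waiting on stat calls")
```

The same workload at local-disk latencies (microseconds rather than
milliseconds per call) drops by orders of magnitude, which is why the
problem mostly showed up on network drives.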

> On 31 October 2017 at 05:22, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> [...]
>> However, there's another approach that specifically tackles the content
>> duplication problem, which would require a new installation layout as you
>> suggest, but could still rely on *.pth files to make it implicitly
>> compatible with existing packages and applications and existing Python
>> runtime versions.
>> That approach is to create an install tree somewhere that looks like this:
>>     _shared-packages/
>>         <normalised-package-name>/
>>             <release-version>/
>>                 <version-details>.dist-info/
>>                 <installed-files>
>> Instead of installing full packages directly into a venv the way pip
>> does, an installer that worked this way would instead manage a
>> <normalised-package-name>.pth file that indicated
>> "_shared-packages/<normalised-package-name>/<release-version>" should be
>> added to sys.path.
> This solution is nice, but preserves the long sys.path that Ronny wanted
> to avoid in the first place.
> Another detail that needs mentioning is that, for .pth based sys.path
> manipulation to work, the <installed-files> would need to be all the files
> from purelib and platlib directories from wheels mashed together instead of
> a simple unpacking of the wheel (though I guess the .pth file could add
> both purelib and platlib subfolders to sys.path...)

Virtual environments already tend to mash those file types together anyway
- it's mainly Linux system packages that separate them out.

> Another possibility that avoids the issue of a long sys.path is to use this
> layout but with symlink farms instead of either sys.path manipulation or
> conda-like hard-linking.
> Symlinks would better preserve the filesystem size visibility that Ronny
> wanted while allowing the layout above to contain wheels that were simply
> unzipped.

Yeah, one thing I really like about that install layout is that it
separates the question of "the installed package layout" from how that
package gets linked into a virtual environment. If you're only doing exact
version matches, then you can use symlinks quite happily, since you don't
need to cope with the name of the "dist-info" directory changing. However,
if you're going to allow for transparent maintenance updates (and hence
version number changes in the dist-info directory name), then you need a
*.pth file.
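
A minimal sketch of the *.pth variant, assuming the layout quoted above.
The helper names are hypothetical, and using PEP 503 style normalisation
for <normalised-package-name> is my assumption rather than part of the
proposal. No existing installer works this way.

```python
import os
import re


def normalise_name(name):
    """Normalise a project name (assumption: PEP 503 style rules)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def write_release_pth(site_dir, shared_root, name, version):
    """Point <normalised-name>.pth in site_dir at one shared release.

    Calling this again with a newer version rewrites the same file, which
    is how a maintenance update would be switched in without touching the
    shared tree itself.
    """
    norm = normalise_name(name)
    target = os.path.join(shared_root, norm, version)
    pth_path = os.path.join(site_dir, norm + ".pth")
    with open(pth_path, "w", encoding="utf-8") as f:
        # Each line of a .pth file that names an existing directory is
        # appended to sys.path when the site module processes site_dir.
        f.write(target + "\n")
    return pth_path
```

Because the .pth file only names the release directory, the dist-info
directory inside it can change name between versions without anything in
the virtual environment needing to know.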

> In Windows, where symlinks require admin privileges (though this is
> changing
> <https://blogs.windows.com/buildingapps/2016/12/02/symlinks-windows-10/>),
> an option could be provided for using hard links instead (which never
> require elevated privileges).

Huh, interesting - I never knew that Windows offered unprivileged hard link
support. I wonder if the venv module could be updated to offer that as an
alternative to copying when symlinks aren't available.
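
Something along these lines is what I have in mind. This is a hypothetical
helper, not what the venv module does today: try a symlink, fall back to a
hard link, and only copy as a last resort. Hard links only apply to
individual files on the same filesystem, so this sketch handles files
rather than whole directories.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Make dst refer to src's contents; return the strategy that worked.

    Tries a symlink first, then a hard link, then a plain copy.
    """
    for strategy, func in (("symlink", os.symlink), ("hardlink", os.link)):
        try:
            func(src, dst)
            return strategy
        except (OSError, NotImplementedError):
            # e.g. missing symlink privilege on Windows, or a filesystem
            # that doesn't support this kind of link
            continue
    shutil.copy2(src, dst)
    return "copy"
```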


Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia