[Distutils] namespace packages

David Cournapeau cournape at gmail.com
Fri Apr 23 10:30:50 CEST 2010


On Fri, Apr 23, 2010 at 4:51 PM, Tarek Ziadé <ziade.tarek at gmail.com> wrote:
> On Fri, Apr 23, 2010 at 9:23 AM, David Cournapeau <cournape at gmail.com> wrote:
>> On Fri, Apr 23, 2010 at 2:03 PM, P.J. Eby <pje at telecommunity.com> wrote:
>>> At 10:16 AM 4/23/2010 +0900, David Cournapeau wrote:
>>>>
>>>> In my case, it is not even the issue of many eggs (I always install
>>>> things with --single-version-externally-managed and I forbid any code
>>>> to write into  easy_install.pth). Importing pkg_resources alone
>>>> (python -c "import pkg_resources") takes half a second on my netbook.
>>>
>>> I find that weird, to say the least.  On my desktop just now, with a
>>> sys.path 79 entries long (including 41 .eggs), it's a "blink and you missed
>>> it" operation.  I'm curious what the difference might be.
>>>
>>> (Running timeit -s 'import pkg_resources' 'reload(pkg_resources)' gives a
>>> timing result of 61.9 milliseconds for me.)
>>
>> I should re-emphasize that the half-second number was on a netbook,
>> which is a very weak machine on every account (CPU, memory size and
>> disk capabilities). But using pkg_resources for console_scripts in the
>> package I am working on made a big difference (more time is spent
>> importing pkg_resources than everything else). Since we are talking
>> about import times, I guess the issue is the same as for namespace
>> packages. I have noticed this slow behavior on every machine I have
>> ever had my hands on, be it mine or someone else's, on Linux, Windows or
>> Mac OS X.
>>
>> My (limited) understanding of pkg_resources is that it scales
>> linearly with the number of packages it is aware of, and that it needs
>> to scan a few directories for every package. Importing pkg_resources
>> causes many more syscalls than relatively big packages (~ 1000 for
>> python -c "", 3000 for importing one of numpy/wx/gtk, 6000 for
>> pkg_resources). Assuming those are unavoidable (and the current
>> namespace implementation in setuptools requires it, right ?), I don't
>> see a way to reduce that cost significantly,
>
> There's a memory cache though, that probably makes it faster already.

Sure - I should have mentioned that all those approximate numbers are
for the hot cache. In my experience, unless some package does
CPU-intensive stuff at import time, the cost is mostly a function of
the number of syscalls (the correlation between syscall count and import
time is pretty strong) and of regex usage. The profile_import tool
(available from the bzr project) is quite useful to detect those cases.
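
To make the measurement concrete, here is a minimal sketch of how the
import cost can be compared against a bare interpreter start. The
function names (startup_time, import_cost) are made up for illustration,
and the figures it reports are wall-clock times only, not syscall counts.

```python
import subprocess
import sys
import time


def startup_time(statement="", repeat=5):
    """Return the best wall-clock time (seconds) to run `python -c statement`.

    Each run uses a fresh interpreter, so the import is never served from
    sys.modules; the OS file cache is hot after the first run.
    """
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", statement], check=True)
        best = min(best, time.perf_counter() - start)
    return best


def import_cost(module, repeat=5):
    """Estimate the extra time spent importing `module` (hypothetical helper)."""
    baseline = startup_time("", repeat)
    with_import = startup_time("import " + module, repeat)
    return with_import - baseline
```

Calling import_cost("pkg_resources") requires setuptools to be installed,
and the result can be noisy for cheap imports. On Linux, the syscall
counts can be obtained separately with strace -c.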

> Now if we had a way to know that a directory tree hasn't changed on
> the system, a persistent cache would dramatically reduce the work.
> Unfortunately I think this is impossible unless we watch the directories
> (and even then, this would be quite hard to implement).

This all sounds complicated. pkg_resources is already complicated
enough (the other big reason why I don't use it in any of my
packages). Without a clear specification of what pkg_resources or its
successor is doing, caching and a C-based implementation sound like
premature optimization to me. For example, pkg_resources does a lot of
things which seem quite orthogonal to me. While scanning every package
may be needed for namespace package support the "setuptools" way, this
is almost never useful for locating resources at runtime, but you pay
the price in both cases.
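
To illustrate the point about runtime resources, here is a minimal
sketch of locating a data file next to an already imported module
without scanning anything. The helper name resource_path is made up, and
it assumes the module was loaded from a regular directory (not a zip
file or an egg).

```python
import os
import sys


def resource_path(module_name, relative_name):
    """Return the path of a file shipped next to an imported module.

    Hypothetical helper: it only looks at the module's own __file__, so
    the cost does not depend on how many packages are installed.
    """
    module = sys.modules[module_name]  # the module must already be imported
    package_dir = os.path.dirname(os.path.abspath(module.__file__))
    return os.path.join(package_dir, relative_name)
```

The standard library's pkgutil.get_data() covers a similar use case and
also goes through the module's loader, so it is not limited to plain
directories.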

> For regular directories, I haven't profiled it, but the bottleneck is
> probably find_on_path(), the function that gets called for every
> directory in sys.path to look for .eggs.
>
> Now since the code mostly deals with string work besides the I/O,
> maybe it could be reimplemented in C.
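
For reference, the kind of per-directory scan being discussed can be
approximated as below. This is not the actual find_on_path() code, only
a rough sketch of the I/O pattern: one directory listing per sys.path
entry followed by string matching on the names. The function name
count_egg_entries is made up for illustration.

```python
import os
import sys


def count_egg_entries(paths=None):
    """Count *.egg and *.egg-info entries in each directory of `paths`.

    Returns a dict mapping each scanned directory to its number of
    matching entries. Defaults to scanning sys.path.
    """
    if paths is None:
        paths = sys.path
    counts = {}
    for entry in paths:
        if not os.path.isdir(entry):
            continue  # zip files and missing entries are skipped in this sketch
        names = os.listdir(entry)  # one directory read per path entry
        counts[entry] = sum(
            1 for name in names if name.lower().endswith((".egg", ".egg-info"))
        )
    return counts
```

The cost grows with both the number of path entries and the number of
names in each directory, which matches the linear scaling mentioned
earlier in the thread.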

cheers,

David

