[Numpy-discussion] "import numpy" is slow

Wed Jul 30 20:07:37 EDT 2008

On Jul 30, 2008, at 10:59 PM, Stéfan van der Walt wrote:
> I.e. most people don't start up NumPy all the time -- they import
> NumPy, and then do some calculations, which typically take longer than
> the import time.

Is that interactively, or is that through programs?

> For a benefit of 0.03s, I don't think it's worth it.

The final number with all the hundredths of a second added up to 0.08  
seconds, which was about 30% of the 'import numpy' cost.

> Numpy has a very flat namespace, for better or worse, which implies
> many imports.

I don't get the feeling that numpy is flat.  Python's stdlib is flat.  
Numpy has many 2- and 3-level modules.

>> Is the numpy recommendation that people should do:
>>
>>   import numpy
>>   numpy.fft.ifft(data)
>
> That's the way many people use it.

The normal Python way is:

   from numpy import fft
   fft.ifft(data)

because in most packages, parent modules don't import all of their  
children.  I acknowledge that existing numpy code will break with my  
desired change, as in this example from the tutorial

   import numpy
   import pylab
   # Build a vector of 10000 normal deviates with variance 0.5^2 and mean 2
   mu, sigma = 2, 0.5
   v = numpy.random.normal(mu,sigma,10000)

and I am not saying to change this code.  Instead, I am asking for  
limits on the eagerness, with a long-term goal of minimizing its use.

>> Why is [ctypeslib] so important that it should be in the
>> top-level namespace?
>
> It's a single Python file -- does it make much of a difference?

The file imports other files.  Here's the import chain:

  ctypeslib: 0.047 (numpy)
   ctypes: -1.000 (ctypeslib)
    _ctypes: 0.003 (ctypes)
    gestalt: -1.000 (ctypes)
    ma: 0.005 (numpy)
     extras: 0.001 (ma)
      numpy.lib.index_tricks: 0.000 (extras)
      numpy.lib.polynomial: 0.000 (extras)

(The "-1.000" indicates a bug in my instrumentation script, which I  
worked around with a -1.0 value.)
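
For anyone who wants to produce a similar chain, here is a minimal sketch of one way to do it.  It is not the instrumentation script used above; 'trace_import' is a hypothetical helper name.  It temporarily wraps builtins.__import__ and records the nesting depth and wall-clock time of every import statement that runs.  The times are inclusive of child imports, records appear in completion order (children before parents), and modules that are already cached show up with near-zero times.

```python
import builtins
import time


def trace_import(module_name):
    """Import module_name; return (depth, name, seconds) per import call."""
    real_import = builtins.__import__
    records = []
    depth = 0

    def timed_import(name, globals=None, locals=None, fromlist=(), level=0):
        nonlocal depth
        depth += 1
        start = time.perf_counter()
        try:
            return real_import(name, globals, locals, fromlist, level)
        finally:
            elapsed = time.perf_counter() - start
            depth -= 1
            # Recorded when the import finishes, so children come first.
            records.append((depth, name, elapsed))

    builtins.__import__ = timed_import
    try:
        timed_import(module_name)
    finally:
        # Always restore the real hook, even if the import fails.
        builtins.__import__ = real_import
    return records


for depth, name, seconds in trace_import("ctypes"):
    print("%s%s: %.3f" % (" " * depth, name, seconds))
```

Restoring the hook in a 'finally' block keeps a failed import from leaving the interpreter permanently instrumented.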

Every numpy program ends up importing ctypes, because numpy eagerly  
imports 'ctypeslib' to make it accessible as a top-level variable.

>>> import time
>>> if 1:
...   t1 = time.time()
...   import ctypes
...   t2 = time.time()
...
>>> t2-t1
0.032159090042114258

That's 10% of the import time.
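
A measurement like this is only meaningful when the module is not already cached in sys.modules.  The sketch below is one way to repeat it in a fresh interpreter each time; 'time_fresh_import' is a hypothetical helper name, and the numbers will of course vary by machine and by whether the files are in the disk cache.

```python
import subprocess
import sys


def time_fresh_import(module_name):
    """Return the seconds taken to import module_name in a new interpreter."""
    code = (
        "import time\n"
        "t1 = time.perf_counter()\n"
        f"import {module_name}\n"
        "t2 = time.perf_counter()\n"
        "print(t2 - t1)\n"
    )
    # A child process starts with an empty import cache for this module.
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


print(time_fresh_import("ctypes"))
```

Running it several times and taking the smallest value gives a rough warm-cache figure.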

>> In my opinion, this assistance is counter to standard practice in
>> effectively every other Python package.  I don't see the benefit.
>
> How do you propose we change this?

If I had my way, remove things like (in numpy/__init__.py)

     import linalg
     import fft
     import random
     import ctypeslib
     import ma

but leave the list of submodules in "__all__" so that "from numpy  
import *" works.  Perhaps add a top-level function 'import_all()'  
which mimics the current behavior, and have IPython know about it so  
interactive users get it automatically.  Or something like that.
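
To make the idea concrete, here is a rough sketch of what such a helper could look like.  It is not existing numpy code; the signature is an assumption for illustration, and the demonstration uses the stdlib 'xml' package, which does not import its subpackages eagerly.

```python
import importlib


def import_all(package_name, submodule_names):
    """Import each named submodule and return the parent package.

    After this call the submodules are reachable as attributes of the
    package, which is what eager imports in __init__.py provide today.
    """
    package = importlib.import_module(package_name)
    for name in submodule_names:
        importlib.import_module(package_name + "." + name)
    return package


xml = import_all("xml", ["dom", "sax"])
print(xml.dom.__name__)   # prints "xml.dom"
```

Programs that care about startup time would skip the call and import only the submodules they use.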

Yes, I know the numpy team won't change this behavior.  I want to  
know what you all will consider changing.

Something more concrete: change the top-level definitions in 'numpy'  
from

     from testing import Tester
     test = Tester().test
     bench = Tester().bench

to

   def test(label='fast', verbose=1, extra_argv=None, doctests=False,
            coverage=False, **kwargs):
       # Import lazily so 'import numpy' does not pay for numpy.testing.
       from testing import Tester
       return Tester().test(label, verbose, extra_argv, doctests,
                            coverage, **kwargs)

and do something similar for 'bench'.  Note that numpy currently  
implements

   numpy.test  <-- this is a Tester().test
   numpy.testing.test <-- another Tester().test bound method

so there's some needless and distracting, but extremely minor,  
duplication.

>> Getting rid of these functions, and thus getting rid of the import
>> speeds numpy startup time by 3.5%.
>
> While I appreciate you taking the time to find these niggles, but we
> are short on developer time as it is.  Asking them to spend their
> precious time on making a 3.5% improvement in startup time does not
> make much sense.  If you provide a patch, on the other hand, it would
> only take a matter of seconds to decide whether to apply or not.
> You've already done most of the sleuth work.

I wrote that I don't know the reasons why the design is as it  
is.  Are those functions ("english_upper", "english_lower",  
"english_capitalize") expected as part of the public interface for  
the module?  The lack of a "_" prefix and their verbose docstrings  
imply that they are for general use.  In that case, they can't  
easily be gotten rid of.  Yet it doesn't make sense for them to be  
part of 'numerictypes'.

Why would I submit a patch if there's no way those definitions will  
disappear, for reasons I am not aware of?

I am not asking you all to make these changes.  I'm asking how much  
change is acceptable, what the restrictions are, and why they are  
there.

I also haven't yet figured out how to get the regression tests to  
run, and I'm not going to contribute patches without at least passing  
that bare minimum.  BTW, how do I do that?  In the top-level there's  
a 'test.sh' command but when I run it I get:

% mkdir tmp
% bash test.sh
Running from numpy source directory.
Traceback (most recent call last):
   File "setupscons.py", line 56, in <module>
     raise DistutilsError('\n'.join(msg))
distutils.errors.DistutilsError: You cannot build numpy with scons without the numscons package
(Failure was: No module named numscons)
test.sh: line 11: cd: /Users/dalke/cvses/numpy/tmp: No such file or directory

and when I run 'nosetests' in the top-level directory I get:

ImportError: Error importing numpy: you should not try to import numpy from
         its source directory; please exit the numpy source tree, and relaunch
         your python intepreter from there.

I couldn't find (in a cursory search) instructions for running  
self-tests or regression tests.

>> I could probably get another 0.05 seconds if I dug around more, but I
>> can't without knowing what use case numpy is trying to achieve.  Why
>> are all those ancillary modules (testing, ctypeslib) eagerly loaded
>> when there seems no need for that feature?
>
> Need is relative.  You need fast startup time, but most of our users
> need quick access to whichever functions they want (and often use from
> an interactive terminal).  I agree that "testing" and "ctypeslib" do
> not belong in that category, but they don't seem to do much harm
> either.

If there is no need for those features then I'll submit a patch to  
remove them.

There is some need, and there are many ways to handle that need.  The  
current solution in numpy is to import everything.  Again I ask, does  
*everything* (like 'testing' and 'ctypeslib') need to be imported  
eagerly?  In your use case of user-driven exploratory development the  
answer is no - the users described above rarely want access to  
those packages because those packages are best used in automated  
environments.  E.g., why write tests which are only used once?

				Andrew
				dalke at dalkescientific.com