[Numpy-discussion] fromstring() is slow, no really!

Anthony Scopatz scopatz at gmail.com
Sun May 13 19:28:30 EDT 2012


Hello All,

This week, while doing some optimization, I found that np.fromstring()
is significantly slower than many alternatives out there.  This function
basically does two things: (1) it splits the string and (2) it converts the
data to the desired type.

There isn't much we can do about the conversion/casting, so what I
mean is that the *string splitting implementation is slow*.

To simplify the discussion, I will just talk about string to 1d float64
arrays.  I have also issued pull request #279 [1] to numpy with some sample
code.  Timings can be seen in the ipython notebook here.

It turns out that using str.split() and np.array() is 20 - 35% faster,
which was non-intuitive to me.  That is to say:

rawdata = s.split()
data = np.array(rawdata, dtype=float)


is faster than

data = np.fromstring(s, sep=" ", dtype=float)
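
For anyone who wants to reproduce the comparison without the fromstr
module from the pull request, here is a minimal sketch.  The function names
mirror those in the timings below but are re-implemented here as an
assumption, not copied from the PR, and the absolute numbers will depend on
the machine and the numpy version.

```python
import timeit

import numpy as np

s = "100.0 " * 100000


def split_and_array(text):
    # Split on whitespace in Python, then let NumPy convert the tokens.
    return np.array(text.split(), dtype=float)


def fromstring(text):
    # Let NumPy split and convert in a single call.
    return np.fromstring(text, sep=" ", dtype=float)


# Both strategies should produce the same array.
assert np.array_equal(split_and_array(s), fromstring(s))

for func in (fromstring, split_and_array):
    # Best of 3 runs, 10 calls per run, reported per call.
    best = min(timeit.repeat(lambda: func(s), number=10, repeat=3)) / 10
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```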


The next thing to try, naturally, was Cython.  This did not change the
timings much for these two strategies.  However, being in Cython allows us
to call atof() directly.  My implementation is based on a previous thread
on this topic [2].  However, in the example in [2], the string was hard
coded, contained only one data value, and did not need to be split.  Thus
they saw a dramatic 10x speed boost.  To deal with the more realistic case,
I first just continued to use str.split().  This took 35 - 50% less time
than np.fromstring().

Finally, using the strtok() function in the C standard library to call
atof() while we tokenize the string cuts the run time further, to 50 - 60%
below the baseline np.fromstring() time.

Timings
------------
In [1]: import fromstr

In [2]: s = "100.0 " * 100000

In [3]: timeit fromstr.fromstring(s)
10 loops, best of 3: 20.7 ms per loop

In [4]: timeit fromstr.split_and_array(s)
100 loops, best of 3: 16.1 ms per loop

In [6]: timeit fromstr.split_atof(s)
100 loops, best of 3: 13.5 ms per loop

In [7]: timeit fromstr.token_atof(s)
100 loops, best of 3: 8.35 ms per loop
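
To relate these numbers to the percentages quoted earlier, the following
short calculation expresses each timing as a reduction relative to the
np.fromstring() baseline.  The values are copied from the session above;
the dictionary and variable names are just for illustration.

```python
# Timings in milliseconds, copied from the session above.
timings_ms = {
    "fromstring": 20.7,
    "split_and_array": 16.1,
    "split_atof": 13.5,
    "token_atof": 8.35,
}

baseline = timings_ms["fromstring"]
reductions = {}
for name, t in timings_ms.items():
    # Fraction of the baseline run time that this strategy saves.
    reductions[name] = 1.0 - t / baseline
    print(f"{name:16s} {t:6.2f} ms  {100 * reductions[name]:5.1f}% less time")
```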

Possible Explanation
----------------------------------
Numpy's fromstring() function may be found here [3].  This code is a bit
hard to follow, but it uses the array_from_text() function [4].  On the
other hand, str.split() [5] uses a macro function SPLIT_ADD().  I believe
the difference between these is that str.split() over-allocates the size
of the list in a more aggressive way than array_from_text().  This leads
to fewer resizes and thus fewer memory copies.
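
To illustrate why the growth policy matters, here is a small simulation
that counts how many reallocations are needed to append 100000 items one
at a time.  Both policies are hypothetical: neither is claimed to be what
array_from_text() or str.split() actually does, and the step size and
growth factor are arbitrary choices.

```python
def count_resizes(n_items, grow):
    """Count reallocations needed to append n_items one at a time.

    `grow` maps the current capacity to the next capacity.
    """
    capacity = 0
    resizes = 0
    for size in range(1, n_items + 1):
        if size > capacity:
            capacity = grow(capacity)
            resizes += 1
    return resizes


def fixed_step(capacity, step=1024):
    # Hypothetical policy: grow by a constant number of slots.
    return capacity + step


def geometric(capacity, factor=2):
    # Hypothetical policy: double the capacity each time.
    return max(1, capacity * factor)


n = 100000
print("fixed step:", count_resizes(n, fixed_step))
print("geometric: ", count_resizes(n, geometric))
```

The more aggressive policy needs far fewer resizes, and each resize may
involve copying everything stored so far.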

This would also explain why the tokenize implementation is the fastest,
since this pre-allocates the maximum possible array size and then slices it
down.  No resizes are present in this function, though it requires more
memory up front.
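
The pre-allocate-then-slice idea can be sketched in pure Python as
follows.  This is only an analogue of the approach, not the Cython
strtok()/atof() code from the pull request: the function name
token_prealloc is made up for this example, re.finditer() stands in for
strtok(), float() stands in for atof(), and it will not be fast.

```python
import re

import numpy as np


def token_prealloc(text):
    """Parse whitespace-separated floats into a pre-allocated array."""
    # Each token needs at least one character plus a separator, so a string
    # of length n holds at most (n + 1) // 2 tokens.
    max_tokens = (len(text) + 1) // 2
    data = np.empty(max_tokens, dtype=np.float64)

    count = 0
    # Walk the tokens one at a time instead of building a list first.
    for match in re.finditer(r"\S+", text):
        data[count] = float(match.group())
        count += 1

    # Slice down to the values actually parsed.  This is a view, so the
    # oversized buffer stays alive as long as the result does.
    return data[:count]


print(token_prealloc("100.0 " * 5))
```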

Summary (tl;dr)
------------------------
np.fromstring() is slow because of the mechanism it uses to split strings.
This is likely due to how many resize operations it must perform.  While it
need not be the *fastest* thing out there, it should probably be at least
as fast as Python string splitting.

No pull-request 'fixing' this issue was provided because I wanted to see
what people thought and if / which option is worth pursuing.

Be Well
Anthony

[1] https://github.com/numpy/numpy/pull/279
[2] http://comments.gmane.org/gmane.comp.python.numeric.general/41504
[3] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3699
[4] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3418
[5] http://svn.python.org/view/python/tags/r271/Objects/stringlib/split.h?view=markup