[Numpy-discussion] fromstring() is slow, no really!
Chris Barker
chris.barker at noaa.gov
Thu May 17 11:13:55 EDT 2012
Anthony,
Thanks for looking into this. A few other notes about fromstring()
(and fromfile()).
Frankly, they haven't gotten much love. As you have seen, they are
less than optimized and kind of buggy. Actually, not really buggy,
but not robust in the face of malformed input: in some cases they give
wrong results rather than throwing an error, for instance.
So they really do need some attention.
On the other hand -- folks are working on various ways to optimize
reading data from text files (and maybe strings) so that may be a
better way to go.
If you google "fromstring barker numpy" you'll find a thread or two
with what I learned, and pointers to a couple of tickets. What I do
remember:
The use of atof and friends is complicated because there are Python
versions that extend the C lib versions, and numpy versions that extend
those (for better NaN handling, for instance).
The lack of robustness stems from the fact that the error checking is
not done right when calling atof and friends, i.e. you need to check
whether the pointer was incremented to see if a value was successfully
read. With the layered calls to numpy and Python versions, I found it
very hard to fix this.
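As a rough illustration of the pointer check described above, here is a pure-Python analogue of the C idiom (in C, strtod() reports where it stopped through an end pointer). The names `_FLOAT_PREFIX`, `parse_prefix`, and `parse_all` are hypothetical, and the regular expression only approximates what the C routines accept (no inf, nan, or hex floats). It is a sketch of the check, not numpy's actual code.

```python
import re

# Hypothetical pattern: optional whitespace, sign, digits, optional exponent.
_FLOAT_PREFIX = re.compile(r"\s*[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def parse_prefix(text, pos):
    """Parse one float starting at pos; return (value, new_pos).

    Like the C routines, a failed parse returns 0.0 and leaves the
    position unchanged, so the caller must compare positions.
    """
    match = _FLOAT_PREFIX.match(text, pos)
    if match is None:
        return 0.0, pos
    return float(match.group()), match.end()


def parse_all(text):
    """Parse whitespace-separated floats, raising on malformed input."""
    values = []
    pos = 0
    while text[pos:].strip():  # stop once only whitespace remains
        value, end = parse_prefix(text, pos)
        if end == pos:
            # The position did not advance: nothing was read, so the
            # returned 0.0 is not a real value.
            raise ValueError(f"malformed number at offset {pos}")
        values.append(value)
        pos = end
    return values
```

Without the `end == pos` comparison, the 0.0 from a failed parse would be indistinguishable from a real zero in the input, which is the kind of silently wrong result mentioned above.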
Profile carefully to check your theory about limited over-allocation
of memory being the source of the performance issues. When I've
tested similar code, it made little difference: allocating and
copying memory is actually pretty fast. If you re-allocate and copy
on every single append, it's slow, yes, but I found virtually no
difference between over-allocating by, say, 10% or 50% (not sure what
the bottom reasonable value was there).
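One way to see why the growth factor matters so little is to count the copies instead of timing them. The sketch below simulates an append-only buffer and reports how many resizes and element copies each growth policy causes. The function `simulate_appends` and its `initial_capacity` default are assumptions made for illustration. It models only the bookkeeping, not numpy's allocator or real timings.

```python
def simulate_appends(n_items, growth, initial_capacity=16):
    """Count resizes and copied elements when appending n_items one by one.

    growth is the fractional over-allocation on each resize (0.5 means
    50%); growth=0 models re-allocating on every single append.
    """
    capacity = initial_capacity
    resizes = 0
    copied = 0
    for size in range(n_items):
        if size == capacity:
            # Buffer is full: grow it and copy the existing elements over.
            capacity = max(capacity + 1, int(capacity * (1 + growth)))
            resizes += 1
            copied += size
    return resizes, copied


if __name__ == "__main__":
    n = 100_000
    for growth in (0.0, 0.1, 0.5):
        resizes, copied = simulate_appends(n, growth)
        print(f"growth={growth:.0%}: {resizes} resizes, "
              f"{copied / n:.1f} copies per item")
```

With any fixed proportional growth, the copies per item stay bounded by a small constant, while growing by one element at a time makes the total copies grow quadratically with the number of items. That is consistent with 10% and 50% being hard to tell apart in practice.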
Good luck,
-Chris
On Sun, May 13, 2012 at 4:28 PM, Anthony Scopatz <scopatz at gmail.com> wrote:
> Hello All,
>
> This week, while doing some optimization, I found that np.fromstring()
> is significantly slower than many alternatives out there. This function
> basically does two things: (1) it splits the string and (2) it converts the
> data to the desired type.
>
> There isn't much we can do about the conversion/casting so what I
> mean is that the string splitting implementation is slow.
>
> To simplify the discussion, I will just talk about string to 1d float64
> arrays.
> I have also issued pull request #279 [1] to numpy with some sample code.
> Timings can be seen in the ipython notebook here.
>
> It turns out that using str.split() and np.array() is 20 - 35% faster,
> which was non-intuitive to me. That is to say:
>
> rawdata = s.split()
> data = np.array(rawdata, dtype=float)
>
>
> is faster than
>
> data = np.fromstring(s, sep=" ", dtype=float)
>
>
> The next thing to try, naturally, was Cython. This did not change the
> timings much for these two strategies. However, being in Cython
> allows us to call atof() directly. My implementation is based on a
> previous thread on this topic [2]. However, in the example in [2], the
> string was hard coded, contained only one data value, and did not need
> to be split. Thus they saw a dramatic 10x speed boost. To deal with the
> more realistic case, I first just continued to use str.split(). This
> took 35 - 50% less time than np.fromstring().
>
> Finally, using the strtok() function in the C standard library to call
> atof() while we tokenize the string cuts the run time by 50 - 60%
> relative to the baseline np.fromstring() time.
>
> Timings
> ------------
> In [1]: import fromstr
>
> In [2]: s = "100.0 " * 100000
>
> In [3]: timeit fromstr.fromstring(s)
> 10 loops, best of 3: 20.7 ms per loop
>
> In [4]: timeit fromstr.split_and_array(s)
> 100 loops, best of 3: 16.1 ms per loop
>
> In [6]: timeit fromstr.split_atof(s)
> 100 loops, best of 3: 13.5 ms per loop
>
> In [7]: timeit fromstr.token_atof(s)
> 100 loops, best of 3: 8.35 ms per loop
>
> Possible Explanation
> ----------------------------------
> Numpy's fromstring() function may be found here [3]. This code is a bit
> hard to follow, but it uses the array_from_text() function [4]. On the
> other hand, str.split() [5] uses a macro function SPLIT_ADD(). The
> difference between these, I believe, is that str.split() over-allocates
> the size of the list in a more aggressive way than array_from_text().
> This leads to fewer resizes and thus fewer memory copies.
>
> This would also explain why the tokenize implementation is the fastest,
> since it pre-allocates the maximum possible array size and then slices it
> down. No resizes are present in this function, though it requires more
> memory up front.
>
> Summary (tl;dr)
> ------------------------
> np.fromstring() is slow because of the mechanism it uses to split strings.
>
> This is likely due to how many resize operations it must perform. While it
> need not be the *fastest* thing out there, it should probably be at least as
> fast as Python string splitting.
>
> No pull-request 'fixing' this issue was provided because I wanted to see
> what people thought and if / which option is worth pursuing.
>
> Be Well
> Anthony
>
> [1] https://github.com/numpy/numpy/pull/279
> [2] http://comments.gmane.org/gmane.comp.python.numeric.general/41504
> [3] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3699
> [4] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L3418
> [5] http://svn.python.org/view/python/tags/r271/Objects/stringlib/split.h?view=markup
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
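For reference, the pre-allocate-then-trim strategy from the quoted "Possible Explanation" can be sketched in plain Python with the standard-library array module. This is not the Cython/strtok() code from the pull request. The function name `token_floats` is hypothetical, and pure Python will not show the speed-up, so it only illustrates the allocation pattern.

```python
from array import array


def token_floats(text):
    """Parse whitespace-separated floats into a buffer allocated once."""
    n = len(text)
    # n characters can hold at most (n + 1) // 2 one-character tokens
    # separated by single spaces, so this is the worst-case size.
    max_tokens = (n + 1) // 2
    buf = array("d", [0.0]) * max_tokens  # single up-front allocation
    count = 0
    i = 0
    while i < n:
        # Skip the separator run before the next token.
        while i < n and text[i].isspace():
            i += 1
        if i == n:
            break
        # Find the end of the token and convert it in place.
        j = i
        while j < n and not text[j].isspace():
            j += 1
        buf[count] = float(text[i:j])
        count += 1
        i = j
    # Trim the unused tail; no resize ever happened while filling.
    return buf[:count]
```

For example, `token_floats("100.0 " * 5)` fills five slots of a 15-slot buffer and returns only those five values. The trade-off is the one noted in the quoted text: more memory up front in exchange for no resizes.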
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov