[Numpy-discussion] seeking advice on a fast string->array conversion

Tue Nov 16 11:46:19 EST 2010

On 11/16/10 7:31 AM, Darren Dale wrote:
> On Tue, Nov 16, 2010 at 9:55 AM, Pauli Virtanen<pav at iki.fi>  wrote:
>> Tue, 16 Nov 2010 09:41:04 -0500, Darren Dale wrote:
>> [clip]
>>> That loop takes 0.33 seconds to execute, which is a good start. I need
>>> some help converting this example to return an actual numpy array. Could
>>> anyone please offer a suggestion?

Darren,

It's interesting that you found fromstring() so slow -- I've put some
time into trying to get fromfile() and fromstring() to be a bit more
robust and full-featured, and found it to be some really painful code to
work on -- but it didn't dawn on me that it would be slow too! I saw all
the layers of function calls, but I still thought their cost would be
minimal compared to the actual string parsing. I guess not. It shows
that you never know where your bottlenecks are without profiling.
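
To make the "measure, don't guess" point concrete, here is a minimal timing sketch. It is not the code from this thread: the test string, the sizes, and the helper names with_fromstring and with_split are made up for illustration, and the numbers will vary by machine and numpy version.

```python
import timeit

import numpy as np

# Hypothetical test data: 100,000 copies of the token "100", space-separated.
n = 100_000
text = " ".join(["100"] * n)


def with_fromstring():
    # Text-mode parsing: a non-empty sep selects numpy's text parser.
    return np.fromstring(text, dtype=np.float64, sep=" ")


def with_split():
    # Tokenize in plain Python, then convert all tokens to float64 at once.
    return np.array(text.split(), dtype=np.float64)


for func in (with_fromstring, with_split):
    # Best of 3 runs, 5 calls per run, reported per call.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```

Both helpers return the same array, so whatever difference shows up is purely a matter of where the time goes.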

"Slow" is relative, of course, but since the whole point of
fromfile()/fromstring() is performance (otherwise, we'd just parse with
Python), it would be nice to get them as fast as possible.

I had been thinking that the way to make a good fromfile was Cython, so 
you've inspired me to think about it some more. Would you be interested 
in extending what you're doing to a more general purpose tool?

Anyway, a comment or two:
> cdef extern from 'stdlib.h':
>      double atof(char*)

One thing I found with the current numpy code is that the use of the
ato* functions is a source of a lot of bugs (all of them?). The core
problem is error handling: you have to do a lot of pointer checking to
see if a call was successful, and in the fromfile code that error
handling is not done in all the layers of calls.
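
A small sketch of the error-handling problem, calling the C library from Python through ctypes. The assumptions: a POSIX-like system where ctypes.CDLL(None) exposes the C library, and a helper name, parse_checked, that is made up for this example. This is not what numpy does internally; it only shows why atof() alone can't tell you a parse failed, and what the end-pointer check with strtod() looks like.

```python
import ctypes

# Assumption: a POSIX-like system where CDLL(None) exposes the C library.
libc = ctypes.CDLL(None)

libc.atof.argtypes = [ctypes.c_char_p]
libc.atof.restype = ctypes.c_double

libc.strtod.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_char_p)]
libc.strtod.restype = ctypes.c_double


def parse_checked(token):
    """Parse a bytes token with strtod(); return None unless all of it was consumed."""
    end = ctypes.c_char_p()
    value = libc.strtod(token, ctypes.byref(end))
    rest = end.value  # whatever strtod() left unparsed
    if not token or rest:
        # Empty input, no conversion at all, or trailing junk after the number.
        return None
    return value


# atof() gives the same answer for a real zero and for garbage.
print(libc.atof(b"0"), libc.atof(b"abc"))

# With the end pointer checked, the failures become visible.
print(parse_checked(b"100"), parse_checked(b"abc"), parse_checked(b"12abc"))
```

The check is only a couple of lines here, but it has to be repeated, and its result passed upward, at every layer that calls the parser.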

Anyone know what the advantage of ato* is over scanf()/fscanf()?

Also, why are you parsing strings rather than parsing the files
directly? Wouldn't that be a bit faster?
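
For reference, the two routes look roughly like this in plain numpy. It is a sketch with a made-up temporary data file, not a benchmark; which one is faster would need to be timed on real data.

```python
import os
import tempfile

import numpy as np

# Hypothetical data file: 1000 space-separated copies of "100".
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(" ".join(["100"] * 1000))
    path = f.name

try:
    # Route 1: read the whole file into a string, then parse the string.
    with open(path) as fh:
        from_text = np.fromstring(fh.read(), dtype=np.float64, sep=" ")

    # Route 2: let numpy parse the file itself.
    from_file = np.fromfile(path, dtype=np.float64, sep=" ")
finally:
    os.remove(path)

print(from_text.shape, from_file.shape)
```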

I've got some C extension code for simple parsing of text files into 
arrays of floats or doubles (using fscanf). I'd be curious how the 
performance compares to what you've got. Let me know if you're interested.

-Chris

> def test():
>      py_string = '100'
>      cdef char* c_string = py_string
>      cdef int i, j
>      cdef double val
>      i = 0
>      j = 2048*1200
>      cdef np.ndarray[np.float64_t, ndim=1] ret
>
>      ret_arr = np.empty((2048*1200,), dtype=np.float64)
>      ret = ret_arr
>
>      d = time.time()
>      while i<j:
>          c_string = py_string
>          ret[i] = atof(c_string)
>          i += 1
>      ret_arr.shape = (1200, 2048)
>      print ret_arr, ret_arr.shape, time.time()-d
>
> The loop now takes only 0.11 seconds to execute. Thanks again.

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov