
Apologies, I accidentally hit send...
On Tue, Nov 16, 2010 at 9:20 AM, Darren Dale dsdale24@gmail.com wrote:
I am wrapping up a small package to parse a particular ascii-encoded file format generated by a program we use heavily here at the lab. (In the unlikely event that you work at a synchrotron, and use Certified Scientific's "spec" program, and are actually interested, the code is currently available at https://github.com/darrendale/praxes/tree/specformat/praxes/io/spec/ .)
I have been benchmarking the project against another python package developed by a colleague, which is an extension module written in pure C. My python/cython project takes about twice as long to parse and index a file (~0.8 seconds for 100MB), which is acceptable. However, actually converting ascii strings to numpy arrays, which is done using numpy.fromstring, takes a factor of 10 longer than the extension module. So I am wondering about the performance of np.fromstring:
    import time
    import numpy as np

    s = b'1 ' * 2048 * 1200
    d = time.time()
    x = np.fromstring(s, dtype='d', sep=b' ')
    print time.time() - d
That takes about 1.3 seconds on my machine. A similar metric for the extension module is to load 1200 of these 2048-element arrays from the file:
    d = time.time()
    x = [s.mca(i+1) for i in xrange(1200)]
    print time.time() - d
That takes about 0.127 seconds on my machine. This discrepancy is unacceptable for my use case, so I need to develop an alternative to fromstring. Here is a bit of testing with cython:
    import time

    cdef extern from 'stdlib.h':
        double atof(char*)

    py_string = '100'
    cdef char* c_string = py_string
    cdef int i, j
    i = 0  # start the loop counter explicitly
    j = 2048*1200

    d = time.time()
    while i < j:
        c_string = py_string
        val = atof(c_string)
        i += 1
    print val, time.time() - d
That loop takes 0.33 seconds to execute, which is a good start. I need some help converting this example to return an actual numpy array. Could anyone please offer a suggestion?
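To make the target concrete, here is a minimal pure-Python sketch of the behavior being asked for, not a tuned implementation. It assumes the number of values is known in advance (2048 per array in the benchmark above) and is written for Python 3. The name `parse_floats` is hypothetical. In Cython, the same preallocate-and-fill pattern would presumably write into a typed buffer and call `atof` in place of `float()`.

```python
import numpy as np

def parse_floats(text, n):
    """Parse n whitespace-separated numbers from text into a float64 array."""
    out = np.empty(n, dtype=np.float64)  # allocate the result once, up front
    tokens = text.split()
    if len(tokens) != n:
        raise ValueError("expected %d values, found %d" % (n, len(tokens)))
    for i, tok in enumerate(tokens):
        out[i] = float(tok)  # float() stands in for C's atof in this sketch
    return out

# Same shape as one array from the benchmark: 2048 values of "1"
x = parse_floats("1 " * 2048, 2048)
```

Allocating the output array before the loop, rather than building a list and converting it afterwards, is the part intended to carry over to a Cython version.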
Thanks, Darren