seeking advice on a fast string->array conversion

I am wrapping up a small package to parse a particular ascii-encoded file format generated by a program we use heavily here at the lab. (In the unlikely event that you work at a synchrotron, and use Certified Scientific's "spec" program, and are actually interested, the code is currently available at https://github.com/darrendale/praxes/tree/specformat/praxes/io/spec/ .)

I have been benchmarking the project against another python package developed by a colleague, which is an extension module written in pure C. My python/cython project takes about twice as long to parse and index a file (~0.8 seconds for 100MB), which is acceptable. However, actually converting ascii strings to numpy arrays, which is done using numpy.fromstring, takes a factor of 10 longer than the extension module. So I am wondering about the performance of np.fromstring:

import time
import numpy as np

s = b'1 ' * 2048 * 1200
d = time.time()
x = np.fromstring(s)
print time.time() - d

Actually, I do use spec when I have synchrotron experiments. But why are your files so large?

On Nov 16, 2010 9:20 AM, "Darren Dale" <dsdale24@gmail.com> wrote:
[clip]

Sorry, I accidentally hit send long before I was finished writing. But to answer your question, they contain many 2048-element multi-channel analyzer spectra.

Darren

On Tue, Nov 16, 2010 at 9:26 AM, william ratcliff <william.ratcliff@gmail.com> wrote:
Actually, I do use spec when I have synchrotron experiments. But why are your files so large?
On Nov 16, 2010 9:20 AM, "Darren Dale" <dsdale24@gmail.com> wrote:
[clip]

Tue, 16 Nov 2010 09:20:29 -0500, Darren Dale wrote: [clip]
module. So I am wondering about the performance of np.fromstring:
Fromstring is slow, probably because it must work around the locale-dependence of the underlying C parsing functions. Moreover, the NumPy parsing mechanism generates many indirect calls.

-- 
Pauli Virtanen
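For anyone who wants to see the gap on their own machine, here is a minimal timing sketch (not from the thread); the split()-based conversion is only a rough baseline, and the absolute numbers will depend on the numpy version and hardware:

import time
import numpy as np

s = '1.0 ' * (2048 * 1200)

t0 = time.time()
a = np.fromstring(s, dtype=np.float64, sep=' ')   # text-parsing path discussed in the thread
t1 = time.time()
b = np.array(s.split(), dtype=np.float64)         # naive pure-Python baseline
t2 = time.time()

print('fromstring:  %.3f s' % (t1 - t0))
print('split+array: %.3f s' % (t2 - t1))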

Apologies, I accidentally hit send...

On Tue, Nov 16, 2010 at 9:20 AM, Darren Dale <dsdale24@gmail.com> wrote:
[clip]
import time
import numpy as np

s = b'1 ' * 2048 * 1200
d = time.time()
x = np.fromstring(s, dtype='d', sep=b' ')
print time.time() - d

That takes about 1.3 seconds on my machine. A similar metric for the extension module is to load 1200 of these 2048-element arrays from the file:

d = time.time()
x = [s.mca(i+1) for i in xrange(1200)]  # here s is the extension module's scan object, not the byte string above
print time.time() - d

That takes about 0.127 seconds on my machine. This discrepancy is unacceptable for my use case, so I need to develop an alternative to fromstring. Here is a bit of testing with cython:

import time

cdef extern from 'stdlib.h':
    double atof(char*)

py_string = '100'
cdef char* c_string = py_string
cdef int i, j
i = 0
j = 2048 * 1200
d = time.time()
while i < j:
    c_string = py_string
    val = atof(c_string)
    i += 1
print val, time.time() - d

That loop takes 0.33 seconds to execute, which is a good start. I need some help converting this example to return an actual numpy array. Could anyone please offer a suggestion?

Thanks,
Darren

Tue, 16 Nov 2010 09:41:04 -0500, Darren Dale wrote: [clip]
That loop takes 0.33 seconds to execute, which is a good start. I need some help converting this example to return an actual numpy array. Could anyone please offer a suggestion?
Easiest way is probably to use ndarray buffers and resize them when needed. For example: https://github.com/pv/scipy-work/blob/enh/interpnd-smooth/scipy/spatial/qhul...
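The linked example is truncated above, but the pattern Pauli is describing is roughly the following (a plain-Python sketch with made-up names, just to show the shape of the approach; the real win comes from doing the inner loop in Cython):

import numpy as np

def parse_floats(tokens):
    # Accumulate values in an ndarray buffer, doubling its capacity as needed,
    # instead of appending to a Python list and converting at the end.
    buf = np.empty(1024, dtype=np.float64)
    n = 0
    for tok in tokens:
        if n == buf.shape[0]:
            buf = np.resize(buf, 2 * buf.shape[0])  # reallocates and copies the old contents
        buf[n] = float(tok)
        n += 1
    return buf[:n].copy()  # trim to the values actually written

# usage
values = parse_floats('1 2 3 4 5'.split())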

On Tue, Nov 16, 2010 at 9:55 AM, Pauli Virtanen <pav@iki.fi> wrote:
[clip]
Easiest way is probably to use ndarray buffers and resize them when needed. For example:
https://github.com/pv/scipy-work/blob/enh/interpnd-smooth/scipy/spatial/qhul...
Thank you Pauli. That makes it *incredibly* simple:

import time
cimport numpy as np
import numpy as np

cdef extern from 'stdlib.h':
    double atof(char*)

def test():
    py_string = '100'
    cdef char* c_string = py_string
    cdef int i, j
    cdef double val
    i = 0
    j = 2048*1200
    cdef np.ndarray[np.float64_t, ndim=1] ret
    ret_arr = np.empty((2048*1200,), dtype=np.float64)
    ret = ret_arr
    d = time.time()
    while i < j:
        c_string = py_string
        ret[i] = atof(c_string)
        i += 1
    ret_arr.shape = (1200, 2048)
    print ret_arr, ret_arr.shape, time.time() - d

The loop now takes only 0.11 seconds to execute. Thanks again.
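For readers following along, once compiled the function above is called like any other extension function (a minimal sketch, assuming the module is built as "test_open", the name used in the setup script later in the thread):

# hypothetical usage, after building the extension in place, e.g.:
#   python setup.py build_ext --inplace
import test_open

test_open.test()   # prints the (1200, 2048) array, its shape, and the elapsed time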

On 11/16/10 7:31 AM, Darren Dale wrote:
[clip]
Darren,

It's interesting that you found fromstring() so slow -- I've put some time into trying to get fromfile() and fromstring() to be a bit more robust and featureful, but found it to be some really painful code to work on -- but it didn't dawn on me that it would be slow too! I saw all the layers of function calls, but I still thought that would be minimal compared to the actual string parsing. I guess not. Shows that you never know where your bottlenecks are without profiling.

"Slow" is relative, of course, but since the whole point of fromfile/string is performance (otherwise, we'd just parse with python), it would be nice to get them as fast as possible.

I had been thinking that the way to make a good fromfile was Cython, so you've inspired me to think about it some more. Would you be interested in extending what you're doing to a more general purpose tool?

Anyway, a comment or two:
cdef extern from 'stdlib.h':
    double atof(char*)
One thing I found with the current numpy code is that the use of the ato* functions is a source of a lot of bugs (all of them?). The core problem is error handling -- you have to do a lot of pointer checking to see if a call was successful, and with the fromfile code, that error handling is not done in all the layers of calls.

Anyone know what the advantage of ato* is over scanf()/fscanf()?

Also, why are you doing string parsing rather than parsing the files directly, wouldn't that be a bit faster?

I've got some C extension code for simple parsing of text files into arrays of floats or doubles (using fscanf). I'd be curious how the performance compares to what you've got. Let me know if you're interested.

-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov

On Tue, Nov 16, 2010 at 11:46 AM, Christopher Barker <Chris.Barker@noaa.gov> wrote:
[clip]
Darren,
It's interesting that you found fromstring() so slow -- I've put some time into trying to get fromfile() and fromstring() to be a bit more robust and featureful, but found it to be some really painful code to work on -- but it didn't dawn on me that it would be slow too! I saw all the layers of function calls, but I still thought that would be minimal compared to the actual string parsing. I guess not. Shows that you never know where your bottlenecks are without profiling.
"Slow" is relative, of course, but since the whole point of fromfile/string is performance (otherwise, we'd just parse with python), it would be nice to get them as fast as possible.
I had been thinking that the way to make a good fromfile was Cython, so you've inspired me to think about it some more. Would you be interested in extending what you're doing to a more general purpose tool?
Anyway, a comment or two:
cdef extern from 'stdlib.h':
    double atof(char*)
One thing I found with the current numpy code is that the use of the ato* functions is a source of a lot of bugs (all of them?). The core problem is error handling -- you have to do a lot of pointer checking to see if a call was successful, and with the fromfile code, that error handling is not done in all the layers of calls.
In my case, I am making an assumption about the integrity of the file.
Anyone know what the advantage of ato* is over scanf()/fscanf()?
Also, why are you doing string parsing rather than parsing the files directly, wouldn't that be a bit faster?
Rank inexperience, I guess. I don't understand what you have in mind. scanf/fscanf don't actually convert strings to numbers, do they?
I've got some C extension code for simple parsing of text files into arrays of floats or doubles (using fscanf). I'd be curious how the performance compares to what you've got. Let me know if you're interested.
I'm curious, yes. Darren

On 11/16/10 8:57 AM, Darren Dale wrote:
In my case, I am making an assumption about the integrity of the file.
That does make things easier, but less universal. I guess this is the whole trade-off about "reusable code". It sure is a lot easier to write code that does the one thing you need than something general purpose.
Anyone know what the advantage of ato* is over scanf()/fscanf()?
Also, why are you doing string parsing rather than parsing the files directly, wouldn't that be a bit faster?
Rank inexperience, I guess. I don't understand what you have in mind.
If your goal is to read numbers from an ascii file, you can use fromfile() directly, rather than reading the file (or some of it) into a string and then using fromstring(). Also, in C, you can use fscanf to read the file directly (of course, under the hood, it's putting stuff in strings somewhere along the line, but presumably in an optimized way).
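In other words, something along these lines (a sketch, assuming a whitespace-separated text file named test.dat):

import numpy as np

# sep=' ' switches fromfile() from binary mode to ascii parsing
arr = np.fromfile('test.dat', dtype=np.float64, sep=' ')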
scanf/fscanf don't actually convert strings to numbers, do they?
Yes, that's exactly what they do. http://en.wikipedia.org/wiki/Scanf

The C lib may very well use ato* under the hood. My idea at this point is to write a function in Cython that takes a file and a numpy dtype, converts the dtype to a scanf format string, then calls fscanf (or scanf) to parse out the file. My existing scanner code more or less does that, but the format string is hard-coded to be either for floats or doubles.
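As a sketch of just the dtype-to-format-string step described here (the mapping below is a guessed subset, not the actual scanner code, and the integer specifiers are platform-dependent):

import numpy as np

# hypothetical mapping from numpy dtypes to C scanf conversion specifiers
_SCANF_SPEC = {
    np.dtype(np.float32): '%f',
    np.dtype(np.float64): '%lf',
    np.dtype(np.int32):   '%d',
}

def dtype_to_scanf(dtype):
    return _SCANF_SPEC[np.dtype(dtype)]

print(dtype_to_scanf('d'))    # -> %lf
print(dtype_to_scanf('f4'))   # -> %f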
I've got some C extension code for simple parsing of text files into arrays of floats or doubles (using fscanf). I'd be curious how the performance compares to what you've got. Let me know if you're interested.
I'm curious, yes.
OK -- I'll whip up a test similar to yours -- stay tuned!

-Chris

On 11/16/10 10:01 AM, Christopher Barker wrote:
OK -- I'll whip up a test similar to yours -- stay tuned!
Here's what I've done:

import numpy as np
from maproomlib.utility import file_scanner

def gen_file():
    f = file('test.dat', 'w')
    for i in range(1200):
        f.write('1 ' * 2048)
        f.write('\n')
    f.close()

def read_file1():
    """ read unknown length: doubles"""
    f = file('test.dat')
    arr = file_scanner.FileScan(f)
    f.close()
    return arr

def read_file2():
    """ read known length: doubles"""
    f = file('test.dat')
    arr = file_scanner.FileScanN(f, 1200*2048)
    f.close()
    return arr

def read_file3():
    """ read known length: singles"""
    f = file('test.dat')
    arr = file_scanner.FileScanN_single(f, 1200*2048)
    f.close()
    return arr

def read_fromfile1():
    """ read unknown length with fromfile(): singles"""
    f = file('test.dat')
    arr = np.fromfile(f, dtype=np.float32, sep=' ')
    f.close()
    return arr

def read_fromfile2():
    """ read unknown length with fromfile(): doubles"""
    f = file('test.dat')
    arr = np.fromfile(f, dtype=np.float64, sep=' ')
    f.close()
    return arr

def read_fromstring1():
    """ read unknown length with fromstring(): singles"""
    f = file('test.dat')
    str = f.read()
    arr = np.fromstring(str, dtype=np.float32, sep=' ')
    f.close()
    return arr

And the results (ipython's timeit):

In [40]: timeit test.read_fromfile1()
1 loops, best of 3: 561 ms per loop

In [41]: timeit test.read_fromfile2()
1 loops, best of 3: 570 ms per loop

In [42]: timeit test.read_file1()
1 loops, best of 3: 336 ms per loop

In [43]: timeit test.read_file2()
1 loops, best of 3: 341 ms per loop

In [44]: timeit test.read_file3()
1 loops, best of 3: 515 ms per loop

In [46]: timeit test.read_fromstring1()
1 loops, best of 3: 301 ms per loop

So my file_scanner is faster, but not radically so, than fromfile(). However, reading the whole file into a string, then using fromstring(), is in fact the fastest method -- interesting -- shows you why you need to profile!

Also, with my code, reading singles is slower than doubles -- odd. Perhaps the C lib fscanf reads doubles anyway, then converts to singles?

Anyway, for my needs, my file_scanner and fromfile() are fast enough, and much faster than parsing the files with Python. My issue with fromfile is flexibility and robustness -- it's buggy in the face of ill-formed files. See the list archives and the bug reports for more detail. Still, it seems your very basic method is indeed a faster way to go.

I've enclosed the files. It's currently built as part of a larger lib, so no setup.py -- though it could be written easily enough.

-Chris
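To make the robustness point concrete, here is a small sketch (not from the thread); exactly what fromstring()/fromfile() do with a bad token has varied between numpy versions, which is part of the problem:

import numpy as np

s = '1.0 2.0 oops 4.0'

# fromstring gives no error indication: depending on the numpy version it
# silently truncates or misparses at the bad token
a = np.fromstring(s, dtype=np.float64, sep=' ')
print(a)

# a float()-based parser at least raises, so the caller knows the data is bad
try:
    b = np.array([float(tok) for tok in s.split()], dtype=np.float64)
except ValueError as err:
    print('bad token: %s' % err)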

On Tue, Nov 16, 2010 at 10:31 AM, Darren Dale <dsdale24@gmail.com> wrote:
[clip]
One follow-up issue: I can't cythonize this code for python-3. I've installed numpy with the most recent changes to the 1.5.x maintenance branch, then re-installed cython-0.13, but when I run "python3 setup.py build_ext --inplace" with this setup script:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [
        Extension("test_open", ["test_open.pyx"],
                  include_dirs=[numpy.get_include()])
    ]
)

I get the following error. Any suggestions what I need to fix, or should I report it to the cython list?

$ python3 setup.py build_ext --inplace
running build_ext
cythoning test_open.pyx to test_open.c

Error converting Pyrex file to C:
------------------------------------------------------------
...
        # For use in situations where ndarray can't replace PyArrayObject*,
        # like PyArrayObject**.
        pass

    ctypedef class numpy.ndarray [object PyArrayObject]:
        cdef __cythonbufferdefaults__ = {"mode": "strided"}
                                                ^
------------------------------------------------------------

/Users/darren/.local/lib/python3.1/site-packages/Cython/Includes/numpy.pxd:173:49: "mode" is not a buffer option

Error converting Pyrex file to C:
------------------------------------------------------------
...
    cdef char* c_string = py_string
    cdef int i, j
    cdef double val
    i = 0
    j = 2048*1200
    cdef np.ndarray[np.float64_t, ndim=1] ret
        ^
------------------------------------------------------------

/Users/darren/temp/test/test_open.pyx:16:8: 'ndarray' is not a type identifier

building 'test_open' extension
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/darren/.local/lib/python3.1/site-packages/numpy/core/include -I/opt/local/Library/Frameworks/Python.framework/Versions/3.1/include/python3.1 -c test_open.c -o build/temp.macosx-10.6-x86_64-3.1/test_open.o
test_open.c:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
error: command '/usr/bin/gcc-4.2' failed with exit status 1
participants (4):
- Christopher Barker
- Darren Dale
- Pauli Virtanen
- william ratcliff