Thanks to Fernando Perez and Travis Oliphant for pointing me to:
scipy.io.read_array
In testing, I've found that it's very slow (for my needs), though quite nifty in other ways, so I'm sure I'll find a use for it in the future.
Travis Oliphant wrote:
Alternatively, we could move some of the Python code in read_array to C to improve the speed.
That was beyond me, so I wrote a very simple module in C that does what
I want, and it is very much faster than either read_array or a straight
Python version. It has two functions:
FileScan(file)
    """
    Reads all the values in the rest of the ASCII file, and produces a
    Numeric vector full of Floats (C doubles).
    All text in the file that is not part of a floating point number is
    skipped over.
    """
FileScanN(file, N)
    """
    Reads N values from the ASCII file, and produces a Numeric vector of
    length N full of Floats (C doubles).
    Raises an exception if there are fewer than N numbers in the file.
    All text in the file that is not part of a floating point number is
    skipped over.
    After reading N numbers, the file is left before the next
    non-whitespace character in the file. This will often leave the file
    at the start of the next line, after scanning a line full of numbers.
    """
I implemented them separately, 'cause I wasn't sure how to deal with
optional arguments in a C function. They could easily be wrapped in a
single Python function if you want one interface.
FileScan was much more complex, as I had to deal with all the dynamic
memory allocation. I probably took a more complex approach to this than
I had to, but it was an exercise for me, being a newbie at C.
I also decided not to specify a shape for the resulting array, always
returning a rank-1 array, as that made the code easier, and you can
always set A.shape afterward. This could be put in a Python wrapper as well.
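Both wrapper ideas mentioned above (a single interface with an optional count, and an optional shape) could be sketched together in Python. The names make_reader and read, and the n and shape keyword arguments, are hypothetical. The two scanning functions are passed in rather than imported, so the sketch assumes nothing about the module beyond the two call signatures given in the docstrings.

```python
def make_reader(scan_all, scan_n):
    """Build one read() function from a FileScan-like and a FileScanN-like callable."""

    def read(file, n=None, shape=None):
        # No count given: read everything that is left in the file.
        if n is None:
            result = scan_all(file)
        else:
            result = scan_n(file, n)
        # Optionally reshape the rank-1 result, as with A.shape = (-1, 2).
        if shape is not None:
            result.shape = shape
        return result

    return read
```

With the module built, this would presumably be used as read = make_reader(FileScanner.FileScan, FileScanner.FileScanN), followed by A = read(file, shape=(-1, 2)).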
It has the obvious limitation that it only does doubles. I'd like to add
longs as well, but probably won't have a need for anything else. The way
memory is these days, it seems just as easy to read the widest types
(doubles and longs), and convert afterward if you want something smaller.
Here are the results of a quick benchmark (code below), run on a file of
63,000 lines with two comma-delimited numbers on each line, on a 1GHz P4
under Linux.
Reading with read_array
it took 16.351712 seconds to read the file with read_array
Reading with Standard Python methods
it took 2.832078 seconds to read the file with standard Python methods
Reading with FileScan
it took 0.444431 seconds to read the file with FileScan
Reading with FileScanN
it took 0.407875 seconds to read the file with FileScanN
As you can see, read_array is painfully slow for this kind of thing,
straight Python is OK, and FileScan is pretty darn fast: roughly 6 times
faster than straight Python and about 37 times faster than read_array.
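A comparable input file can be generated with something like the following. This is a sketch only: the function name write_test_file, the random values and their range, and the fixed seed are assumptions, since all that is stated about the original file is its line count and that each line holds two comma-delimited numbers. It is written for Python 3.

```python
import random


def write_test_file(path, n_lines=63000, seed=0):
    """Write n_lines lines, each holding two comma-delimited random floats."""
    rng = random.Random(seed)  # fixed seed so repeated runs produce the same file
    with open(path, "w") as f:
        for _ in range(n_lines):
            f.write("%f,%f\n" % (rng.uniform(-1000, 1000), rng.uniform(-1000, 1000)))
```

Calling write_test_file("JunkBig.txt") would produce a file with the name the benchmark expects.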
I've enclosed the C code and setup.py. If anyone wants to take a look,
use it, or offer suggestions or bug fixes, that would be great.
In particular, I don't think I've structured the code very well, and
there could be a memory leak, which I have not tested carefully for.
Tested only on Linux with Python 2.3.3 and Numeric 23.1. If someone wants
to port it to numarray, that would be great too.
-Chris
The benchmark:
import time
from Numeric import array, product
import FileScanner

def test6():
    """
    Testing various IO options
    """
    from scipy.io import array_import

    filename = "JunkBig.txt"

    file = open(filename)
    print "Reading with read_array"
    start = time.time()
    A = array_import.read_array(file, ",")
    print "it took %f seconds to read the file with read_array" % (time.time() - start)
    file.close()

    file = open(filename)
    print "Reading with Standard Python methods"
    start = time.time()
    A = []
    for line in file:
        A.append(map(float, line.strip().split(",")))
    A = array(A)
    print "it took %f seconds to read the file with standard Python methods" % (time.time() - start)
    file.close()

    file = open(filename)
    print "Reading with FileScan"
    start = time.time()
    A = FileScanner.FileScan(file)
    A.shape = (-1, 2)
    print "it took %f seconds to read the file with FileScan" % (time.time() - start)
    file.close()

    file = open(filename)
    print "Reading with FileScanN"
    start = time.time()
    # Reuse the element count from the FileScan result above.
    A = FileScanner.FileScanN(file, product(A.shape))
    A.shape = (-1, 2)
    print "it took %f seconds to read the file with FileScanN" % (time.time() - start)
    file.close()
--
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
#include "Python.h"
#include