Re: [Numpy-discussion] How to read data from text files fast?

8 Jul 2004

Thanks to Fernando Perez and Travis Oliphant for pointing me to:
...
scipy.io.read_array
In testing, I've found that it's very slow (for my needs), though quite 
nifty in other ways, so I'm sure I'll find a use for it in the future.

Travis Oliphant wrote:
...
Alternatively, we could move some of the Python code in read_array to 
C to improve the speed.
That was beyond me, so I wrote a very simple module in C that does what 
I want, and it is much faster than read_array or a straight Python 
version. It has two functions:

FileScan(file)
"""
Reads all the values in the rest of the ascii file, and produces a Numeric
vector full of Floats (C doubles).

All text in the file that is not part of a floating point number is
skipped over.
"""

FileScanN(file, N)

"""
Reads N values in the ascii file, and produces a Numeric vector of
length N full of Floats (C doubles).

Raises an exception if there are fewer than N numbers in the file.

All text in the file that is not part of a floating point number is
skipped over.

After reading N numbers, the file is left before the next non-whitespace
character in the file. This will often leave the file at the start of
the next line, after scanning a line full of numbers.
"""
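
The behaviour described in the two docstrings can be approximated in pure Python. This is only a sketch in modern Python 3, not the C implementation: the regular expression is an assumption about what counts as a floating point number, the names file_scan and file_scan_n are hypothetical, it returns plain lists rather than Numeric arrays, and it expects a file opened in binary mode so that byte offsets can be passed to seek().

```python
import io
import re

# Assumption: what "a floating point number" looks like in the text.
_FLOAT_RE = re.compile(rb"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")
_SPACE_RE = re.compile(rb"\s*")


def file_scan(f):
    """Return every number in the rest of binary file f, skipping other text."""
    return [float(m.group().decode("ascii")) for m in _FLOAT_RE.finditer(f.read())]


def file_scan_n(f, n):
    """Return the next n numbers in binary file f.

    Raises ValueError if fewer than n numbers remain. Afterwards the file
    is positioned just before the next non-whitespace byte.
    """
    start = f.tell()
    data = f.read()  # simple, but reads the whole rest of the file
    values = []
    end = 0
    for match in _FLOAT_RE.finditer(data):
        if len(values) == n:
            break
        values.append(float(match.group().decode("ascii")))
        end = match.end()
    if len(values) < n:
        raise ValueError("End of File reached before all numbers found")
    # Skip trailing whitespace so the next read starts on real content.
    end = _SPACE_RE.match(data, end).end()
    f.seek(start + end)
    return values
```

For example, with io.BytesIO(b"x, y\n1.5, 2\n3e1, -4\n"), file_scan_n(f, 2) skips the header text, returns the two numbers on the second line, and leaves the file at the start of the third line.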

I implemented them separately, 'cause I wasn't sure how to deal with 
optional arguments in a C function. They could easily be wrapped in a 
Python function if you wanted one interface.
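
A minimal sketch of such a wrapper, written in modern Python 3. The names make_scanner and scan are hypothetical; the two scanning functions are passed in as arguments so the sketch does not depend on the extension module being installed.

```python
def make_scanner(scan_all, scan_n):
    """Combine two scanning functions into one with an optional count."""

    def scan(file, n=None):
        # No count given: read everything that is left in the file.
        if n is None:
            return scan_all(file)
        # Count given: read exactly n values (n == 0 is a valid count).
        return scan_n(file, n)

    return scan


# Intended use with the extension module (not executed here):
#     scan = make_scanner(FileScanner.FileScan, FileScanner.FileScanN)
#     everything = scan(open("data.txt"))
#     first_ten = scan(open("data.txt"), 10)
```

Testing against None rather than truthiness keeps a count of 0 from being mistaken for "no count given".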

FileScan was much more complex, as I had to deal with all the dynamic 
memory allocation. I probably took a more complex approach to this than 
I had to, but it was an exercise for me, being a newbie at C.
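
The C code appears to handle the unknown length by filling fixed-size buffers of BUFFERSIZE1 doubles and copying them into the final array at the end. The same idea can be sketched in Python 3 with the standard array module; collect_in_chunks is a hypothetical name, and this illustrates the strategy rather than translating the C.

```python
from array import array


def collect_in_chunks(values, chunk_size=1024):
    """Accumulate an unknown number of floats in fixed-size chunks,
    then copy them into one contiguous array of C doubles."""
    chunks = [array("d")]
    for value in values:
        # Start a new chunk once the current one is full.
        if len(chunks[-1]) == chunk_size:
            chunks.append(array("d"))
        chunks[-1].append(value)
    # Final pass: copy every chunk into a single contiguous result.
    result = array("d")
    for chunk in chunks:
        result.extend(chunk)
    return result
```

A simpler alternative in C would be one buffer that is doubled with realloc() whenever it fills up, which avoids the table of buffer pointers at the cost of occasional copying.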

I also decided not to specify a shape for the resulting array, always 
returning a rank-1 array, as that made the code easier, and you can 
always set A.shape afterward. This could be put in a Python wrapper as well.
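
For the common case of a fixed number of columns, setting A.shape = (-1, ncols) on the rank-1 result does the grouping. A plain Python 3 sketch of the same grouping, using lists instead of Numeric arrays (to_rows is a hypothetical name):

```python
def to_rows(values, ncols):
    """Group a flat sequence into rows of ncols items each."""
    if ncols <= 0 or len(values) % ncols != 0:
        raise ValueError(
            "cannot split %d values into rows of %d" % (len(values), ncols)
        )
    return [list(values[i:i + ncols]) for i in range(0, len(values), ncols)]
```

As with setting the shape, this fails when the number of values is not a multiple of the column count, which is a useful check that no line was short.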

It has the obvious limitation that it only does doubles. I'd like to add 
longs as well, but probably won't have a need for anything else. The way 
memory is these days, it seems just as easy to read everything as 
doubles, and convert afterward if you want.
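
One caveat when converting afterward: a C double represents every integer exactly only up to 2**53 in magnitude, so larger integers may already have been rounded when they were parsed. A small Python 3 sketch of a checked conversion (to_int_checked is a hypothetical name):

```python
EXACT_LIMIT = 2 ** 53  # beyond this, not every integer fits in a double


def to_int_checked(x):
    """Convert a float to int, refusing values that may have lost precision."""
    if x != int(x):
        raise ValueError("%r is not an integer value" % (x,))
    if abs(x) >= EXACT_LIMIT:
        raise OverflowError("%r may have been rounded when parsed" % (x,))
    return int(x)
```

For 32-bit longs this limit is never reached, so reading as doubles and converting is safe there.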

Here is a quick benchmark (see below) run with a file that is 63,000 
lines, with two comma-delimited numbers on each line. Run on a 1GHz P4 
under Linux.

Reading with read_array
it took 16.351712 seconds to read the file with read_array
Reading with Standard Python methods
it took 2.832078 seconds to read the file with standard Python methods
Reading with FileScan
it took 0.444431 seconds to read the file with FileScan
Reading with FileScanN
it took 0.407875 seconds to read the file with FileScanN

As you can see, read_array is painfully slow for this kind of thing, 
straight Python is OK, and FileScan is pretty darn fast.
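
To put numbers on that, the ratios can be worked out from the timings above (Python 3 sketch; the timings are copied from the benchmark output, and the speedup helper is a hypothetical name):

```python
# Timings in seconds, copied from the benchmark output above.
timings = {
    "read_array": 16.351712,
    "standard Python": 2.832078,
    "FileScan": 0.444431,
    "FileScanN": 0.407875,
}


def speedup(baseline, candidate):
    """How many times faster candidate is than baseline."""
    return timings[baseline] / timings[candidate]


for name in ("read_array", "standard Python"):
    print("FileScan is %.1fx faster than %s" % (speedup(name, "FileScan"), name))
```

That works out to FileScan being roughly 37 times faster than read_array and a bit over 6 times faster than the straight Python loop on this file.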

I've enclosed the C code and setup.py. If anyone wants to take a look 
and use it, or offer suggestions or bug fixes, that would be great.

In particular, I don't think I've structured the code very well, and 
there could be a memory leak, which I have not tested for carefully.

Tested only on Linux with Python 2.3.3, Numeric 23.1. If someone wants to 
port it to numarray, that would be great too.

-Chris

The benchmark:

import time
from Numeric import array, product
import FileScanner

def test6():
     """
     Testing various IO options
     """
     from scipy.io import array_import

     filename = "JunkBig.txt"
     file = open(filename)
     print "Reading with read_array"
     start = time.time()
     A = array_import.read_array(file,",")
     print "it took %f seconds to read the file with read_array" % (time.time() - start)
     file.close()

     file = open(filename)
     print "Reading with Standard Python methods"
     start = time.time()
     A = []
     for line in file:
         A.append( map ( float, line.strip().split(",") ) )
     A = array(A)
     print "it took %f seconds to read the file with standard Python methods" % (time.time() - start)
     file.close()

     file = open(filename)
     print "Reading with FileScan"
     start = time.time()
     A = FileScanner.FileScan(file)
     A.shape = (-1,2)
     print "it took %f seconds to read the file with FileScan" % (time.time() - start)
     file.close()

     file = open(filename)
     print "Reading with FileScanN"
     start = time.time()
     A = FileScanner.FileScanN(file, product(A.shape) )
     A.shape = (-1,2)
     print "it took %f seconds to read the file with FileScanN" % (time.time() - start)
     file.close()

-- 
Christopher Barker, Ph.D.
Oceanographer

NOAA/OR&R/HAZMAT         (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov

#include "Python.h"

#include 

// NOTE: these buffer sizes were picked very arbitrarily, and have
// remarkably little impact on performance on my system.

#define BUFFERSIZE1 1024
#define BUFFERSIZE2 64

int filescan(FILE *infile, int NNums, double *array){

    double N;
    int i, j;
    int c;

    for (i=0; i -1){
	 // not EOF, rewind the file one byte.
	 fseek(infile, -1, SEEK_CUR);
     }
    return(i);
}

static char doc_FileScanN[] =
"FileScanN(file, N)\n\n"
"Reads N values in the ascii file, and produces a Numeric vector of\n"
"length N full of Floats (C doubles).\n\n"
"Raises an exception if there are fewer than N numbers in the file.\n\n"
"All text in the file that is not part of a floating point number is\n"
"skipped over.\n\n"
"After reading N numbers, the file is left before the next non-whitespace\n"
"character in the file. This will often leave the file at the start of\n"
"the next line, after scanning a line full of numbers.\n"
;

static PyObject * FileScanner_FileScanN(PyObject *self, PyObject *args)
{

    PyFileObject *File;
    PyArrayObject *Array;
    int length;

    double *Data;
    int i;

    //printf("Starting\n");

    if (!PyArg_ParseTuple(args, "O!i", &PyFile_Type, &File, &length) ) {
	return NULL;
    }  

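    Data = calloc(length, sizeof(double) );
    if (Data == NULL) {
	// allocation failed: report it to Python instead of crashing later
	return PyErr_NoMemory();
    }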

    if ((i = filescan(PyFile_AsFile( (PyObject*)File ), length, Data)) < length){
	    PyErr_SetString (PyExc_ValueError,
                     "End of File reached before all numbers found");
	    free(Data);
	    return NULL;
    }

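    Array = (PyArrayObject *) PyArray_FromDims(1, &length, PyArray_DOUBLE);
    if (Array == NULL) {
	// array creation failed: release the scratch buffer before returning
	free(Data);
	return NULL;
    }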

    for (i = 0; i< length ; i++){
	*(double *)(Array->data + (i * Array->strides[0] ) ) = Data[i];
    }

    free(Data);

    return PyArray_Return(Array);
}

static char doc_FileScan[] =
"FileScan(file)\n\n"
"Reads all the values in the rest of the open ascii file: file, and produces\n"
"a Numeric vector full of Floats (C doubles).\n\n"
"All text in the file that is not part of a floating point number is\n"
"skipped over.\n\n"
;

static PyObject * FileScanner_FileScan(PyObject *self, PyObject *args)
{

    FILE *infile;
    char *DataPtr;
    PyFileObject *File;
    PyArrayObject *Array;
    double *(*P_array);
    double *(*Old_P_array);
    int i,j,k;
    int ScanCount = 0;
    int BufferSize = BUFFERSIZE2;
    int OldBufferSize = 0;
    int StartOfBuffer = 0;
    int NumBuffers = 0;

    if (!PyArg_ParseTuple(args, "O!", &PyFile_Type, &File) ) {
	return NULL;
    }  
    infile = PyFile_AsFile( (PyObject*)File );

    P_array = (double**) calloc(BufferSize, sizeof(void*) );
    while (1) {
	for (j=StartOfBuffer; j < BufferSize; j++){
	    P_array[j] = (double*) calloc(BUFFERSIZE1, sizeof(double));
	    NumBuffers++ ;
	    i = filescan(infile, BUFFERSIZE1, P_array[j]);
	    if (i) {
		ScanCount += i;
		//for (k=0; kdata;
    for (j=0; j= ScanCount) {
		break;
	    }
	    *(double *)DataPtr = P_array[j][k];
	    DataPtr +=  Array->strides[0];
	    i++;
	}
    }

    //free all the memory
    for (j=0; j
