Newbie - converting CSV files to arrays in NumPy - Matlab vs. NumPy comparison

oyekomova oyekomova at hotmail.com
Sun Jan 14 12:56:34 EST 2007


Thank you so much. Your solution works!  I greatly appreciate your
help.




sturlamolden wrote:
> oyekomova wrote:
>
> > Thanks for your note. I have 1 GB of RAM. Also, Matlab has no problem
> > reading the file into memory. I am just running Istvan's code that was
> > posted earlier.
>
> You have a CSV file of about 520 MiB, which is read into memory. Then
> you have a list of lists of floats, created by list comprehension, which
> is larger than 274 MiB. Additionally, you try to allocate a NumPy array
> slightly larger than 274 MiB. Now your process is already exceeding 1
> GiB, and you are probably running other processes too. That is why you
> run out of memory.
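>
> As a rough check of those figures (assuming the 6-column,
> 6-million-row test file generated by make_data below), the array
> size is just rows * cols * 8 bytes for float64:
>
> print 6e6 * 6 * 8 / 2.0**20   # prints 274.658203125, i.e. ~274 MiB
>
> The intermediate list of lists needs considerably more, because
> every float in it is a separate Python object.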
>
> So you have three options:
>
> 1. Buy more RAM.
>
> 2. Write a low-level CSV reader in C.
>
> 3. Read the data in chunks. That would mean something like this:
>
>
> import time, csv, random
> import numpy
>
> def make_data(rows=6E6, cols=6):
>     fp = open('data.txt', 'wt')
>     counter = range(cols)
>     for row in xrange( int(rows) ):
>         vals = map(str, [ random.random() for x in counter ] )
>         fp.write( '%s\n' % ','.join( vals ) )
>     fp.close()
>
> def read_test():
>     start  = time.clock()
>     arrlist = None
>     r = 0
>     CHUNK_SIZE_HINT = 4096 * 4 # seems to be good
>     fid = file('data.txt')
>     while 1:
>         chunk = fid.readlines(CHUNK_SIZE_HINT)
>         if not chunk: break
>         reader = csv.reader(chunk)
>         data = [ map(float, row) for row in reader ]
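>         # build a nested [newest_chunk_array, rest] pair (a linked list
>         # of per-chunk arrays) that the join step below unwinds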
>         arrlist = [ numpy.array(data,dtype=float), arrlist ]
>         r += arrlist[0].shape[0]
>         del data
>         del reader
>         del chunk
>     print 'Created list of chunks, elapsed time so far:', time.clock() - start
>     print 'Joining list...'
>     # walk the nested chunk list, copying the chunk arrays back-to-front
>     # into one preallocated result array
>     data = numpy.empty((r,arrlist[0].shape[1]),dtype=float)
>     r1 = r
>     while arrlist:
>         r0 = r1 - arrlist[0].shape[0]
>         data[r0:r1,:] = arrlist[0]
>         r1 = r0
>         del arrlist[0]
>         arrlist = arrlist[0]
>     print 'Elapsed time:', time.clock() - start
>
> make_data()
> read_test()
>
> This can process a CSV file of 6 million rows in about 150 seconds on
> my laptop. A CSV file of 1 million rows takes about 25 seconds.
>
> Just reading the 6-million-row CSV file (using fid.readlines()) takes
> about 40 seconds on my laptop. Python lists are not particularly
> efficient. You can probably reduce the time to ~60 seconds by writing a
> new CSV reader for NumPy arrays in a C extension.
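>
> If the row and column counts are known in advance (make_data above
> uses 6 million by 6), a minimal variation along these lines fills a
> preallocated array directly and skips both the chunk list and the
> join step. It reuses the time, csv and numpy imports above; the
> read_test_prealloc name and its defaults are only illustrative:
>
> def read_test_prealloc(filename='data.txt', rows=6000000, cols=6):
>     # assumes the file holds at most `rows` lines of `cols` floats each
>     start = time.clock()
>     data = numpy.empty((rows, cols), dtype=float)
>     r = 0
>     fid = file(filename)
>     while 1:
>         chunk = fid.readlines(4096 * 4)
>         if not chunk: break
>         for row in csv.reader(chunk):
>             data[r,:] = map(float, row)
>             r += 1
>     fid.close()
>     print 'Elapsed time:', time.clock() - start
>     return data[:r,:]   # trim in case the file was shorter than expected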



