[Numpy-discussion] Possible roadmap addendum: building better text file readers

Erin Sheldon erin.sheldon at gmail.com
Thu Feb 23 16:20:36 EST 2012


Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012:
> That's pretty good. It's almost certainly faster than pandas's
> csv-module+Cython approach (though I haven't run your code, so I
> can't say how much my hardware makes a difference), but that's not
> shocking:
> 
> In [1]: df = DataFrame(np.random.randn(350000, 32))
> 
> In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
> 
> In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
> CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
> Wall time: 7.04 s
> 
> I have to think that skipping the creation of 11.2 million Python
> string objects, each individually converted to float, accounts for
> most of the difference.
> 
> Note for reference (I'm skipping the first row, which has the column
> labels from above):
> 
> In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
> dtype=None, delimiter=',', skip_header=1)
> CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
> Wall time: 24.67 s
> 
> In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
> delimiter=',', skiprows=1)
> CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
> Wall time: 11.32 s
> 
> In this last case, for example, around 500 MB of RAM is used for an
> array that should only be about 80-90 MB. If you're a data scientist
> working in Python, this is _not good_.
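As a quick sanity check, the 80-90 MB figure follows directly from the array's shape and dtype:

```python
import numpy as np

# 350000 rows x 32 float64 columns, 8 bytes each
nbytes = 350000 * 32 * 8
print(nbytes / 1024**2)  # ~85.4 MiB

# numpy reports the same in-memory size
arr = np.empty((350000, 32), dtype=np.float64)
print(arr.nbytes == nbytes)  # True
```

So anything much beyond ~90 MB of resident memory after the read is overhead from the parsing path, not the data itself.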

It might be good to compare on recarrays, which are a bit more complex.
Can you try one of these .dat files?

    http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/

The dtype is

[('ra', 'f8'),
 ('dec', 'f8'),
 ('g1', 'f8'),
 ('g2', 'f8'),
 ('err', 'f8'),
 ('scinv', 'f8', 27)]
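A minimal sketch of reading such a file into that dtype with the current tools, assuming the rows are whitespace-delimited text with 32 numbers each (the linked files' exact layout isn't specified here, so this uses a small stand-in file named demo.dat):

```python
import numpy as np

# Record layout from the message: 5 scalar f8 fields plus a
# 27-element f8 array field, i.e. 32 floats per row.
dtype = np.dtype([('ra', 'f8'),
                  ('dec', 'f8'),
                  ('g1', 'f8'),
                  ('g2', 'f8'),
                  ('err', 'f8'),
                  ('scinv', 'f8', 27)])

# Make a small stand-in text file so the example is self-contained;
# demo.dat is hypothetical, not one of the linked files.
demo = np.arange(10 * 32, dtype='f8').reshape(10, 32)
np.savetxt('demo.dat', demo)

# Read the flat 2-D array, then pack it into the structured dtype.
raw = np.loadtxt('demo.dat')
arr = np.zeros(raw.shape[0], dtype=dtype)
for i, name in enumerate(['ra', 'dec', 'g1', 'g2', 'err']):
    arr[name] = raw[:, i]
arr['scinv'] = raw[:, 5:]
print(arr.shape, arr['scinv'].shape)  # (10,) (10, 27)
```

The two-step read-then-pack is needed because the stock text readers don't deal gracefully with subarray fields like scinv; a better reader could fill the structured array directly.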

-- 
Erin Scott Sheldon
Brookhaven National Laboratory


