[Numpy-discussion] Possible roadmap addendum: building better text file readers

Wes McKinney wesmckinn at gmail.com
Thu Feb 23 16:07:04 EST 2012


On Thu, Feb 23, 2012 at 3:55 PM, Erin Sheldon <erin.sheldon at gmail.com> wrote:
> Excerpts from Wes McKinney's message of Thu Feb 23 15:45:18 -0500 2012:
>> Reasonably wide CSV files with hundreds of thousands to millions of
>> rows. I have a separate interest in JSON handling but that is a
>> different kind of problem, and probably just a matter of forking
>> ultrajson and having it not produce Python-object-based data
>> structures.
>
> As a benchmark, recfile can read an uncached file with 350,000 lines and
> 32 columns in about 5 seconds.  File size ~220M
>
> -e
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory

That's pretty good. It's almost certainly faster than pandas's csv-module+Cython
approach (though I haven't run your code to get a read on how much my
hardware makes a difference), but that's not shocking at all:

In [1]: df = DataFrame(np.random.randn(350000, 32))

In [2]: df.to_csv('/home/wesm/tmp/foo.csv')

In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
Wall time: 7.04 s

I have to think that skipping the process of creating ~11.2 million Python
string objects and then individually converting each of them to float
would account for much of the difference.
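
To make that overhead concrete, here is a minimal pure-Python sketch of the
kind of parsing path being described (illustrative only, not pandas's actual
parser): the csv module hands back one str object per field, and each one is
converted to float separately:

import csv
import numpy as np

def read_csv_naive(path, skiprows=1):
    # Illustrative only: csv.reader yields one Python str per field, so a
    # 350,000 x 32 file allocates ~11.2 million temporary strings, each
    # converted to float one at a time before the ndarray is built.
    rows = []
    with open(path) as f:
        reader = csv.reader(f)
        for _ in range(skiprows):
            next(reader)
        for fields in reader:
            rows.append([float(x) for x in fields])
    return np.array(rows)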

Note for reference (I'm skipping the first row, which has the column
labels from above):

In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv', dtype=None, delimiter=',', skip_header=1)
CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
Wall time: 24.67 s

In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv', delimiter=',', skiprows=1)
CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
Wall time: 11.32 s

In this last case, for example, around 500 MB of RAM is taken up for an
array that should only be about 80-90 MB. If you're a data scientist
working in Python, this is _not good_.
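
For reference, a quick back-of-the-envelope check on what the parsed array
itself should occupy (assuming 32 float64 columns and ignoring the index
column that to_csv writes):

# Expected size of the final float64 array:
# 350,000 rows x 32 columns x 8 bytes per value
print(350000 * 32 * 8 / 1e6, "MB")  # -> 89.6 MB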

-W


