[Numpy-discussion] Possible roadmap addendum: building better text file readers
Erin Sheldon
erin.sheldon at gmail.com
Thu Feb 23 16:20:36 EST 2012
Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012:
> That's pretty good. That's faster than pandas's csv-module+Cython
> approach almost certainly (but I haven't run your code to get a read
> on how much my hardware makes a difference), but that's not shocking
> at all:
>
> In [1]: df = DataFrame(np.random.randn(350000, 32))
>
> In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
>
> In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
> CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
> Wall time: 7.04 s
>
> I have to think that skipping the process of creating 11.2 mm Python
> string objects, and then individually converting each of them to
> float, accounts for much of the difference.
>
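[The per-string-conversion point can be made concrete with a small sketch; the sizes and values here are illustrative, not taken from the benchmark, and note that recent NumPy versions parse loadtxt input in C, which is exactly the kind of single-call path being contrasted:]

```python
import io
import numpy as np

# Illustrative contrast: converting each token via its own Python str
# object vs. handing the whole buffer to one parser call.
line = ",".join(["1.2345678"] * 32)

# Per-token path: 32 Python string objects, 32 float() calls.
vals_slow = [float(tok) for tok in line.split(",")]

# Single-call path: one parser invocation over the buffer.
vals_fast = np.loadtxt(io.StringIO(line), delimiter=",")

print(len(vals_slow), vals_fast.shape)
```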
> Note for reference (i'm skipping the first row which has the column
> labels from above):
>
> In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
> dtype=None, delimiter=',', skip_header=1)
> CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
> Wall time: 24.67 s
>
> In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
> delimiter=',', skiprows=1)
> CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
> Wall time: 11.32 s
>
> In this last case for example, around 500 MB of RAM is taken up for an
> array that should only be about 80-90MB. If you're a data scientist
> working in Python, this is _not good_.
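[For reference, the timings above can be reproduced with a self-contained script along these lines; the array size is scaled down here and a temporary file stands in for the original /home/wesm path:]

```python
import os
import tempfile
import time

import numpy as np

# Scaled-down stand-in for the 350000 x 32 array used in the timings above.
rows, cols = 5000, 32
arr = np.random.randn(rows, cols)

path = os.path.join(tempfile.mkdtemp(), 'foo.csv')
np.savetxt(path, arr, delimiter=',')

t0 = time.perf_counter()
loaded = np.loadtxt(path, delimiter=',')
elapsed = time.perf_counter() - t0

print(loaded.shape, round(elapsed, 3))
```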
It might be good to compare on recarrays, which are a bit more complex.
Can you try one of these .dat files?
http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/
The dtype is
[('ra', 'f8'),
('dec', 'f8'),
('g1', 'f8'),
('g2', 'f8'),
('err', 'f8'),
('scinv', 'f8', 27)]
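[A sketch of reading records with this dtype, assuming the .dat files are flat binary records; if they are ASCII, np.genfromtxt with the same structured dtype would apply instead. A small synthetic file stands in for the linked data here:]

```python
import os
import tempfile

import numpy as np

# The dtype from the message above; 'scinv' is a 27-element sub-array field.
dtype = [('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'),
         ('err', 'f8'), ('scinv', 'f8', 27)]

# Synthetic stand-in for one of the .dat files.
n = 100
rec = np.zeros(n, dtype=dtype)
rec['ra'] = np.random.uniform(0, 360, n)
rec['dec'] = np.random.uniform(-90, 90, n)
rec['scinv'] = np.random.randn(n, 27)

path = os.path.join(tempfile.mkdtemp(), 'scat.dat')
rec.tofile(path)

# Read the flat binary records back in one shot.
back = np.fromfile(path, dtype=dtype)
print(back.shape, back['scinv'].shape)
```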
--
Erin Scott Sheldon
Brookhaven National Laboratory