[Numpy-discussion] numpy 1.7.0 release?

Tue Dec 6 18:52:33 EST 2011

On 06.12.2011, at 11:13PM, Wes McKinney wrote:

> This isn't the place for this discussion but we should start talking
> about building a *high performance* flat file loading solution with
> good column type inference and sensible defaults, etc. It's clear that
> loadtable is aiming for highest compatibility-- for example I can read
> a 2800x30 file in < 50 ms with the read_table / read_csv functions I
> wrote myself recently in Cython (compared with loadtable taking > 1s as
> quoted in the pull request), but I don't handle European decimal
> formats and lots of other sources of unruliness. I personally don't
> believe in sacrificing an order of magnitude of performance in the 90%
> case for the 10% case-- so maybe it makes sense to have two functions
> around: a superfast custom CSV reader for well-behaved data, and a
> slower, but highly flexible, function like loadtable to fall back on.
> I think R has two functions read.csv and read.csv2, where read.csv2 is
> capable of dealing with things like European decimal format.

Generally I agree, there's a good case for that, but I have to point out that 
the 1s time quoted there was with all the auto-detection extravaganza turned on. 
Actually, if I remember the discussions right, in its default, single-pass reading 
mode, loadtable even comes close to genfromtxt and loadtxt (on my machine 
150-200 ms for 2800 rows x 30 columns real*8). Originally loadtxt was intended 
to be that no-frills, fast reader, but in practice it is rarely faster than 
genfromtxt as the conversion from input strings to Python objects seems to 
be the bottleneck most of the time. Speeding that up using Cython certainly 
would be a big gain (and then there also is the request to make loadtxt 
memory-efficient, which I have failed to follow up on for weeks and weeks…)
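
For anyone who wants to check the loadtxt vs. genfromtxt comparison on their own 
machine, here is a minimal sketch. It assumes a synthetic 2800 x 30 table of 
float64 values written with savetxt; the helper names make_csv and time_reader 
are made up for this example, and loadtable is left out since it only exists in 
the pull request. Timings will depend heavily on the NumPy version and hardware, 
so treat them as a rough indication only.

```python
import io
import time

import numpy as np


def make_csv(rows=2800, cols=30, seed=0):
    """Build a comma-separated text table of random float64 values (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = rng.random((rows, cols))
    buf = io.StringIO()
    np.savetxt(buf, data, delimiter=",")
    return buf.getvalue(), data


def time_reader(reader, text, **kwargs):
    """Run one reader on an in-memory copy of the text and return (array, seconds)."""
    start = time.perf_counter()
    result = reader(io.StringIO(text), **kwargs)
    return result, time.perf_counter() - start


text, expected = make_csv()
loaded_txt, t_loadtxt = time_reader(np.loadtxt, text, delimiter=",")
loaded_gen, t_genfromtxt = time_reader(np.genfromtxt, text, delimiter=",")

print(f"loadtxt:    {t_loadtxt * 1e3:.1f} ms")
print(f"genfromtxt: {t_genfromtxt * 1e3:.1f} ms")
```

A single run like this is noisy; repeating the measurement a few times and 
taking the minimum gives a steadier number.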

Cheers,
						Derek