CSV performance

John Machin sjmachin at lexicon.net
Mon Apr 27 08:14:28 EDT 2009


On Apr 27, 9:22 pm, "psaff... at googlemail.com"
<psaff... at googlemail.com> wrote:
> I'm using the CSV library to process a large amount of data - 28
> files, each of 130MB. Just reading in the data from one file and
> filing it into very simple data structures (numpy arrays and a
> cstringio) takes around 10 seconds. If I just slurp one file into a
> string, it only takes about a second, so I/O is not the bottleneck. Is
> it really taking 9 seconds just to split the lines and set the
> variables?

I'll assume that that's a rhetorical question. Why are you assuming
that it's a problem with the csv module and not with the "filing it
into very simple data structures"? How long does it take just to read
the CSV file, i.e. without setting any of the variables? Have you run
your timing tests multiple times and discarded the first one or two
results?
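
Something along these lines would separate the parsing cost from the
cost of building your data structures (untested sketch; "data.csv" is a
stand-in for one of your files, and it assumes Python 2.x since you
mention cStringIO):

    import csv
    import time

    t0 = time.time()
    f = open("data.csv", "rb")    # csv wants binary mode on Python 2
    for row in csv.reader(f):
        pass                      # parse every row, build nothing
    f.close()
    print "csv parse only: %.2f seconds" % (time.time() - t0)

    t0 = time.time()
    f = open("data.csv", "rb")
    data = f.read()               # the plain slurp you compared against
    f.close()
    print "raw read: %.2f seconds" % (time.time() - t0)

If the first number is close to your 10 seconds, the csv module is the
cost; if not, look at the numpy/cStringIO filing code.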

> Is there some way I can improve the CSV performance?

I doubt it.

> Is there a way I
> can slurp the file into memory and read it like a file from there?

Of course. However, why do you think that the double handling will be
faster? Do you have 130MB of real memory free for the file image?
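
If you want to try it anyway, a minimal sketch (again assuming Python
2.x and a stand-in file name) looks like this; note that csv still has
to do exactly the same parsing work on the in-memory copy:

    import csv
    from cStringIO import StringIO

    f = open("data.csv", "rb")
    buf = StringIO(f.read())      # whole 130MB file now sits in memory
    f.close()

    for row in csv.reader(buf):
        pass                      # same parsing cost as reading from disk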

In order to get some meaningful advice, you will need to tell us:
version of Python, version of numpy, OS, amount of memory on the
machine, and what CPU; and supply a sample of a few lines of a typical
CSV file, plus your code.
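
A quick way to collect most of that in one go (sketch; the memory size
you'll have to report yourself):

    import sys, platform
    import numpy

    print "Python :", sys.version
    print "numpy  :", numpy.__version__
    print "OS     :", platform.platform()
    print "CPU    :", platform.processor()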

Cheers,
John


