CSV performance
Jorgen Grahn
grahn+nntp at snipabacken.se
Wed Apr 29 17:28:55 EDT 2009
On Mon, 27 Apr 2009 23:56:47 +0200, dean <deank at yahoo.com> wrote:
> On Mon, 27 Apr 2009 04:22:24 -0700 (PDT), psaffrey at googlemail.com wrote:
>
>> I'm using the CSV library to process a large amount of data - 28
>> files, each of 130MB. Just reading in the data from one file and
>> filing it into very simple data structures (numpy arrays and a
>> cStringIO) takes around 10 seconds. If I just slurp one file into a
>> string, it only takes about a second, so I/O is not the bottleneck. Is
>> it really taking 9 seconds just to split the lines and set the
>> variables?
>
> I assume you're reading a 130 MB text file in 1 second only after the OS
> already cached it, so you're not really measuring disk I/O at all.
>
> Parsing a 130 MB text file will take considerable time no matter what.
> Perhaps you should consider using a database instead of CSV.
Why would that be faster? (Assuming all data is actually read from the
database into data structures in the program, as in the text file
case.)
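A rough way to check would be to time both paths with nothing but the
standard library. This is only a sketch: the file names and the table
name "data" below are made up, and it assumes a SQLite file already
holding the same rows as the CSV file.

    import csv
    import sqlite3
    import time

    CSV_PATH = "data.csv"      # made-up names; both hold the same rows
    DB_PATH = "data.sqlite"

    def read_csv(path):
        # Let the csv module split every line into fields.
        with open(path, "rb") as f:
            return list(csv.reader(f))

    def read_db(path):
        # Pull every row out of a table called "data".
        conn = sqlite3.connect(path)
        try:
            return conn.execute("SELECT * FROM data").fetchall()
        finally:
            conn.close()

    for label, func, path in (("csv", read_csv, CSV_PATH),
                              ("sqlite", read_db, DB_PATH)):
        t0 = time.time()
        rows = func(path)
        print("%-7s %d rows in %.2f s" % (label, len(rows), time.time() - t0))

The numbers will of course depend on the data and on what the rows are
turned into afterwards.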
I am asking because people who like databases tend to overestimate the
time it takes to parse text. (And I guess people like me who prefer
text files tend to underestimate the usefulness of databases.)
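Either way, one way to find out how much of the original 10 seconds
really goes to parsing is to time the parse separately from everything
else. A minimal sketch, with a made-up file name (on Python 2 the csv
module wants the file opened in binary mode):

    import csv
    import time

    PATH = "bigfile.csv"    # hypothetical 130 MB input

    # Pass 1: raw slurp, to see what plain I/O (from the OS cache) costs.
    t0 = time.time()
    data = open(PATH, "rb").read()
    print("slurp: %.2f s for %d bytes" % (time.time() - t0, len(data)))

    # Pass 2: let csv split every line, but do nothing with the fields.
    t0 = time.time()
    n = sum(1 for row in csv.reader(open(PATH, "rb")))
    print("parse: %.2f s for %d rows" % (time.time() - t0, n))

Whatever is left of the 10 seconds after the second number would
presumably be spent filling the numpy arrays and the cStringIO, not in
the csv module itself.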
/Jorgen
--
// Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!