CSV performance
Tim Chase
python.list at tim.thechases.com
Mon Apr 27 10:51:56 EDT 2009
> I have tried running it just on the csv read:
...
> print "finished: %f.2" % (t1 - t0)
I presume you wanted "%.2f" here. :)
> $ ./largefilespeedtest.py
> working at file largefile.txt
> finished: 3.860000.2
So reading the file through the csv module takes just shy of 4
seconds, and you said that the pure file-read took about a
second, so that leaves about 3 seconds for the CSV processing
itself. In your code example in your 2nd post (with the timing
in it), it looks like the whole run took 15+ seconds, meaning
the csv code is a mere 1/5 of the runtime. I also notice that
you're reading the file once to find the length, and reading it
again to process it.
> The csv files are a chromosome name,
> a coordinate and a data point, like this:
>
> chr1 3754914 1.19828
> chr1 3754950 1.56557
> chr1 3754982 1.52371
Depending on the simplicity of the file-format (assuming nothing
like spaces/tabs in the chromosome name, which your dictionary
seems to indicate is the case), it may be faster to use .split()
to do the work:
    for line in open(afile):
        a, b, c = line.rstrip('\n\r').split()
The csv module does a lot of smart stuff that it looks like you
may not need.
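For what it's worth, here is a rough sketch of the split-based
approach run on the three sample lines quoted above. The helper
name parse_line and the in-memory sample list are made up for
illustration, and it assumes the fields are separated by plain
whitespace with nothing embedded in the chromosome name:

```python
# Hypothetical helper: parse one "chrom coord point" line with str.split().
def parse_line(line):
    # split() with no arguments splits on any run of whitespace
    # (spaces or tabs) and ignores the trailing newline.
    chrom, coord, point = line.split()
    return chrom, int(coord), float(point)

# The three sample lines from the quoted message.
sample = [
    "chr1 3754914 1.19828\n",
    "chr1 3754950 1.56557\n",
    "chr1 3754982 1.52371\n",
]
rows = [parse_line(line) for line in sample]
```

A line with more or fewer than three fields raises ValueError at
the unpacking step, which is one of the things the csv module
would otherwise let you handle more gracefully.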
However, you're still only cutting from that 3-second subset of
your total time. Focusing on the "filing it into very simple
data structures" will likely net you greater improvements. I
don't have much experience with numpy, so I can't offer much to
help. However, rather than reading the file twice, you might try
a general heuristic, assuming lines are no shorter than N
characters (they look like they're each 20 chars + a newline) and
then using "filesize/N" to estimate an adequately sized array.
Using stat() on a file to get its size will be a heckuva lot
faster than reading the whole file. I also don't know the
performance of cStringIO.StringIO() with lots of appending.
However, since each write is just a character, you might do well
to use the array module (unless numpy also has char-arrays) to
preallocate n chars just like you do with your ints and floats:
    chromeio[count] = chrommap[chrom]
    coords[count] = coord
    points[count] = point
    count += 1
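Putting the size estimate and the preallocation together, a
minimal sketch might look like the following. It uses the stdlib
array module rather than numpy, and the names load, MIN_LINE_LEN
and the temporary sample file are my own inventions. It assumes
every line is at least MIN_LINE_LEN bytes long, so the estimate
never comes out too small; a shorter line would eventually cause
an IndexError.

```python
import os
import tempfile
from array import array

# Assumption: every line is at least this many bytes, newline included.
# Dividing the file size by the shortest line length overestimates the
# number of rows, so the preallocated arrays are never too small.
MIN_LINE_LEN = 21

def load(path, chrommap):
    # os.path.getsize() asks the filesystem for the size (a stat call);
    # it does not read the file contents.
    n = os.path.getsize(path) // MIN_LINE_LEN + 1
    chroms = array('b', [0]) * n    # one small integer code per chromosome
    coords = array('l', [0]) * n
    points = array('d', [0.0]) * n
    count = 0
    with open(path) as f:
        for line in f:
            chrom, coord, point = line.split()
            chroms[count] = chrommap[chrom]
            coords[count] = int(coord)
            points[count] = float(point)
            count += 1
    # Trim the unused tail left over from the overestimate.
    return chroms[:count], coords[:count], points[:count]

# Small demonstration on a temporary file holding the quoted sample lines.
sample = "chr1 3754914 1.19828\nchr1 3754950 1.56557\nchr1 3754982 1.52371\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
chroms, coords, points = load(tmp.name, {"chr1": 1})
os.remove(tmp.name)
```

The file is read only once here; the price is a little wasted
memory when the lines are longer than the assumed minimum.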
Just a few ideas to try.
-tkc