CSV performance
psaffrey at googlemail.com
Mon Apr 27 09:15:38 EDT 2009
Thanks for your replies. Many apologies for not including the right
information first time around. More information is below.
I have tried running it just on the csv read:
import time
import csv
afile = "largefile.txt"
t0 = time.clock()
print "working at file", afile
reader = csv.reader(open(afile, "r"), delimiter="\t")
for row in reader:
    x,y,z = row
t1 = time.clock()
print "finished: %f.2" % (t1 - t0)
$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.860000.2
A tiny bit of background on the final application: this is biological
data from an affymetrix platform. The csv files are a chromosome name,
a coordinate and a data point, like this:
chr1 3754914 1.19828
chr1 3754950 1.56557
chr1 3754982 1.52371
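As a minimal sketch of how rows like these split into typed fields: it assumes the columns are tab-separated (as the delimiter="\t" argument in the script above suggests) and is written for Python 3 rather than the Python 2 used in this post. The names sample and parse_rows are hypothetical, introduced only for illustration.

```python
import csv
import io

# The three sample rows shown above, assumed to be tab-separated.
sample = (
    "chr1\t3754914\t1.19828\n"
    "chr1\t3754950\t1.56557\n"
    "chr1\t3754982\t1.52371\n"
)

def parse_rows(text):
    """Return (chromosome, coordinate, data point) tuples with numeric fields converted."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        if not row:
            # Skip blank lines, as the fuller script does.
            continue
        chrom, coord, point = row
        rows.append((chrom, int(coord), float(point)))
    return rows

records = parse_rows(sample)
print(records[0])
```

Each record keeps the chromosome name as a string, the coordinate as an integer and the data point as a float.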
In the "simple data structures" code below, I do some jiggery-pokery
with the chromosome names to save me storing the same string millions
of times.
import csv
import cStringIO
import numpy
import time
afile = "largefile.txt"
chrommap = {'chrY': 'y', 'chrX': 'x', 'chr13': 'c',
            'chr12': 'b', 'chr11': 'a', 'chr10': '0',
            'chr17': 'g', 'chr16': 'f', 'chr15': 'e',
            'chr14': 'd', 'chr19': 'i', 'chr18': 'h',
            'chrM': 'm', 'chr22': 'l', 'chr20': 'j',
            'chr21': 'k', 'chr7': '7', 'chr6': '6',
            'chr5': '5', 'chr4': '4', 'chr3': '3',
            'chr2': '2', 'chr1': '1', 'chr9': '9', 'chr8': '8'}
def getFileLength(fh):
    wholefile = fh.read()
    numlines = wholefile.count("\n")
    fh.seek(0)
    return numlines
count = 0
print "reading affy file", afile
fh = open(afile)
n = getFileLength(fh)
chromio = cStringIO.StringIO()
coords = numpy.zeros(n, dtype=int)
points = numpy.zeros(n)
t0 = time.clock()
reader = csv.reader(fh, delimiter="\t")
for row in reader:
    if not row:
        continue
    chrom, coord, point = row
    mappedc = chrommap[chrom]
    chromio.write(mappedc)
    coords[count] = coord
    points[count] = point
    count += 1
t1 = time.clock()
print "finished: %f.2" % (t1 - t0)
$ ./affyspeedtest.py
reading affy file largefile.txt
finished: 15.540000.2
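The chromosome-name trick in the script above can be sketched in isolation. This is only an illustration under stated assumptions: it is written for Python 3, it uses a plain string instead of cStringIO, the table is a three-entry subset of chrommap, and codemap, encode and decode are hypothetical helpers that do not appear in the original script.

```python
# Subset of the chrommap table from the script, enough for the illustration.
chrommap = {"chr1": "1", "chr2": "2", "chrX": "x"}

# Reverse table (hypothetical), for turning a stored code back into a name.
codemap = {code: name for name, code in chrommap.items()}

def encode(names):
    """Store one character per row instead of the full chromosome name."""
    return "".join(chrommap[name] for name in names)

def decode(codes, index):
    """Recover the chromosome name for the row at the given index."""
    return codemap[codes[index]]

codes = encode(["chr1", "chr1", "chrX", "chr2"])
print(codes)
print(decode(codes, 2))
```

Because every code is exactly one character, the position in the string lines up with the row index used for the coords and points arrays.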
Thanks again (tugs forelock),
Peter