CSV performance

psaffrey at googlemail.com
Mon Apr 27 15:15:38 CEST 2009

Thanks for your replies. Many apologies for not including the right
information first time around. More information is below.

I have tried running it just on the csv read:

import time
import csv

afile = "largefile.txt"

t0 = time.clock()

print "working at file", afile
reader = csv.reader(open(afile, "r"), delimiter="\t")
for row in reader:
	x,y,z = row

t1 = time.clock()

print "finished: %.2f" % (t1 - t0)

$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.86
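
For anyone who wants to rerun this without my data, here is a rough, self-contained sketch of the same measurement. It is not the script I ran: time_csv_read is a made-up helper name, the 1000-line temporary file is only a stand-in for largefile.txt, and it uses timeit.default_timer (a wall-clock timer) where my script uses time.clock(), so the numbers are not directly comparable.

```python
import csv
import os
import tempfile
from timeit import default_timer


def time_csv_read(path):
    """Return (row count, elapsed seconds) for a bare csv.reader pass."""
    start = default_timer()
    nrows = 0
    with open(path, "r") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            x, y, z = row  # same unpacking as in the test above
            nrows += 1
    return nrows, default_timer() - start


# Build a small stand-in file in the same three-column layout (assumed data).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    for i in range(1000):
        out.write("chr1\t%d\t%.5f\n" % (3754914 + i, 1.0 + i / 1000.0))

nrows, elapsed = time_csv_read(path)
os.remove(path)
```

Scaling the range up gives a feel for how the read time grows with file size on a given machine.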

A tiny bit of background on the final application: this is biological
data from an Affymetrix platform. Each line of the csv files holds a
chromosome name, a coordinate and a data point, like this:

chr1	3754914	1.19828
chr1	3754950	1.56557
chr1	3754982	1.52371
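
To make the layout concrete, a minimal sketch of how one such line breaks down into typed values. parse_row is a hypothetical helper used only for illustration and does not appear in my scripts.

```python
def parse_row(line):
    """Split one tab-separated line into (chromosome, coordinate, data point)."""
    chrom, coord, point = line.rstrip("\n").split("\t")
    # The coordinate is an integer position; the data point is a float.
    return chrom, int(coord), float(point)


sample = "chr1\t3754914\t1.19828\n"
parsed = parse_row(sample)
```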

In the "simple data structures" code below, I do some jiggery-pokery
with the chromosome names to save me storing the same string millions
of times.

import csv
import cStringIO
import numpy
import time

afile = "largefile.txt"

chrommap = {'chrY': 'y', 'chrX': 'x', 'chr13': 'c',
			'chr12': 'b', 'chr11': 'a', 'chr10': '0',
			'chr17': 'g', 'chr16': 'f', 'chr15': 'e',
			'chr14': 'd', 'chr19': 'i', 'chr18': 'h',
			'chrM': 'm', 'chr22': 'l', 'chr20': 'j',
			'chr21': 'k', 'chr7': '7', 'chr6': '6',
			'chr5': '5', 'chr4': '4', 'chr3': '3',
			'chr2': '2', 'chr1': '1', 'chr9': '9', 'chr8': '8'}

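def getFileLength(fh):
	# Count the lines by reading the whole file once
	wholefile = fh.read()
	numlines = wholefile.count("\n")
	# Rewind so the csv reader below starts from the first line
	fh.seek(0)
	return numlines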

count = 0
print "reading affy file", afile
fh = open(afile)
n = getFileLength(fh)
chromio = cStringIO.StringIO()
coords = numpy.zeros(n, dtype=int)
points = numpy.zeros(n)

t0 = time.clock()
reader = csv.reader(fh, delimiter="\t")
for row in reader:
	if not row:
		# skip blank lines
		continue
	chrom, coord, point = row
	mappedc = chrommap[chrom]
	coords[count] = coord
	points[count] = point
	count += 1
t1 = time.clock()

print "finished: %.2f" % (t1 - t0)

$ ./affyspeedtest.py
reading affy file largefile.txt
finished: 15.54
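
The chromosome-name trick mentioned above, sketched in isolation: one character per row goes into a buffer, and the full name is recovered through a reverse lookup. This is an illustration under assumptions, not the code I timed. The mapping is cut down to three entries, codemap and chrom_at are made-up names, and io.BytesIO stands in for cStringIO so the sketch also runs on newer Pythons.

```python
import io

# Shortened, illustrative mapping; the full table is in the script above.
chrommap = {"chr1": "1", "chr2": "2", "chrX": "x"}
# Reverse lookup: one-character code back to chromosome name.
codemap = dict((code, name) for name, code in chrommap.items())

rows = ["chr1", "chr1", "chrX", "chr2"]

# Store one byte per row instead of a full string per row.
buf = io.BytesIO()
for chrom in rows:
    buf.write(chrommap[chrom].encode("ascii"))

codes = buf.getvalue().decode("ascii")


def chrom_at(i):
    """Recover the chromosome name for row i from its one-character code."""
    return codemap[codes[i]]
```

Row i of the code string then lines up with coords[i] and points[i] in the numpy arrays.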

Thanks again (tugs forelock),