Hi all,<div><br></div><div>New to the list and fairly new to pypy. First of all, congrats on the new 1.6 release-- the growing support for numpy is very exciting (go, fight, win, take state!).</div><div><br></div><div>So I snagged the 1.6 release to test if it would be faster on the kind of code I often write: Bioinformatics. In this snippet, the point is to check the mappability of the genome-- if a particular substring appears more than once in the genome, the region is called unmappable.</div>
<div><br></div><div>Machine Specs:</div><div>64bit Ubuntu 11.04</div><div>119048 CPython 2.7.1 pystones/second </div><div>416667 pypy1.6 pystones/second</div><div><div><br></div></div><div>The CPython version of the following code takes a bit more than a minute to run on the 21st chromosome of the human reference genome, but the pypy version has been going for 27+ minutes and hasn't yet finished the first step of loading the genome as a dict of strings. </div>
<div><br></div><div>Am I using some construct that's particularly difficult for pypy? Am I missing something?</div><div><br></div><div>hg18 chrom 21 is available at <a href="http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr21.fa.gz">http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr21.fa.gz</a></div>
<div><br></div><div><br></div><div><div><font face="'courier new', monospace">import sys</font></div><div><div><font face="'courier new', monospace"><br></font></div><div><font face="'courier new', monospace">def slide_dna(dna, windowsize):</font></div>
<div><font face="'courier new', monospace"> for i in xrange(len(dna) - windowsize):</font></div><div><font face="'courier new', monospace"> slice = dna[i:i+windowsize]</font></div><div><font face="'courier new', monospace"> if 'N' not in slice:</font></div>
<div><font face="'courier new', monospace"> yield slice</font></div></div><div><font face="'courier new', monospace"><br></font></div><div><span style="font-family: 'courier new', monospace; ">fasta_file = sys.argv[1] # should be *.fa</span></div>
<div><font face="'courier new', monospace">print 'loading dna from', fasta_file</font></div><div><font face="'courier new', monospace">chroms = {}</font></div><div><font face="'courier new', monospace">dna = None</font></div>
<div><font face="'courier new', monospace">for l in open(fasta_file):</font></div><div><font face="'courier new', monospace"> if l.startswith('>'): # new chromosome</font></div><div><font face="'courier new', monospace"> if dna is not None:</font></div>
<div><font face="'courier new', monospace"> chroms[chrom] = dna</font></div><div><font face="'courier new', monospace"> chrom = l.strip().replace('>', '')</font></div>
<div>
<font face="'courier new', monospace"> dna = ''</font></div><div><font face="'courier new', monospace"> else:</font></div><div><font face="'courier new', monospace"> dna += l.rstrip()</font></div>
<div><font face="'courier new', monospace">if dna is not None:</font></div><div><font face="'courier new', monospace"> chroms[chrom] = dna</font></div><div><font face="'courier new', monospace"><br>
</font></div><div><font face="'courier new', monospace">for length in [15]:#, 25, 35, 45, 55, 65, 75]:</font></div><div><font face="'courier new', monospace"> print 'now on', length</font></div>
<div><font face="'courier new', monospace"> mappable = 0</font></div><div><font face="'courier new', monospace"> repeat = 0</font></div><div><font face="'courier new', monospace"> s = {}</font></div>
<div><font face="'courier new', monospace"> for dna in chroms.itervalues():</font></div><div><font face="'courier new', monospace"> for slice in slide_dna(dna, length):</font></div><div><font face="'courier new', monospace"> try:</font></div>
<div><font face="'courier new', monospace"> s[slice] += 1</font></div><div><font face="'courier new', monospace"> except KeyError:</font></div><div><font face="'courier new', monospace"> s[slice] = 1</font></div>
<div><font face="'courier new', monospace"> print 'built, now counting'</font></div><div><font face="'courier new', monospace"> for dna in chroms.itervalues():</font></div><div><font face="'courier new', monospace"> for slice in slide_dna(dna, length):</font></div>
<div><font face="'courier new', monospace"> if s[slice] == 1:</font></div><div><font face="'courier new', monospace"> mappable += 1</font></div><div><font face="'courier new', monospace"> else:</font></div>
<div><font face="'courier new', monospace"> repeat += 1</font></div><div><font face="'courier new', monospace"> print 'for substring length %s, mappable: %s, repeat: %s' % (length, mappable, repeat)</font></div>
<div><font face="'courier new', monospace"><br></font></div></div><div><br></div><div>--<br>Jake Biesinger<br>Graduate Student<br>Xie Lab, UC Irvine<br><br>
</div>