[pypy-dev] pypy1.6 slow on string-heavy ops

Jacob Biesinger jake.biesinger at gmail.com
Thu Aug 18 23:35:12 CEST 2011


Hi all,

I'm new to the list and fairly new to PyPy.  First of all, congrats on the new
1.6 release; the growing support for numpy is very exciting (go, fight,
win, take state!).

I grabbed the 1.6 release to test whether it would be faster on the kind of
code I often write: bioinformatics.  The point of this snippet is to check
the mappability of the genome: if a particular fixed-length substring appears
more than once in the genome, the region is called unmappable.
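
As a toy illustration of that definition (the function name `count_windows`
and the sample sequence are made up for this example and are not part of the
script further down), a minimal sketch:

```python
from collections import Counter

def count_windows(dna, windowsize):
    """Count every window of length `windowsize` that contains no 'N'."""
    counts = Counter()
    for i in range(len(dna) - windowsize + 1):
        window = dna[i:i + windowsize]
        if 'N' not in window:
            counts[window] += 1
    return counts

# Hypothetical 8-base sequence; its 3-base windows are
# ACG, CGT, GTA, TAC, ACG, CGA.
counts = count_windows('ACGTACGA', 3)
mappable = sorted(w for w, n in counts.items() if n == 1)  # seen exactly once
repeated = sorted(w for w, n in counts.items() if n > 1)   # seen more than once
```

Here 'ACG' occurs twice, so both of its positions count as unmappable, while
the other four windows are mappable.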

Machine Specs:
64bit Ubuntu 11.04
119048 CPython 2.7.1 pystones/second
416667 pypy1.6 pystones/second

The CPython version of the following code takes a bit more than a minute to
run on the 21st chromosome of the human reference genome, but the pypy
version has been going for 27+ minutes and hasn't yet finished the first
step of loading the genome as a dict of strings.

Am I using some construct that's particularly difficult for pypy?  Am I
missing something?

hg18 chrom 21 is available at
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/chr21.fa.gz


import sys

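def slide_dna(dna, windowsize):
    # yield every window of length windowsize that contains no 'N'
    # (+ 1 so the final window, ending at the last base, is included)
    for i in xrange(len(dna) - windowsize + 1):
        window = dna[i:i+windowsize]  # avoids shadowing the builtin slice
        if 'N' not in window:
            yield window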

fasta_file = sys.argv[1]  # should be *.fa
print 'loading dna from', fasta_file
chroms = {}
dna = None
for l in open(fasta_file):
    if l.startswith('>'):  # new chromosome
        if dna is not None:
            chroms[chrom] = dna
        chrom = l.strip().replace('>', '')
        dna = ''
    else:
        dna += l.rstrip()
if dna is not None:
    chroms[chrom] = dna

for length in [15]:  # other lengths to try: 25, 35, 45, 55, 65, 75
    print 'now on', length
    mappable = 0
    repeat = 0
    s = {}
    for dna in chroms.itervalues():
        for slice in slide_dna(dna, length):
            try:
                s[slice] += 1
            except KeyError:
                s[slice] = 1
    print 'built, now counting'
    for dna in chroms.itervalues():
        for slice in slide_dna(dna, length):
            if s[slice] == 1:
                mappable += 1
            else:
                repeat += 1
    print 'for substring length %s, mappable: %s, repeat: %s' % (
        length, mappable, repeat)
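
A possible culprit, stated as an assumption rather than a measured result:
the loading loop grows `dna` with `+=` one line at a time.  CPython can often
extend such a string in place, while an interpreter without that optimization
copies the whole string on every append, which is quadratic in the size of
the chromosome.  A minimal sketch of the usual alternative, collecting the
pieces in a list and joining once per chromosome (`load_fasta` is a
hypothetical helper name, not something from the script above):

```python
def load_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {name: sequence}."""
    chroms = {}
    chrom = None
    pieces = []  # sequence fragments for the current chromosome
    for line in lines:
        if line.startswith('>'):  # header line starts a new chromosome
            if chrom is not None:
                chroms[chrom] = ''.join(pieces)
            chrom = line.strip().replace('>', '')
            pieces = []
        else:
            # list append is cheap; the single join happens once per chromosome
            pieces.append(line.rstrip())
    if chrom is not None:
        chroms[chrom] = ''.join(pieces)
    return chroms

# Tiny made-up input standing in for the lines of a *.fa file.
example = ['>chr21\n', 'ACGT\n', 'NNAC\n', '>chrX\n', 'GG\n']
chroms = load_fasta(example)
```

With a real file this would be called as `load_fasta(open(fasta_file))`; the
rest of the script would stay the same.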


--
Jake Biesinger
Graduate Student
Xie Lab, UC Irvine