<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Vincent Davis wrote:
<blockquote
cite="mid:77e831101003092044s6e93e7aav601270f843d1d4a1@mail.gmail.com"
type="cite">I have never used the difflib or similar and have a few
questions.
<div>I am working with DNA sequences of length 25. I have a list of
230,000 and need to look for each sequence in the entire genome
(toxoplasma parasite). I am not sure how large the genome is, but it is more
than 230,000 sequences.</div>
<div>There are programs that do this really fast, and they even do
partial matches, but not quite what I need. So I am looking to build a
custom solution.</div>
<div>I need to look for each of my sequences of 25 characters, for example (<span
class="Apple-style-span" style="font-family: Verdana;">AGCCTCCCATGATTGAACAGATCAT).</span></div>
<div><span class="Apple-style-span" style="font-family: Verdana;">The
genome is formatted as a continuous string
(CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)</span></div>
<div><font class="Apple-style-span" face="Verdana"><br>
</font></div>
<div><font class="Apple-style-span" face="Verdana">I don't care where
or how many times, only whether it exists. This is simple I think, <span
class="Apple-style-span"
style="font-family: sans-serif; font-size: 16px;"><tt
class="descclassname"
style="padding: 0px 1px; background-color: transparent; font-size: 0.95em;">str.</tt><tt
class="descname"
style="padding: 0px 1px; background-color: transparent;"><span
class="Apple-style-span" style="font-size: small;">find</span></tt><tt
class="descname"
style="padding: 0px 1px; background-color: transparent;"><font
class="Apple-style-span" face="Verdana"><span class="Apple-style-span"
style="font-size: small;">(AGCCTCCCATGATTGAACAGATCAT)</span></font></tt></span></font></div>
<div><font class="Apple-style-span" face="Verdana"><br>
</font></div>
<div><font class="Apple-style-span" face="Verdana">But I also want to
find a close match, defined as wrong at only 1 location, and I want to
record the location. I am not sure how to do this. The only thing I can
think of is using a wildcard and performing the search with a wildcard
in each position, i.e. 25 times.</font></div>
<div><font class="Apple-style-span" face="Verdana">For example</font></div>
<div><font class="Apple-style-span" face="Verdana">AGCCTCCCATGATTGAACAGATCAT</font></div>
<div><font class="Apple-style-span" face="Verdana">AGCCTCCCATGATAGAACAGATCAT</font></div>
<div><font class="Apple-style-span" face="Verdana">close match with a
mismatch at position 13</font></div>
</blockquote>
<br>
Also:<br>
<br>
sequence = 'AGGCTTGCGGAGCCTGAGGGCGGAG'<br>
seqList = ['*' + sequence[0:i] + '?' + sequence[i+1:] + '*' for i in
range(len(sequence))]<br>
import fnmatch<br>
<br>
genome =
'CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG........'<br>
if any(fnmatch.fnmatch(genome, i) for i in seqList):<br>
&nbsp;&nbsp;&nbsp; print('It matches')<br>
<br>
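This might be the better approach if the sequence is fixed and the genome
changes inside a loop, since the list of patterns only has to be built
once. Note that '?' matches any single character, so an exact match is
reported as well.<br>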
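<br>
If the position of the mismatch also has to be recorded, the same
one-wildcard-per-position idea can be written with the re module, which
gives access to where the match starts. This is only a rough sketch:
find_close_match is a made-up helper name, it checks for an exact hit
first, and all indices are 0-based.<br>
<pre>
```python
import re


def find_close_match(sequence, genome):
    """Return (offset, mismatch_index), or None if nothing matches.

    Hypothetical helper, not part of any library. mismatch_index is
    None for an exact match. All indices are 0-based.
    """
    # Exact match first, using plain substring search.
    offset = genome.find(sequence)
    if offset != -1:
        return offset, None

    # Otherwise try a wildcard at each position in turn.
    for i in range(len(sequence)):
        # '.' plays the role of fnmatch's '?': any single character.
        pattern = re.escape(sequence[:i]) + "." + re.escape(sequence[i + 1:])
        match = re.search(pattern, genome)
        if match:
            return match.start(), i
    return None
```
</pre>
With the example pair from the question embedded in a short test string,
find_close_match('AGCCTCCCATGATTGAACAGATCAT',
'CATGGG' + 'AGCCTCCCATGATAGAACAGATCAT' + 'GGAG') returns (6, 13), i.e.
the hit starts at offset 6 and differs at index 13. For 230,000
sequences against a whole genome this will probably be slow, so treat it
as a starting point only.<br>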
<br>
HTH<br>
<br>
<br>
<br>
</body>
</html>