<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Vincent Davis wrote:

<blockquote

 cite="mid:77e831101003092044s6e93e7aav601270f843d1d4a1@mail.gmail.com"

 type="cite">I have never used difflib or anything similar and have a few
questions.

  <div>I am working with DNA sequences of length 25. I have a list of
230,000 and need to look for each sequence in the entire genome
(toxoplasma parasite). I am not sure how large the genome is, but it is
more than 230,000 sequences.</div>

  <div>There are programs that do this, and really fast, and they even do
partial matches, but not quite what I need. So I am looking to build a
custom solution.</div>

  <div>I need to look for each of my sequences of 25 characters example(<span

 class="Apple-style-span" style="font-family: Verdana;">AGCCTCCCATGATTGAACAGATCAT).</span></div>

  <div><span class="Apple-style-span" style="font-family: Verdana;">The

genome is formatted as a continuous string

(CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)</span></div>

  <div><font class="Apple-style-span" face="Verdana"><br>

  </font></div>

  <div><font class="Apple-style-span" face="Verdana">I don't care where
or how many times, only whether it exists. This is simple, I think:&nbsp;<span

 class="Apple-style-span"

 style="font-family: sans-serif; font-size: 16px;"><tt

 class="descclassname"

 style="padding: 0px 1px; background-color: transparent; font-size: 0.95em;">str.</tt><tt

 class="descname"

 style="padding: 0px 1px; background-color: transparent;"><span

 class="Apple-style-span" style="font-size: small;">find</span></tt><tt

 class="descname"

 style="padding: 0px 1px; background-color: transparent;"><font

 class="Apple-style-span" face="Verdana"><span class="Apple-style-span"

 style="font-size: small;">(AGCCTCCCATGATTGAACAGATCAT)</span></font></tt></span></font></div>

  <div><font class="Apple-style-span" face="Verdana"><br>

  </font></div>

  <div><font class="Apple-style-span" face="Verdana">But I also want to
find a close match, defined as wrong at only 1 location, and I want to
record the location. I am not sure how to do this. The only thing I can
think of is using a wildcard and performing the search with a wildcard
in each position, i.e. 25 times.</font></div>

  <div><font class="Apple-style-span" face="Verdana">For example</font></div>

  <div><font class="Apple-style-span" face="Verdana">AGCCTCCCATGATTGAACAGATCAT</font></div>

  <div><font class="Apple-style-span" face="Verdana">AGCCTCCCATGATAGAACAGATCAT</font></div>

  <div><font class="Apple-style-span" face="Verdana">close match with a
mismatch at position 13</font></div>

</blockquote>

<br>

Also:<br>

<br>

import fnmatch<br>
<br>
sequence = 'AGGCTTGCGGAGCCTGAGGGCGGAG'<br>
# one pattern per position, with '?' standing in for any single character<br>
seqList = ['*' + sequence[0:i] + '?' + sequence[i+1:] + '*' for i in
range(len(sequence))]<br>
<br>
genome =
'CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG........'<br>
if any(fnmatch.fnmatch(genome, i) for i in seqList):<br>
&nbsp;&nbsp;&nbsp; print('It matches')<br>

<br>

Which might be better if the sequence is fixed and the genome changes

inside a loop.<br>
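<br>
For that fixed-sequence case, here is a minimal sketch that uses the re
module instead of fnmatch (my substitution, not something from the
question), so the wildcard patterns are built and compiled once and then
reused for every genome. The helper name compile_one_wildcard and the
short test strings are made up for illustration.<br>
<pre>
```python
import re


def compile_one_wildcard(sequence):
    """Compile one regex that accepts `sequence` with any single position wildcarded."""
    alternatives = [
        # '.' stands in for any single character at position i
        re.escape(sequence[:i]) + "." + re.escape(sequence[i + 1:])
        for i in range(len(sequence))
    ]
    return re.compile("|".join(alternatives))


# Hypothetical data: a short sequence and three toy "genomes".
pattern = compile_one_wildcard("ACGT")
genomes = ["TTACGTTT", "GGAAGTCC", "CCCCCCCC"]

# The compiled pattern is reused on every pass through the loop.
hits = [g for g in genomes if pattern.search(g)]
print(hits)
```
</pre>
An exact occurrence also counts as a hit here, because the wildcard can
match the original character.<br>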

<br>
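The wildcard patterns only tell you that something matched, not which
position differs, and the question asks for that. A plain sliding-window
comparison can record it. This is only a sketch: find_matches is a name
I made up, positions are assumed to be zero-based, and pure Python like
this will be slow for 230,000 sequences against a whole genome.<br>
<pre>
```python
def find_matches(probe, genome):
    """Return (exact, close) for `probe` in `genome`.

    exact: list of start offsets where the probe matches exactly.
    close: list of (start offset, mismatch index) where exactly one
           character differs; both numbers are zero-based.
    """
    exact, close = [], []
    n = len(probe)
    for start in range(len(genome) - n + 1):
        # indices within the probe where this window differs
        diffs = [i for i in range(n) if probe[i] != genome[start + i]]
        if not diffs:
            exact.append(start)
        elif len(diffs) == 1:
            close.append((start, diffs[0]))
    return exact, close


# The two sequences from the question, the second used as a tiny "genome".
exact, close = find_matches("AGCCTCCCATGATTGAACAGATCAT",
                            "AGCCTCCCATGATAGAACAGATCAT")
print(exact)   # []
print(close)   # [(0, 13)]
```
</pre>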

HTH<br>

<br>

<br>

<br>

</body>

</html>