[Tutor] Difflib comparing string sequences

Vincent Davis vincent at vincentdavis.net
Wed Mar 10 05:44:38 CET 2010


I have never used difflib or anything similar and have a few questions.
I am working with DNA sequences of length 25. I have a list of 230,000 of
them and need to look for each one in the entire genome (Toxoplasma
parasite). I am not sure how large the genome is, but it is more than
230,000 sequences.
There are programs that do this, and really fast, and they even do partial
matches, but not quite what I need. So I am looking to build a custom
solution.
I need to look for each of my sequences of 25 characters, for example
(AGCCTCCCATGATTGAACAGATCAT).
The genome is formatted as a continuous string
(CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)

I don't care where or how many times, only whether it exists. This is
simple, I think: str.find("AGCCTCCCATGATTGAACAGATCAT")
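A minimal sketch of the exact-match check, assuming the whole genome fits
in memory as one Python string. The function name probes_present and the
parameter k are made up for illustration. Instead of calling find() once
per sequence, it walks the genome a single time and looks each 25-character
window up in a set built from the list of sequences:

```python
def probes_present(genome, probes, k=25):
    """Return the set of probes that occur at least once in genome.

    Assumes every probe has length k and that genome is one long string.
    """
    wanted = set(probes)   # set lookup is fast even with 230,000 probes
    found = set()
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        if window in wanted:
            found.add(window)
    return found


# Toy data: the genome fragment from the message, one probe cut out of it,
# and the example probe, which does not occur in this fragment.
genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG"
present_probe = genome[4:29]
absent_probe = "AGCCTCCCATGATTGAACAGATCAT"
found = probes_present(genome, [present_probe, absent_probe])
```

For a single sequence, "probe in genome" (or genome.find(probe) != -1)
gives the same yes/no answer. The set-based scan is only meant to avoid
230,000 separate passes over the genome.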

But I also want to find a close match, defined as wrong at only 1 location,
and I want to record the location. I am not sure how to do this. The only
thing I can think of is using a wildcard and performing the search with a
wildcard in each position, i.e. 25 times.
For example:
AGCCTCCCATGATTGAACAGATCAT
AGCCTCCCATGATAGAACAGATCAT
close match with a mismatch at position 13 (counting from 0)
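One way the close-match search could be sketched, as an alternative to 25
wildcard searches: slide a 25-character window along the genome and count
the positions where it differs from the sequence. The names
mismatch_positions and find_near_matches are hypothetical, and this is a
plain brute-force comparison, so it is likely to be slow in pure Python for
230,000 sequences against a full genome.

```python
def mismatch_positions(a, b):
    """Return the 0-based indexes where equal-length strings a and b differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def find_near_matches(genome, probe, max_mismatches=1):
    """Return (start, mismatch_indexes) for every window of genome that
    differs from probe in at most max_mismatches positions.

    An exact match is reported with an empty mismatch list.
    """
    k = len(probe)
    hits = []
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        diffs = mismatch_positions(probe, window)
        if len(diffs) <= max_mismatches:
            hits.append((start, diffs))
    return hits


# The two sequences from the example above; the second one is embedded in
# a small made-up genome with two extra bases on each side.
probe = "AGCCTCCCATGATTGAACAGATCAT"
close = "AGCCTCCCATGATAGAACAGATCAT"
toy_genome = "GG" + close + "CC"
hits = find_near_matches(toy_genome, probe)
# hits == [(2, [13])]: the window starts at index 2 of toy_genome and
# differs from the probe at index 13.
```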


  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>