[Tutor] Difflib comparing string sequences

Ricardo Aráoz ricaraoz at gmail.com
Wed Mar 10 23:57:53 CET 2010


Vincent Davis wrote:
> I have never used difflib or anything similar and have a few questions.
> I am working with DNA sequences of length 25. I have a list of 230,000
> and need to look for each sequence in the entire genome (toxoplasma
> parasite). I am not sure how large the genome is, but more than 230,000
> sequences.
> There are programs that do this really fast, and they even do partial
> matches, but not quite what I need. So I am looking to build a custom
> solution.
> I need to look for each of my sequences of 25 characters, for
> example (AGCCTCCCATGATTGAACAGATCAT).
> The genome is formatted as a continuous string
> (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)
>
> I don't care where or how many times, only whether it exists. This is
> simple, I think: str.find(AGCCTCCCATGATTGAACAGATCAT)
>
> But I also want to find a close match, defined as wrong at only 1
> location, and I want to record the location. I am not sure how to do
> this. The only thing I can think of is using a wildcard and performing
> the search with a wildcard in each position, i.e. 25 times.
> For example
> AGCCTCCCATGATTGAACAGATCAT
> AGCCTCCCATGATAGAACAGATCAT
> is a close match with a mismatch at position 13

Also:

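import fnmatch

sequence = 'AGGCTTGCGGAGCCTGAGGGCGGAG'
# One glob pattern per position: '?' matches any single character at
# index i, and the surrounding '*' let the pattern sit anywhere in the genome.
seqList = ['*' + sequence[:i] + '?' + sequence[i+1:] + '*'
           for i in range(len(sequence))]

genome = 'CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG........'
# '?' also matches the original character, so an exact match counts too.
if any(fnmatch.fnmatch(genome, pattern) for pattern in seqList):
    print('It matches')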

This might be the better option if the sequence is fixed and the genome
changes inside a loop, since the pattern list only has to be built once.
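
If you also need to record where the mismatch is, here is a minimal
sketch of the same wildcard-per-position idea using the re module
instead of fnmatch (my substitution: fnmatch only tells you whether a
pattern matches, not where). The names build_patterns and close_matches
are made up for illustration, positions are 0-based, and nothing here is
tuned for a whole genome, so treat it as a starting point.

```python
import re


def build_patterns(sequence):
    """Return (position, compiled regex) pairs, one wildcard per position.

    Hypothetical helper: build the list once and reuse it for every genome.
    """
    patterns = []
    for i in range(len(sequence)):
        # '.' replaces the character at index i. Wrapping the body in a
        # lookahead lets finditer report overlapping hits as well.
        body = re.escape(sequence[:i]) + '.' + re.escape(sequence[i + 1:])
        patterns.append((i, re.compile('(?=(' + body + '))')))
    return patterns


def close_matches(sequence, patterns, genome):
    """Return (genome_offset, mismatch_position) pairs sorted by offset.

    mismatch_position is None when the hit is an exact match.
    """
    hits = set()
    for i, pattern in patterns:
        for m in pattern.finditer(genome):
            found = m.group(1)
            hits.add((m.start(), None if found == sequence else i))
    return sorted(hits, key=lambda hit: hit[0])


sequence = 'AGCCTCCCATGATTGAACAGATCAT'
patterns = build_patterns(sequence)  # built once, reused per genome

# Toy "genome": the close match from the question, padded on both sides.
genome = 'GG' + 'AGCCTCCCATGATAGAACAGATCAT' + 'CC'
print(close_matches(sequence, patterns, genome))  # [(2, 13)]
```

Each hit gives the offset in the genome first and the index of the
differing character second, so (2, 13) means the sequence starts at
genome[2] and differs at sequence[13].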

HTH


