[Tutor] Difflib comparing string sequences

Vincent Davis vincent at vincentdavis.net
Thu Mar 11 01:33:24 CET 2010


@Ricardo Aráoz
Thanks for your response. Before I saw it I had posted the
question on Stack Overflow; see the link below. I like your solution better than
the re solution posted there.
It looks like this task may take longer than I thought. I guess the re solution
might take more than 10 days. The search string is 80 million characters
long, but obviously I can stop once I find a match and then just move on to
the next sequence.
You might want to post this answer on Stack Overflow. I like the more
interactive form of a mailing list, but there seems to be a very broad
audience on Stack Overflow.

Thanks again,
http://stackoverflow.com/questions/2420412/search-for-string-allowing-for-one-mismatches-in-any-location-of-the-string-pyth

  *Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
 my blog <http://vincentdavis.net> |
LinkedIn<http://www.linkedin.com/in/vincentdavis>


2010/3/10 Ricardo Aráoz <ricaraoz at gmail.com>

>  Vincent Davis wrote:
>
> I have never used difflib or anything similar and have a few questions.
> I am working with DNA sequences of length 25. I have a list of 230,000 and
> need to look for each sequence in the entire genome (toxoplasma parasite). I
> am not sure how large the genome is, but more than 230,000 sequences.
> There are programs that do this really fast, and they even do partial
> matches, but not quite what I need. So I am looking to build a custom
> solution.
> I need to look for each of my sequences of 25 characters, for example
> AGCCTCCCATGATTGAACAGATCAT.
> The genome is formatted as a continuous string
> (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)
>
>  I don't care where or how many times, only if it exists. This is simple I
> think, genome.find('AGCCTCCCATGATTGAACAGATCAT')
>
>  But I also want to find a close match, defined as wrong at only 1 location,
> and I want to record the location. I am not sure how to do this. The only
> thing I can think of is using a wildcard and performing the search with a
> wildcard in each position, i.e. 25 times.
> For example
> AGCCTCCCATGATTGAACAGATCAT
> AGCCTCCCATGATAGAACAGATCAT
> is a close match with a mismatch at position 13 (counting from 0)
>
>
> also :
>
> import fnmatch
>
> sequence = 'AGGCTTGCGGAGCCTGAGGGCGGAG'
> # One pattern per position: '?' matches any single character there.
> seqList = ['*' + sequence[0:i] + '?' + sequence[i+1:] + '*' for i in
> range(len(sequence))]
>
> genome = 'CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG........'
> if any(fnmatch.fnmatch(genome, i) for i in seqList):
>     print('It matches')
>
> Which might be better if the sequence is fixed and the genome changes
> inside a loop.
>
> HTH
>
>
>
>
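
For reference, the "wrong at only 1 location, and record the location" search
from the quoted question can also be sketched without wildcards, by sliding a
25-character window along the genome and counting mismatches directly. This is
only a minimal sketch: the names find_near_matches and first_near_match are
made up for illustration, and a pure-Python loop like this is likely to be far
too slow for 230,000 sequences against a genome of this size without further
work.

```python
def find_near_matches(genome, probe, max_mismatches=1):
    """Yield (start, mismatch_positions) for every window of genome that
    differs from probe in at most max_mismatches positions."""
    width = len(probe)
    for start in range(len(genome) - width + 1):
        mismatches = []
        for offset in range(width):
            if genome[start + offset] != probe[offset]:
                mismatches.append(offset)
                if len(mismatches) > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # The inner loop finished without break: this window is a hit.
            yield start, mismatches


def first_near_match(genome, probe, max_mismatches=1):
    """Return the first hit, or None; stops scanning at the first hit."""
    return next(find_near_matches(genome, probe, max_mismatches), None)


probe = 'AGCCTCCCATGATTGAACAGATCAT'
# Toy "genome" (an assumption for the demo): the variant from the question,
# which differs from probe at index 13, padded on both sides.
genome = 'CATGG' + 'AGCCTCCCATGATAGAACAGATCAT' + 'TTT'
print(first_near_match(genome, probe))  # (5, [13])
```

An exact match comes back with an empty mismatch list, so the same call covers
both the exact and the one-mismatch case. Because find_near_matches is a
generator, first_near_match stops at the first hit, which matches the idea of
moving on to the next sequence as soon as a match is found.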