[Tutor] Selecting text

Wed Jan 19 14:24:43 CET 2005

On 19 Jan 2005, ps_python at yahoo.com wrote:

> I have two lists:
>
> 1. Lseq:
>
>>>> len(Lseq)
> 30673
>>>> Lseq[20:25]
> ['NM_025164', 'NM_025164', 'NM_012384', 'NM_006380',
> 'NM_007032','NM_014332']
>
>
> 2. refseq:
>>>> len(refseq)
> 1080945
>>>> refseq[0:25]
> ['>gi|10047089|ref|NM_014332.1| Homo sapiens small
> muscle protein, X-linked (SMPX), mRNA',
> 'GTTCTCAATACCGGGAGAGGCACAGAGCTATTTCAGCCACATGAAAAGCATCGGAATTGAGATCGCAGCT',
> 'CAGAGGACACCGGGCGCCCCTTCCACCTTCCAAGGAGCTTTGTATTCTTGCATCTGGCTGCCTGGGACTT',
[...]
> 'ACTTTGTATGAGTTCAAATAAATATTTGACTAAATGTAAAATGTGA',
> '>gi|10047091|ref|NM_013259.1| Homo sapiens neuronal
> protein (NP25), mRNA',
[...]

> If Lseq[i] is present in refseq[k], then I am
> interested in printing starting from refseq[k] until
> the element that starts with '>' sign. 
>
> my Lseq has NM_014332 element and this is also present
> in second list refseq. I want to print starting from
> element where NM_014332 is present until next element
> that starts with '>' sign.

> I could not think of any smart way to do this,
> although I have tried like this:

My answer is the same one I think you got the last few times you asked
a question like this: if you want to look items up, use a dictionary.

So how do you do it?
You could build a dictionary from refseq where the elements that can
match the elements from Lseq are the keys.

Then you iterate over Lseq, check whether each item is a key in your
dictionary, and if so print the matching elements from the list.
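
To make the lookup idea concrete, here is a minimal sketch on made-up toy
data (the gi numbers, accessions and descriptions are invented for
illustration). It assumes the accession is always the fourth '|'-separated
field of a header line and that the version suffix ('.1') has to be dropped
to match the entries in Lseq.

```python
# Toy data, invented for illustration; the real lists are far longer.
refseq = [
    '>gi|1|ref|NM_000001.1| hypothetical record A',
    'ACGT',
    'TTGA',
    '>gi|2|ref|NM_000002.1| hypothetical record B',
    'GGCC',
]
lseq = ['NM_000002', 'NM_000009']

# Map each accession (version suffix dropped) to the index of its header line.
header_index = {}
for i, line in enumerate(refseq):
    if line.startswith('>'):
        accession = line.split('|')[3].split('.')[0]
        header_index[accession] = i

# Each membership test is a hash lookup, not a scan over all of refseq.
found = [key for key in lseq if key in header_index]
print(found)
```

With about a million lines in refseq and 30,000 items in Lseq, scanning the
list once per item would be slow; building the dictionary takes one pass and
each lookup after that is cheap.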

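The next function creates the dictionary.  The keys are the
NM_... entries (with the version suffix such as '.1' stripped, so that
they match the entries in Lseq); the values are the start and end
indices of the corresponding entries.

def build_dic(seq):
    keys = []
    indices = []
    for ind, entry in enumerate(seq):
        if entry.startswith('>'):
            # '>gi|10047089|ref|NM_014332.1| ...' -> 'NM_014332'
            key = entry.split('|')[3].split('.')[0]
            keys.append(key)
            indices.append(ind)
    # the last record runs to the end of the list
    indices.append(len(seq))
    return dict(zip(keys, zip(indices, indices[1:])))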

With that function you search for matching keys and if a match is found
use the start and end index to extract the right elements from the list.

def find_matching(rseq, lseq):
    d = build_dic(rseq)
    for key in lseq:
        if key in d:
            # slice from the header line up to (not including) the next '>'
            start, end = d[key]
            print(rseq[start:end])
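
As a variation on the same idea (not the functions above), here is a rough
sketch that stores the lines of each record directly in the dictionary
instead of index pairs, and returns the matches rather than printing them.
The names group_records and select_records and the toy data are invented
for illustration; it again assumes the accession is the fourth
'|'-separated field of the header.

```python
def group_records(seq):
    """Return {accession: [header line, sequence line, ...]}."""
    records = {}
    current = None
    for entry in seq:
        if entry.startswith('>'):
            # Assumes headers look like '>gi|...|ref|NM_014332.1| ...';
            # drop the '.1' version so the key matches the Lseq entries.
            accession = entry.split('|')[3].split('.')[0]
            current = records[accession] = []
        if current is not None:
            current.append(entry)
    return records


def select_records(seq, wanted):
    """Return the records whose accession is in wanted, in that order."""
    records = group_records(seq)
    return [records[key] for key in wanted if key in records]


# Toy data, invented for illustration.
toy_refseq = [
    '>gi|1|ref|NM_000001.1| hypothetical record A',
    'ACGT',
    'TTGA',
    '>gi|2|ref|NM_000002.1| hypothetical record B',
    'GGCC',
]
toy_lseq = ['NM_000002', 'NM_000009']

selected = select_records(toy_refseq, toy_lseq)
for record in selected:
    print('\n'.join(record))
```

This keeps a second copy of every line in memory, so for a list of a
million lines the index-pair version may be the better fit.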

   Karl
-- 
Please do *not* send copies of replies to me.
I read the list