extracting substrings from a file

Mon Sep 11 09:57:10 EDT 2006

sofiafig at gmail.com wrote:
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at	 E. coli  /GEN=bioB  /gb:J04423.1  NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1  DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at	 /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at  /GEN=bioB  /gb:J04423.1
>
> 1415785_a_at	 /gb:NM_009840.1 /GEN=Cct8
>
> Could anyone please tell me how I can do it?
>
> Many thanks in advance
> Sofia

Here's my first iteration:
C:\junk>type sofia.py
# Keep only the words that start with one of these prefixes.
prefixes = ['/GEN=', '/gb:']

def extract(fname):
    # Collect the words of each blank-line-separated entry into one chunk.
    chunks = [[]]
    with open(fname, 'r') as f:
        for line in f:
            words = line.split()
            if words:
                chunks[-1].extend(words)
            else:
                chunks.append([])
    for chunk in chunks:
        if not chunk:
            continue
        # The first word of an entry is its ID; always keep it.
        output = [chunk[0]]
        for word in chunk:
            for prefix in prefixes:
                if word.startswith(prefix):
                    output.append(word)
                    break
        print(' '.join(output))

if __name__ == "__main__":
    import sys
    extract(sys.argv[1])

C:\junk>sofia.py sofia.txt
AFFX-BioB-5_at /GEN=bioB /gb:J04423.1 /gb:J04423.1
1415785_a_at /gb:NM_009840.1 /GEN=Cct8 /gb:BC009007.1

Before I fix the duplicate in the first line, you need to say whether you
really want the /gb:BC009007.1 in the second line thrown away. In other
words, what's the rule? For each prefix, either (1) get the first "word"
that starts with that prefix or (2) get all unique such words. You choose.
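
To make the choice concrete, here is a rough sketch of both rules. It
assumes each entry has already been split into a list of words, as the
script above does; the names pick_first and pick_unique are made up for
illustration and are not part of sofia.py.

```python
PREFIXES = ['/GEN=', '/gb:']

def pick_first(chunk, prefixes=PREFIXES):
    """Rule 1: keep the entry ID plus the first word found for each prefix."""
    output = [chunk[0]]
    used = set()  # prefixes that have already contributed a word
    for word in chunk[1:]:
        for prefix in prefixes:
            if word.startswith(prefix):
                if prefix not in used:
                    used.add(prefix)
                    output.append(word)
                break
    return output

def pick_unique(chunk, prefixes=PREFIXES):
    """Rule 2: keep the entry ID plus every matching word, minus exact repeats."""
    output = [chunk[0]]
    seen = set()  # matching words already emitted
    for word in chunk[1:]:
        if word.startswith(tuple(prefixes)) and word not in seen:
            seen.add(word)
            output.append(word)
    return output

if __name__ == "__main__":
    # Abbreviated version of the second entry from the original post.
    chunk = ("1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 "
             "/FL=/gb:NM_009840.1 /gb:BC009007.1").split()
    print(' '.join(pick_first(chunk)))
    # 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
    print(' '.join(pick_unique(chunk)))
    # 1415785_a_at /gb:NM_009840.1 /GEN=Cct8 /gb:BC009007.1
```

On the two sample entries, rule 1 reproduces the output you asked for;
rule 2 also removes the duplicate in the first entry but keeps
/gb:BC009007.1 in the second.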

Cheers,
John