extracting substrings from a file
Paul McGuire
ptmcg at austin.rr._bogus_.com
Mon Sep 11 10:12:51 EDT 2006
<sofiafig at gmail.com> wrote in message
news:1157977756.841188.8550 at p79g2000cwp.googlegroups.com...
> Hi,
>
> I have a file with several entries in the form:
>
> AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
> corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
> 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
> 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
> dethiobiotin synthetase (bioD), complete cds.
>
> 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
> /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
> /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
> /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
>
> and I would like to create a file that has only the following:
>
> AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
>
> 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
Here's a pyparsing solution that addresses your immediate question and
also gives you some leeway for adding other "/" options to your search.
Pyparsing's home page is at pyparsing.wikispaces.com.
-- Paul
data = """
AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
dethiobiotin synthetase (bioD), complete cds.

1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
/CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
/DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
"""
from pyparsing import (Word, Literal, ZeroOrMore, Dict, oneOf,
                       alphanums, alphas, printables)
# create expression we are looking for:
# name [ junk word... ] /qualifier...
name = Word(alphanums, printables).setResultsName("name")
junkWord = ~Literal("/") + Word(printables)
qualifier = ("/" + Word(alphas + "_-.").setResultsName("key") +
             oneOf("= :") +
             Word(printables).setResultsName("value"))
expr = (name + ZeroOrMore(junkWord) +
        Dict(ZeroOrMore(qualifier)).setResultsName("quals"))
# use parse action to repackage qualifier data to support "dict"-like
# access to qualifiers
qualifier.setParseAction( lambda t: (t.key,"".join(t)) )
# use this parse action instead if you just want whatever is
# after the '=' or ':' delimiter in the qualifier
# qualifier.setParseAction( lambda t: (t.key,t.value) )
# parse data strings, showing returned data structure
# (just to show what pyparsing results structure looks like)
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print res.dump()
    print
    print

# now just do what the OP wanted in the first place
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print res.name, res.quals["gb"], res.quals["GEN"]
Gives these results:
['AFFX-BioB-5_at', 'E.', 'coli', [('GEN', '/GEN=bioB'), ('gb',
'/gb:J04423.1')]]
- name: AFFX-BioB-5_at
- quals: [('GEN', '/GEN=bioB'), ('gb', '/gb:J04423.1')]
  - GEN: /GEN=bioB
  - gb: /gb:J04423.1

['1415785_a_at', [('gb', '/gb:NM_009840.1'), ('DB_XREF',
'/DB_XREF=gi:6753327'), ('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'),
('CNT', '/CNT=482'), ('TID', '/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'),
('STK', '/STK=281'), ('UG', '/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF',
'/DEF=Mus')]]
- name: 1415785_a_at
- quals: [('gb', '/gb:NM_009840.1'), ('DB_XREF', '/DB_XREF=gi:6753327'),
('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'), ('CNT', '/CNT=482'), ('TID',
'/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'), ('STK', '/STK=281'), ('UG',
'/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF', '/DEF=Mus')]
  - CNT: /CNT=482
  - DB_XREF: /DB_XREF=gi:6753327
  - DEF: /DEF=Mus
  - FEA: /FEA=FLmRNA
  - GEN: /GEN=Cct8
  - LL: /LL=12469
  - STK: /STK=281
  - TID: /TID=Mm.17989.1
  - TIER: /TIER=FL+Stack
  - UG: /UG=Mm.17989
  - gb: /gb:NM_009840.1

AFFX-BioB-5_at /gb:J04423.1 /GEN=bioB
1415785_a_at /gb:NM_009840.1 /GEN=Cct8
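
For comparison, here is a rough stdlib-only sketch of the same extraction
using the re module instead of pyparsing. It is not the approach posted
above, and the names QUALIFIER, extract_entry and extract_all are made up
for illustration. It assumes entries are separated by blank lines and that
a qualifier looks like "/key=value" or "/key:value" at the start of a word.
Unlike the pyparsing grammar, it scans the whole entry rather than stopping
at the first non-qualifier word, and it keeps the first occurrence of each
key.

```python
import re

# Hypothetical names throughout; a regex-based sketch, not the pyparsing solution.
# A qualifier is "/" + key + "=" or ":" + value, starting at the beginning of a
# word (so the "/gb:..." embedded in "/FL=/gb:..." is not picked up separately).
QUALIFIER = re.compile(r"(?<!\S)/([A-Za-z_.-]+)[=:]\S+")


def extract_entry(entry, wanted=("gb", "GEN")):
    """Return the entry name followed by the wanted qualifiers, in that order."""
    name = entry.split()[0]
    quals = {}
    for match in QUALIFIER.finditer(entry):
        # Keep only the first occurrence of each key.
        quals.setdefault(match.group(1), match.group(0))
    return " ".join([name] + [quals[key] for key in wanted if key in quals])


def extract_all(text, wanted=("gb", "GEN")):
    """Split text into blank-line-separated entries and extract each one."""
    entries = re.split(r"\n\s*\n", text.strip())
    return [extract_entry(e, wanted) for e in entries if e.strip()]


if __name__ == "__main__":
    sample = (
        "AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF\n"
        "\n"
        "1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8\n"
    )
    for line in extract_all(sample):
        print(line)
```

Other "/" options can be selected by passing a different wanted tuple, and
the returned lines can be joined with newlines and written to the output
file.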