[Tutor] Text Processing Query
Prasad, Ramit
ramit.prasad at jpmorgan.com
Thu Mar 14 18:41:53 CET 2013
Spyros Charonis wrote:
> Hello Pythoners,
>
> I am trying to extract certain fields from a file whose text looks like this:
>
> COMPND 2 MOLECULE: POTASSIUM CHANNEL SUBFAMILY K MEMBER 4;
> COMPND 3 CHAIN: A, B;
>
> COMPND 10 MOL_ID: 2;
> COMPND 11 MOLECULE: ANTIBODY FAB FRAGMENT LIGHT CHAIN;
> COMPND 12 CHAIN: D, F;
> COMPND 13 ENGINEERED: YES;
> COMPND 14 MOL_ID: 3;
> COMPND 15 MOLECULE: ANTIBODY FAB FRAGMENT HEAVY CHAIN;
> COMPND 16 CHAIN: E, G;
>
> I would like the chain IDs, but only those following the text heading "ANTIBODY FAB FRAGMENT", i.e. I
> need to create a list with D,F,E,G which excludes A,B which have a non-antibody text heading. I am
> using the following syntax:
>
> with open(filename) as file:
>     scanfile=file.readlines()
>     for line in scanfile:
>         if line[0:6]=='COMPND' and 'FAB FRAGMENT' in line: continue
>         elif line[0:6]=='COMPND' and 'CHAIN' in line:
>             print line
There is no reason to use readlines() in this example; just
iterate over the file object directly.
    with open(filename) as file:
        for line in file:
            if line[0:6]=='COMPND' and 'FAB FRAGMENT' in line: continue
            elif line[0:6]=='COMPND' and 'CHAIN' in line:
                print line
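For anyone on Python 3, here is a rough sketch of the same filter
written as a generator that accepts any iterable of lines, so an open
file object can be passed in directly without readlines(). The names
chain_lines and SAMPLE are made up for illustration, and io.StringIO
only stands in for a real file.

```python
import io


def chain_lines(lines):
    """Yield COMPND lines that mention CHAIN, skipping FAB FRAGMENT lines.

    `lines` can be any iterable of strings, including an open file object.
    """
    for line in lines:
        if not line.startswith('COMPND'):
            continue
        if 'FAB FRAGMENT' in line:
            continue
        if 'CHAIN' in line:
            yield line.rstrip('\n')


# Hypothetical sample data, taken from the lines quoted in the question.
SAMPLE = """\
COMPND 2 MOLECULE: POTASSIUM CHANNEL SUBFAMILY K MEMBER 4;
COMPND 3 CHAIN: A, B;
COMPND 11 MOLECULE: ANTIBODY FAB FRAGMENT LIGHT CHAIN;
COMPND 12 CHAIN: D, F;
"""

# io.StringIO behaves like a text file opened for reading.
found = list(chain_lines(io.StringIO(SAMPLE)))
for line in found:
    print(line)
```

With a real file this would be used as
"with open(filename) as f: found = list(chain_lines(f))". Note that it
keeps the behaviour of the original code, so the A, B line is still
reported.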
>
> But this yields:
>
> COMPND 3 CHAIN: A, B;
> COMPND 12 CHAIN: D, F;
> COMPND 16 CHAIN: E, G;
>
> I would like to ignore the first line since A,B correspond to non-antibody text headings, and instead
> want to extract only D,F & E,G whose text headings are specified as antibody fragments.
>
> Many thanks,
> Spyros
>
Will 'FAB FRAGMENT' always be the line before 'CHAIN'?
If so, then just keep track of the previous line.
>>> raw
'COMPND 2 MOLECULE: POTASSIUM CHANNEL SUBFAMILY K MEMBER 4;\nCOMPND 3 CHAIN: A, B;\nCOMPND 10 MOL_ID: 2;\nCOMPND 11 MOLECULE: \
ANTIBODY FAB FRAGMENT LIGHT CHAIN;\nCOMPND 12 CHAIN: D, F;\nCOMPND 13 ENGINEERED: YES;\nCOMPND 14 MOL_ID: 3;\nCOMPND 15 MOLECULE\
: ANTIBODY FAB FRAGMENT HEAVY CHAIN;\nCOMPND 16 CHAIN: E, G;'
>>> prev = ''
>>> chains = []
>>> for line in raw.split('\n'):
...     if 'COMPND' in prev and 'FAB FRAGMENT' in prev and 'CHAIN' in line:
...         chains.extend( line.split(':')[1].replace(',','').replace(';','').split())
...     prev = line
...
>>> chains
['D', 'F', 'E', 'G']
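If the CHAIN line is not guaranteed to come immediately after the
MOLECULE line, one possible variation is to remember whether the most
recent MOLECULE record mentioned the keyword, instead of looking only
at the previous line. The sketch below is Python 3 and assumes every
record looks like the sample: 'COMPND', a number, then 'FIELD: value;'.
The function name antibody_chains and the keyword parameter are
hypothetical, and I have not tried this against real files.

```python
def antibody_chains(lines, keyword='FAB FRAGMENT'):
    """Collect chain IDs that belong to molecules whose name contains `keyword`.

    Assumes each record has the form 'COMPND <number> FIELD: value;'.
    """
    chains = []
    in_antibody = False  # True while the current molecule matches `keyword`
    for line in lines:
        if not line.startswith('COMPND'):
            continue
        parts = line.split(None, 2)  # ['COMPND', '<number>', 'FIELD: value;']
        if len(parts) < 3:
            continue
        field = parts[2]
        if field.startswith('MOLECULE:'):
            # A new molecule starts: update the flag for the records after it.
            in_antibody = keyword in field
        elif field.startswith('CHAIN:') and in_antibody:
            ids = field.split(':', 1)[1].replace(';', '').split(',')
            chains.extend(c.strip() for c in ids if c.strip())
    return chains


# Hypothetical sample: the ENGINEERED record sits between MOLECULE and CHAIN.
sample = [
    'COMPND 2 MOLECULE: POTASSIUM CHANNEL SUBFAMILY K MEMBER 4;',
    'COMPND 3 CHAIN: A, B;',
    'COMPND 10 MOL_ID: 2;',
    'COMPND 11 MOLECULE: ANTIBODY FAB FRAGMENT LIGHT CHAIN;',
    'COMPND 12 ENGINEERED: YES;',
    'COMPND 13 CHAIN: D, F;',
    'COMPND 14 MOL_ID: 3;',
    'COMPND 15 MOLECULE: ANTIBODY FAB FRAGMENT HEAVY CHAIN;',
    'COMPND 16 CHAIN: E, G;',
]
result = antibody_chains(sample)
print(result)
```

The flag is reset on every MOLECULE record, so the A, B chains of the
potassium channel are skipped even though their CHAIN line has the
same layout as the antibody ones.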