[Tutor] Text Processing Query

Thu Mar 14 18:41:53 CET 2013

Spyros Charonis wrote:
> Hello Pythoners,
> 
> I am trying to extract certain fields from a file whose text looks like this:
> 
> COMPND   2 MOLECULE: POTASSIUM CHANNEL SUBFAMILY K MEMBER 4;
> COMPND   3 CHAIN: A, B;
> 
> COMPND  10 MOL_ID: 2;
> COMPND  11 MOLECULE: ANTIBODY FAB FRAGMENT LIGHT CHAIN;
> COMPND  12 CHAIN: D, F;
> COMPND  13 ENGINEERED: YES;
> COMPND  14 MOL_ID: 3;
> COMPND  15 MOLECULE: ANTIBODY FAB FRAGMENT HEAVY CHAIN;
> COMPND  16 CHAIN: E, G;
> 
> I would like the chain IDs, but only those following the text heading "ANTIBODY FAB FRAGMENT", i.e. I
> need to create a list with D,F,E,G that excludes A,B, which have a non-antibody text heading. I am
> using the following code:
> 
> with open(filename) as file:
>     scanfile=file.readlines()
>     for line in scanfile:
>         if line[0:6]=='COMPND' and 'FAB FRAGMENT' in line: continue
>         elif line[0:6]=='COMPND' and 'CHAIN' in line:
>             print line

There is no reason to use readlines() in this example; just
iterate over the file object directly. It yields one line at a
time without reading the whole file into a list first.

 with open(filename) as file:
     for line in file:
         if line[0:6]=='COMPND' and 'FAB FRAGMENT' in line: continue
         elif line[0:6]=='COMPND' and 'CHAIN' in line:
             print line

> 
> But this yields:
> 
> COMPND   3 CHAIN: A, B;
> COMPND  12 CHAIN: D, F;
> COMPND  16 CHAIN: E, G;
> 
> I would like to ignore the first line since A,B correspond to non-antibody text headings, and instead
> want to extract only D,F & E,G whose text headings are specified as antibody fragments.
> 
> Many thanks,
> Spyros
> 

Will 'FAB FRAGMENT' always be the line before 'CHAIN'? 
If so, then just keep track of the previous line. 

>>> raw
'COMPND   2 MOLECULE: POTASSIUM CHANNEL SUBFAMILY K MEMBER 4;\nCOMPND   3 CHAIN: A, B;\nCOMPND  10 MOL_ID: 2;\nCOMPND  11 MOLECULE: \
ANTIBODY FAB FRAGMENT LIGHT CHAIN;\nCOMPND  12 CHAIN: D, F;\nCOMPND  13 ENGINEERED: YES;\nCOMPND  14 MOL_ID: 3;\nCOMPND  15 MOLECULE\
: ANTIBODY FAB FRAGMENT HEAVY CHAIN;\nCOMPND  16 CHAIN: E, G;'

>>> prev = ''
>>> chains = []
>>> for line in raw.split('\n'):
...     if 'COMPND' in prev and 'FAB FRAGMENT' in prev and 'CHAIN' in line:
...         chains.extend( line.split(':')[1].replace(',','').replace(';','').split())
...     prev = line
...     
>>> chains
['D', 'F', 'E', 'G']
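
If the MOLECULE line is not guaranteed to sit directly above its
CHAIN line (say an ENGINEERED line falls in between), one option is
to remember whether the most recent MOLECULE matched rather than
looking only at the previous line. A minimal sketch of that idea
follows. The names antibody_chains and marker are made up for
illustration, and it assumes every record of interest carries a
"MOLECULE:" or "CHAIN:" key as in your sample.

```python
def antibody_chains(lines, marker='FAB FRAGMENT'):
    """Return chain IDs listed under MOLECULE records containing marker."""
    chains = []
    wanted = False  # True while the current molecule matches marker
    for line in lines:
        if not line.startswith('COMPND'):
            continue
        if 'MOLECULE:' in line:
            # A new molecule starts here; decide whether we care about it.
            wanted = marker in line.partition('MOLECULE:')[2]
        elif 'CHAIN:' in line and wanted:
            # Text after "CHAIN:" looks like " D, F;" in the sample.
            ids = line.partition('CHAIN:')[2].rstrip().rstrip(';')
            chains.extend(c.strip() for c in ids.split(',') if c.strip())
    return chains
```

It accepts any iterable of lines, so you can pass it the open file
object or raw.split('\n') from the session above.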
