Philipp Pagel pDOTpagel at
Wed Sep 3 12:41:52 CEST 2008

Francesco Pietra <chiendarret at> wrote:
> ATOM   3424  N   LEU B 428     143.814  87.271  77.726  1.00115.20       2SG3426
> ATOM   3425  CA  LEU B 428     142.918  87.524  78.875  1.00115.20       2SG3427

> As you can see, the number of lines for a particular value in column 6
> changes from situation to situation, and may even be different for the
> same name in column 4. For example, LEU can have a different number of
> lines depending on the position of this amino acid (leucine).

Others have already given good hints, but I would like to add a bit of

The data you show appears to be a PDB protein structure file. It is
important to realize that these are fixed-width files: columns can be
empty and adjacent fields can run together (as in '1.00115.20' in your
lines), so splitting on tabs or whitespace will often fail. It is also
important to know that the residue numbering (cols 23-26) is not
necessarily contiguous and is not even unique without taking into
account the 'insertion code' in column 27, which happens to be empty in
your example. I would recommend using a full-blown PDB parser to read
the data and then iterating over the residues to do whatever you would
like to accomplish. Biopython has such a parser (the Bio.PDB module).
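
To illustrate the fixed-width point, here is a minimal sketch in plain
Python that slices the two ATOM lines from your mail by column position
and groups atoms by (chain, residue number, insertion code). The names
parse_atom_line and group_by_residue are made up for this example, and
it ignores alternate locations, multiple models and malformed records,
so treat it as an illustration rather than a replacement for a real
parser.

```python
# The two ATOM records quoted above, unchanged.
SAMPLE = [
    "ATOM   3424  N   LEU B 428     143.814  87.271  77.726  1.00115.20       2SG3426",
    "ATOM   3425  CA  LEU B 428     142.918  87.524  78.875  1.00115.20       2SG3427",
]


def parse_atom_line(line):
    """Slice one ATOM/HETATM record by fixed column positions.

    PDB columns are 1-based, so column 23-26 becomes line[22:26].
    """
    return {
        "serial": int(line[6:11]),        # cols 7-11
        "name": line[12:16].strip(),      # cols 13-16
        "resname": line[17:20].strip(),   # cols 18-20
        "chain": line[21],                # col 22
        "resseq": int(line[22:26]),       # cols 23-26
        "icode": line[26],                # col 27, often a blank
        "x": float(line[30:38]),          # cols 31-38
        "y": float(line[38:46]),          # cols 39-46
        "z": float(line[46:54]),          # cols 47-54
        "occupancy": float(line[54:60]),  # cols 55-60
        "bfactor": float(line[60:66]),    # cols 61-66
    }


def group_by_residue(lines):
    """Collect atoms per residue, keyed by (chain, resseq, icode)."""
    residues = {}
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        atom = parse_atom_line(line)
        key = (atom["chain"], atom["resseq"], atom["icode"])
        residues.setdefault(key, []).append(atom)
    return residues


if __name__ == "__main__":
    for key, atoms in group_by_residue(SAMPLE).items():
        print(key, [a["name"] for a in atoms])
```

Note that SAMPLE[0].split() returns occupancy and B-factor as the single
token '1.00115.20', whereas the slices keep them apart. Including the
insertion code in the key is what keeps residues distinct when the
residue number alone is not unique.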


Dr. Philipp Pagel
Lehrstuhl f. Genomorientierte Bioinformatik
Technische Universität München
