data structure design question

Sun Feb 20 07:33:43 EST 2000

My, a thread I can really get into! (albeit, late)
This isn't very Python related, but I'll gab on the
newsgroup anyway - better that than re.compile(r'\s') :)

Felix <felixt at dicksonstreet.com> asked:
> I'm thinking of trying to do some chemistry stuff in
> Python, and one of the things I was thinking of was
> to define a Chemical (or Molecule) object.

There are a couple of things in your request.  The
first is how to represent the compound in Python,
and the second is how to store the data.

I'll start with the second, since it's tied into
what you were thinking about.  There are two very
common ways to store a molecular structure, one
is with a matrix, where the element at (i,j) is
the bond connecting atom i and j.  In the simpliest
case, where you are interested in only bond order,
this is an integer.

This approach takes order N*N memory, and is only
used in small molecules, say, a few hundred atoms.
You end up with a very sparse matrix - since most
organic systems only have at most 4 bonds the
number of bonds grows linearly.  (BTW, some organo-
metallic sytems can have over 10 bonds, but then,
as I understand it, the concept of bond isn't
really appropriate for those cases).

So most people use a list of atoms and something
to describe the bonds inbetween.  There are two
"somethings" in this case, depending on where
you want to put the bond information.

You consider bonds as single, double, triple.  Then
you want a list of atoms and a list of bonds.  The
atoms will have a name or atomic number field and
a list of the bonds used.  (Atom name is better
since that also works with atom types, later).
The bonds will have the bond type and the two atoms
used.

There are variations, like (also) storing atom
connections directly so you don't need the double
dereference to get to the other atom in a bond.

This viewpoint is more like how a chemist thinks
about chemisty.

However, many programs, like XPLOR, mentioned
elsewhere in this thread, ignore bonds other than
to say two atoms are connected.  Instead, the
information about the bond is made part of the
atom type.

Consider the cases O-C=C and C=C, where - is a single
bond and = is a double bond.  You can rewrite the
carbons as O-[C;sp]-[C;sp2] and [C;sp2]-[C;sp2]
(I'm not a chemist, so my sp's might be wrong here).
That is, you introduce atom types.  This is useful
because the dynamics of the two carbons are not
dependent on just the two carbons, but on the other
atoms bonded to the carbon.  The C=C of the first
molecule is different than the C=C of the second
one (examples in introductory chemistry classes,
which say "the energy of a double bonded carbon-
carbon is ..." not withstanding).

It's people like me (ex-physicist) and biologists
who use this model.  I worked on several successful
projects (VMD and NAMD from
http://www.ks.uiuc.edu/Research/) which never even
mention bond types.  I don't have MMTK handy, but
it is definitely atom type-based, and I don't think
it has a definition for bond type.  The most common
file format for large-molecule chemistry (meaning
protein sized) is the PDB format, and it doesn't
have bond types either.

So the model you use is going to depend on your
needs.  It sounds like you are just starting on
computational chemisty, so you probably want the
chemist's way.

Here's something like my standard data structure
for that case:

class Molecule:
  def __init__(self, atoms = None, bonds = None):
    self.atoms = atoms
    self.bonds = bonds

class Atom:
    def __init__(self, type):
        self.bonds = []
        self.type = type   # type is name or atomic number
    def add_bond(self, bond):
        self.bonds.append(bond)

class Bond:
    def __init__(self, bondtype, atom1, atom2):
        self.bondtype = bondtype
        self.atoms = (atom1, atom2)
        atom1.add_bond(self)
        atom2.add_bond(self)

This is useful for molecules which are not being
modified.  To use it, you do something like:
O = Atom("O")
C = Atom("C")
H1 = Atom("H")
H2 = Atom("H")
bonds = [Bond(1, C, H1), Bond(1, C,H2), Bond(2, C1, O)]
mol = Molecule([O, C, H1, H1], bonds)

You won't be doing molecular editing any time
soon, so this should get you most of what you need.
Oh, and BTW, this is highly cyclical, which means
you really don't want a pointer back from the atoms
or bonds to the molecule, since otherwise you'll
leak memory unless you explictly free things up.

The other part of your question is about how
to make the data structure in the first place.
You mentioned two ways, a dictionary and a connection
table.

Dictionary:
> formaldehyde = Chemical({h1:[c], h2:[c],
>  c:[h1, h2, o, o], o:[c, c]})

Connection table:
> formaldehyde = Chemical([H, H, C, O],
>    [[None, 0, 1, 0],
>     [0, None, 1, 0],
>     [1, 1, None, 2],
>     [0, 0, 2, None]])

The problem with both of these approaches is you
need the run-time check to make sure that of
X is bonded to Y then Y is bonded to X.  That's
why I have the connections done in the Bond
constructor.

The dictionary way is clever in how you repeat
occurances to mean the bond type.  I mentioned the
PDB format earlier.  It doesn't have bond types,
but there is a similar hack to add it by counting
how many times two atoms are CONECT'ed.

But it doesn't work for at least two reasons.
First, some bonds have partial bond order.
Aromatic bonds are sometimes said to have a
bond order of 1.5.  To support this, you'll end
up having to encode it as, say, "4 repeats means
aromatic bond".  Second, you can only store
one data type.  But bonds can have other properties,
like chirality - is the bond a wedge into or out
of the page?

More useful for what you're doing, you could have a
second parameter called "hcount" which is the number
of hydrogens bonded to the given atom.  That would
reduce the number of atoms you dealt with by about 1/2.

The second doesn't work for the same reason you
don't usually want to use a matrix for the internal
data structure - you end up with a very large
matrix filled with mostly zeros.  (Though if you
do quantum chemisty, which deals with small molecules,
you'll often see these sorts of matrix formats.)

There is another solution you should look at.  There
are so-called "line notations" which represent the
molecule as a string.  To make the data structure,
you write a parser for notation language.

The two most common of these are SMILES and SLN.
I think SMILES is the prettier of the two, so I'll
point you to www.daylight.com/smiles for the
description and just say SLN is described at
www.tripos.com.  In that case, your formaldehyde
is represented as "C=O" (the hydrogens are implicit).
Benzene is "c1ccccc1".  These are much easier to
read, and take smaller space.

The problem with line notations is it's harder
to write parser for them since not only is the
language more complicated than file formats like
mdli.com's "mol" format, but they often need more
chemical knowledge, such as knowing that the "C"
in "C=O" really needs two other hydrogens.  It's
also hard to store other information with the
atoms or bonds, like 3D coordinates.

I've been working on a Python interface to the
Daylight toolkit.  This exposes their C-based API
into an object-oriented form.  If you really want
to get into molecular editing, you should take
a look at it.  It's called PyDaylight and more info
is at starship.python.net/crew/dalke/PyDaylight/ .
It won't tell you how to write the data structure,
but it does explicitly describe how I think you
should interface to such a structure.

I've also got a partial Python parser for SMILES,
umm, somewhere.  I can't seem to find it now.  It
won't be finished any time soon since I'm working
on other projects.

There's at least one project, Klotho from wustl.edu,
which uses a Prolog description of a molecule.  As
I recall, they build up the molecule not from scratch
but from smaller parts.  But you have to be a chemist
for those parts to make intuitive sense.  Also of
likely interest for you is their database of small
molecules, like aspirin and caffeine.

Interestingly, this is similar to the approach
MMTK uses for describing an animo acid residue.  Those
parts mades sense to me because I worked on proteins,
and because there's only 20 some-odd parts to know.

Going farther afield, most of the academic and commercial
software uses a covalent bond model, where atoms are
connected by bonds.  When you get into more bizarre
chemistry, this breaks down.  For example, you can
have bonds between an atom and a ring system as a whole.
I don't know much about this sort of chemistry.

And then there's quantum chemists who have rather
different views on the subject, since bonds are only
an approximation of wave function densities.

If you want to look at existing code, you should look
at JMol from .. I know Christoph Steinbeck(?) in Germany
is working on it.  Don't have the URL handy and I'm not
connected, but it's described in freshmeat.net.  It's
a GPL set of chemical software for Java.  Rasmol also
store bond types, and is in C.  There are a few people
working with chemical compounds in Python, but most of
the ones I know of deal with large molecule modeling,
and either don't do bond types or don't have source code
they can make available.

This, by the way, was my description of data structures
of generic molecules.  When you have proteins and DNA,
you have much higher levels of structure, such as chains.
Often people will use a hierarchical data structure for
these "oh, atoms are in residues are in chains", but
this doesn't work.

I can leave that discussion for another time.  :)

Good luck!

                        Andrew Dalke
                        dalke at acm.org