[Tutor] Words alignment tool

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Mon Dec 5 05:26:35 CET 2005

On Sun, 4 Dec 2005, Srinivas Iyyer wrote:

> Contr1	SPR-10	SPR-101	SPR-125	SPR-137	SPR-139	SPR-143
> contr2	SPR-1	SPR-15  SPR-126	SPR-128	SPR-141	SPR-148
> contr3	SPR-106	SPR-130	SPR-135	SPR-138	SPR-139	SPR-145
> contr4	SPR-124	SPR-125	SPR-130	SPR-139	SPR-144	SPR-148

Hi Srinivas,

I'd strongly recommend changing the data representation from a
line-oriented view to a more structured one.  Each line in your data above
appears to describe a conceptual set of tuples:

    (control_number, spr_number)

For example, we can think of the line:

    Contr1	SPR-10	SPR-101	SPR-125	SPR-137	SPR-139	SPR-143

as an encoding for the set of tuples written below (the notation is
mathematical, and is not meant to be interpreted as Python):

    { (Contr1, SPR-10),
      (Contr1, SPR-101),
      (Contr1, SPR-125),
      (Contr1, SPR-137),
      (Contr1, SPR-139),
      (Contr1, SPR-143) }
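
If you do want that decoding in Python, here is a rough sketch.  It assumes
the fields are separated by tabs or spaces, and the function name
line_to_tuples is just something I made up for illustration:

```python
def line_to_tuples(line):
    """Turn 'Contr1 SPR-10 SPR-101 ...' into a set of (control, spr) pairs."""
    fields = line.split()          # splits on tabs or runs of spaces
    if not fields:
        return set()               # blank line: no tuples
    control, sprs = fields[0], fields[1:]
    return {(control, spr) for spr in sprs}


pairs = line_to_tuples("Contr1\tSPR-10\tSPR-101\tSPR-125\tSPR-137\tSPR-139\tSPR-143")
for pair in sorted(pairs):
    print(pair)
```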

I'm not sure if I'm seeing everything, but from what I can tell so far,
your data cries out to be held in a relational database.  I agree with
Kent: you do not need to "align" anything.  If each element can appear at
most once within a sequence, then your "alignment" problem transforms into
a simpler table lookup problem.

That is, if all your data looks like:

    1: A B D E
    2: A C F
    3: A B C D

where no line can have repeated characters, then that data can be
transformed into a simple tabular representation, conceptually as:

        A   B   C   D   E   F
    1 | x | x |   | x | x |   |
    2 | x |   | x |   |   | x |
    3 | x | x | x | x |   |   |

So unless there's something here that you're not telling us, there's no
need for any complicated alignment algorithms: we just start off with an
empty table, and then for each tuple, mark the corresponding entry in
the table.
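
As a minimal sketch of that idea in Python, using the toy data above: the
nested-dictionary layout and the render helper are my own choices for
illustration, and there are plenty of other reasonable ways to hold the
table.

```python
# The toy data from above, already decoded into (row, column) tuples.
tuples = [
    (1, "A"), (1, "B"), (1, "D"), (1, "E"),
    (2, "A"), (2, "C"), (2, "F"),
    (3, "A"), (3, "B"), (3, "C"), (3, "D"),
]

rows = sorted({r for r, _ in tuples})
cols = sorted({c for _, c in tuples})

# Start off with an empty table: every cell is False.
table = {r: {c: False for c in cols} for r in rows}

# For each tuple, mark the corresponding entry.
for r, c in tuples:
    table[r][c] = True


def render(table, rows, cols):
    """Draw the table as text, with an 'x' in every marked cell."""
    lines = ["    " + "   ".join(cols)]
    for r in rows:
        cells = " | ".join("x" if table[r][c] else " " for c in cols)
        lines.append("%s | %s |" % (r, cells))
    return "\n".join(lines)


print(render(table, rows, cols))
```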

Then when we need to look for common elements, we just scan across a row
or column of the table.  BLAST is cool, but, like regular expressions,
it's not the answer to every string problem.
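
Here is one way that scanning might look in Python, using your four sample
lines.  I'm holding each row as a set, which is an assumption on my part
about what's convenient; the function names controls_with and shared are
made up for this example.

```python
# Each control maps to the set of SPR ids on its line.
data = {
    "Contr1": {"SPR-10", "SPR-101", "SPR-125", "SPR-137", "SPR-139", "SPR-143"},
    "contr2": {"SPR-1", "SPR-15", "SPR-126", "SPR-128", "SPR-141", "SPR-148"},
    "contr3": {"SPR-106", "SPR-130", "SPR-135", "SPR-138", "SPR-139", "SPR-145"},
    "contr4": {"SPR-124", "SPR-125", "SPR-130", "SPR-139", "SPR-144", "SPR-148"},
}


def controls_with(spr):
    """Scan down one column: which controls contain this SPR?"""
    return sorted(c for c, sprs in data.items() if spr in sprs)


def shared(a, b):
    """Compare two rows: which SPRs do both controls have?"""
    return sorted(data[a] & data[b])


print(controls_with("SPR-139"))      # ['Contr1', 'contr3', 'contr4']
print(shared("Contr1", "contr4"))    # ['SPR-125', 'SPR-139']
```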

If you want to implement code to do the above, it's not difficult, but you
really should use an SQL database to do this.  As a bioinformatician, it
would be in your best interest to know SQL, because otherwise, you'll end
up trying to reinvent tools that have already been written for you.
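
To give a taste of what the SQL route could look like, here is a small
sketch using Python's sqlite3 module with an in-memory database.  The table
name control_spr and its column names are my own invention, and a real
setup would probably use a proper database file or server rather than
":memory:".

```python
import sqlite3

lines = [
    "Contr1 SPR-10 SPR-101 SPR-125 SPR-137 SPR-139 SPR-143",
    "contr2 SPR-1 SPR-15 SPR-126 SPR-128 SPR-141 SPR-148",
    "contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145",
    "contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148",
]

# Decode each line into (control, spr) rows.
rows = []
for line in lines:
    fields = line.split()
    rows.extend((fields[0], spr) for spr in fields[1:])

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE control_spr ("
    " control TEXT, spr TEXT, PRIMARY KEY (control, spr))"
)
conn.executemany("INSERT INTO control_spr VALUES (?, ?)", rows)

# Which SPRs show up under more than one control, and how often?
query = """
    SELECT spr, COUNT(*) AS n
    FROM control_spr
    GROUP BY spr
    HAVING COUNT(*) > 1
    ORDER BY spr
"""
shared_sprs = conn.execute(query).fetchall()
for spr, n in shared_sprs:
    print(spr, n)
```

The primary key on (control, spr) enforces the "no repeats within a line"
assumption: inserting the same pair twice would raise an error instead of
silently double counting.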

A good book on introductory relational database usage is "The Practical
SQL Handbook: Using Structured Query Language" by Judith Bowman, Sandra
Emerson, and Marcy Darnovsky.

Good luck to you.
