Simple Text Processing Help

Marc 'BlackJack' Rintsch bj_666 at
Mon Oct 15 13:20:47 CEST 2007

On Mon, 15 Oct 2007 10:47:16 +0000, patrick.waldo wrote:

> my sample input file looks like this( not organized,as you see it):
> 200-720-7        69-93-2
> kyselina mocová      C5H4N4O3
> 200-001-8       50-00-0
> formaldehyd      CH2O
> 200-002-3
> 50-01-1
> guanidínium-chlorid      CH5N3.ClH
> etc...

That's quite irregular, so it is not that straightforward.  One way is to
split everything into words, start a record by taking the first two
elements, and then collect words until the start of the next record, which
looks like three numbers joined by '-' characters.  The last collected word
is the formula, the ones before it form the name.  Quick and dirty hack:

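import codecs
import re

# A record starts with a token that looks like three numbers joined by '-'.
NR_RE = re.compile(r'^\d+-\d+-\d+$')

def iter_elements(tokens):
    tokens = iter(tokens)
    try:
        nr_a = next(tokens)
        while True:
            nr_b = next(tokens)
            items = list()
            for item in tokens:
                if NR_RE.match(item):
                    # Next record starts: emit the current one.
                    yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])
                    nr_a = item
                    break
                else:
                    items.append(item)
    except StopIteration:
        # Input exhausted: emit the last record.
        yield (nr_a, nr_b, ' '.join(items[:-1]), items[-1])

def main():
    in_file = codecs.open('test.txt', 'r', 'utf-8')
    tokens = in_file.read().split()
    in_file.close()
    for element in iter_elements(tokens):
        print('|'.join(element))

if __name__ == '__main__':
    main()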
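
To check the idea without a file on disk, here is a minimal sketch that runs the same kind of grouping on the sample input from the question.  The names `SAMPLE` and `split_records` are made up for this illustration, and it collects the records into a list instead of using a generator.  It assumes every record has two numbers followed by at least a formula, and that no name or formula itself looks like a number triple.

```python
import re

NR_RE = re.compile(r'^\d+-\d+-\d+$')

# Sample input copied from the question (hypothetical constant name).
SAMPLE = """\
200-720-7        69-93-2
kyselina mocová      C5H4N4O3
200-001-8       50-00-0
formaldehyd      CH2O
200-002-3
50-01-1
guanidínium-chlorid      CH5N3.ClH
"""

def split_records(tokens):
    """Group tokens into (nr_a, nr_b, name, formula) tuples."""
    records = []
    current = []
    for token in tokens:
        # A number-like token starts a new record, unless it is the
        # second number of the record currently being collected.
        if NR_RE.match(token) and len(current) != 1:
            if current:
                records.append(current)
            current = []
        current.append(token)
    if current:
        records.append(current)
    # Last token is the formula, everything between the two numbers
    # and the formula is the name.
    return [(r[0], r[1], ' '.join(r[2:-1]), r[-1]) for r in records]

records = split_records(SAMPLE.split())
lines = ['|'.join(record) for record in records]
```

Each entry of `lines` is one record with its four fields separated by '|', so `print('\n'.join(lines))` gives one record per line.  A record that is missing its second number or its formula would break this sketch, so real data probably needs some checks on top.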

	Marc 'BlackJack' Rintsch

