Parsing a file with iterators

Fri Oct 17 12:45:42 EDT 2008

On Fri, 17 Oct 2008 11:42:05 -0400, Luis Zarrabeitia wrote:

> I need to parse a text file. The format is something like this:
> 
> TYPE1 metadata
> data line 1
> data line 2
> ...
> data line N
> TYPE2 metadata
> data line 1
> ...
> TYPE3 metadata
> ...
> […]
> because when the parser iterates over the input, it can't know that it
> finished processing the section until it reads the next "TYPE" line
> (actually, until it reads the first line that it cannot parse, which if
> everything went well, should be the 'TYPE'), but once it reads it, it is
> no longer available to the outer loop. I wouldn't like to leak the
> internals of the parsers to the outside.
> 
> What could I do?
> (to the curious: the format is a dialect of the E00 used in GIS)

Group the lines before processing and feed each group to the right parser:

import sys
from itertools import groupby, imap
from operator import itemgetter

def parse_a(metadata, lines):
    print 'parser a', metadata
    for line in lines:
        print 'a', line

def parse_b(metadata, lines):
    print 'parser b', metadata
    for line in lines:
        print 'b', line

def parse_c(metadata, lines):
    print 'parser c', metadata
    for line in lines:
        print 'c', line

def test_for_type(line):
    return line.startswith('TYPE')

def parse(lines):
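    def tag():
        # Pair every data line with the most recent TYPE line, so that
        # groupby() can split the stream into sections.
        type_line = None
        for line in lines:
            # Drop the newline that file iteration leaves on each line.
            line = line.rstrip('\n')
            if test_for_type(line):
                type_line = line
            else:
                yield (type_line, line)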

    type2parser = {'TYPE1': parse_a,
                   'TYPE2': parse_b,
                   'TYPE3': parse_c }

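    # Each group holds the data lines of one section. The parser gets an
    # iterator over its own lines only and never sees the next TYPE line.
    for type_line, group in groupby(tag(), itemgetter(0)):
        type_id, metadata = type_line.split(' ', 1)
        type2parser[type_id](metadata, imap(itemgetter(1), group))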

def main():
    parse(sys.stdin)

if __name__ == '__main__':
    main()
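
One thing to watch for with the tagging approach above: a TYPE line that is not followed by any data lines never yields a pair, so its parser is never called, and two consecutive sections with identical TYPE lines end up in one group. A minimal sketch of a variation that groups on a running section number instead is below. It is written for Python 3 (the code above is Python 2). The names number_sections, collect and sample are made up for the illustration, as is the choice to raise ValueError for data that precedes the first TYPE line.

```python
from itertools import groupby
from operator import itemgetter


def is_type_line(line):
    return line.startswith('TYPE')


def number_sections(lines):
    """Pair every line with the index of the section it belongs to."""
    index = -1  # -1 marks lines that appear before the first TYPE line
    for line in lines:
        if is_type_line(line):
            index += 1  # a TYPE line opens a new section
        yield index, line.rstrip('\n')


def parse(lines, parsers):
    """Call parsers[type_id](metadata, data_lines) once per section."""
    results = []
    for index, group in groupby(number_sections(lines), key=itemgetter(0)):
        if index < 0:
            # Assumption for this sketch: stray leading data is an error.
            raise ValueError('data before the first TYPE line')
        # The first line of every numbered group is its TYPE line.
        _, type_line = next(group)
        type_id, _, metadata = type_line.partition(' ')
        # Lazy iterator over the remaining (data) lines of this section.
        data_lines = (line for _, line in group)
        results.append(parsers[type_id](metadata, data_lines))
    return results


def collect(metadata, lines):
    """Hypothetical stand-in parser: return what it was given."""
    return metadata, list(lines)


# Hypothetical input: an empty TYPE2 section, then a repeated TYPE1 header.
sample = [
    'TYPE1 first\n',
    'a1\n',
    'a2\n',
    'TYPE2 empty\n',
    'TYPE1 first\n',
    'b1\n',
]
parsed = parse(sample, {'TYPE1': collect, 'TYPE2': collect})
```

Here parsed has one entry per section, including the empty TYPE2 one. Each parser still receives only an iterator over its own lines, and any lines it leaves unread are skipped by groupby when the loop moves on to the next section.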