data parsing

Alex Martelli aleaxit at yahoo.com
Sat Feb 24 12:36:51 EST 2001


"Gnanasekaran Thoppae" <gnana at mips.biochem.mpg.de> wrote in message
news:mailman.982950184.14469.python-list at python.org...
> Hi,
>
> I have some data in a file 'test', which contains:
>
> Joe|25|30|49|40|
> |28|39|71||
> |30|29|||
> Malcolm|43|60|56||
> |28|37|||
    [snip]
> I want to parse this data and format it in this way:
>
> Joe|25;28;30|30;39;29|49;71|40|
> Malcolm|43;28|60;37|56||
> Amy|40;40|70;30|45;;30|;40;|
>
> Basically speaking, I am trying to cluster multi record
> data into one field, each separated by a delimiter ';' and
> if the field is empty, an empty ; will enable later on to
> decode the field as empty field ''.

Problems of data parsing and reformatting are often quite
interesting.  Here, of course, the key is the intermediate,
internal format, in which a 'record' will include the name
and the list-of-lists of data items; all we want is a way to
form such structures from parsing the specified input file
format, and a way to output them into the requested form.

Output of a record can only be performed when it's all in,
which means either that the next record has started, or
that the whole file is over.  A top-down design, then, could
start with the outline:

def reformat(infileob, oufileob):
    data = None
    for line in infileob.readlines():
        if line.startswith('|'):
            add_line(data, line)
        else:
            if data is not None:
                output_data(data, oufileob)
            data = new_data(line)
    if data is not None:    # nothing to flush if the input was empty
        output_data(data, oufileob)

All we need to do, then, is define more specifically what
we want to do in functions new_data, add_line, and
output_data.  This could be a good occasion to switch to
object-oriented design, making these into the constructor
and two methods of an appropriate class; the outline
would change only slightly:

def reformat(infileob, oufileob):
    data = None
    for line in infileob.readlines():
        if line.startswith('|'):
            data.add_line(line)
        else:
            if data is not None:
                data.emit_to(oufileob)
            data = Data(line)
    if data is not None:    # nothing to flush if the input was empty
        data.emit_to(oufileob)

It's just a matter of style. In the *implementation* of
the constructor, mutator, and emitter, designing the
data object as a class instance would let us have named
fields as object attributes; since, here, we only really
need two fields (the name, and the list-of-lists of data
items), the advantage is not very substantial -- we may
as well use a two-item tuple.
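
For comparison, the class-based variant might look roughly like
this.  It is only a sketch: the class name Data and the methods
add_line and emit_to come from the second outline, while the
attribute names (name, columns) are made up here for illustration,
and every input line is assumed to end with a '|'.

```python
import io


class Data:
    """One record: a name plus one list of values per column."""

    def __init__(self, line):
        fields = line.split('|')
        self.name = fields[0]
        # fields[-1] is whatever follows the final '|' (a newline or
        # an empty string), so it is dropped along with the name.
        self.columns = [[field] for field in fields[1:-1]]

    def add_line(self, line):
        fields = line.split('|')
        # Pair each existing column with the matching new value.
        for column, newfield in zip(self.columns, fields[1:-1]):
            column.append(newfield)

    def emit_to(self, oufileob):
        oufileob.write(self.name + '|')
        for column in self.columns:
            oufileob.write(';'.join(column) + '|')
        oufileob.write('\n')


# Small usage example with an in-memory file object.
record = Data("Joe|25|30|49|40|\n")
record.add_line("|28|39|71||\n")
out = io.StringIO()
record.emit_to(out)
print(out.getvalue(), end='')   # Joe|25;28|30;39|49;71|40;|
```
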

Going back to the first outline, then, we could have...:

def new_data(line):
    fields = line.split('|')
    return ( fields[0], [ [field] for field in fields[1:-1] ] )

def add_line(data, line):
    fields = line.split('|')
    for fieldlist, newfield in zip(data[1], fields[1:-1]):
        fieldlist.append(newfield)

def output_data(data, oufileob):
    oufileob.write(data[0]+'|')
    for fieldlist in data[1]:
        oufileob.write(';'.join(fieldlist)+'|')
    oufileob.write('\n')

This will give us more regular output than your example
implicitly specifies: every field gets exactly one entry per
input line, so empty entries always show up as leading or
trailing semicolons (it's not clear to me according to which
rule you have them in some cases and not in others).  When
you test and adjust this code you can no doubt implement the
exact rules you desire, too.
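
To see concretely how the output differs from your example, here
is a self-contained run over the first five lines of your sample.
It is a sketch under a few assumptions: io.StringIO objects stand
in for the real files, the variable names sample and out are made
up, and the functions are repeated only so the snippet runs on its
own.

```python
import io


def new_data(line):
    fields = line.split('|')
    return (fields[0], [[field] for field in fields[1:-1]])


def add_line(data, line):
    fields = line.split('|')
    for fieldlist, newfield in zip(data[1], fields[1:-1]):
        fieldlist.append(newfield)


def output_data(data, oufileob):
    oufileob.write(data[0] + '|')
    for fieldlist in data[1]:
        oufileob.write(';'.join(fieldlist) + '|')
    oufileob.write('\n')


def reformat(infileob, oufileob):
    data = None
    for line in infileob.readlines():
        if line.startswith('|'):
            add_line(data, line)          # continuation of current record
        else:
            if data is not None:
                output_data(data, oufileob)
            data = new_data(line)         # a new record starts here
    if data is not None:
        output_data(data, oufileob)       # flush the last record


sample = io.StringIO(
    "Joe|25|30|49|40|\n"
    "|28|39|71||\n"
    "|30|29|||\n"
    "Malcolm|43|60|56||\n"
    "|28|37|||\n"
)
out = io.StringIO()
reformat(sample, out)
print(out.getvalue(), end='')
# Joe|25;28;30|30;39;29|49;71;|40;;|
# Malcolm|43;28|60;37|56;|;|
```

Note the third and fourth fields: where your example has 49;71 and
40, this code emits 49;71; and 40;; because each of Joe's three
input lines contributes one (possibly empty) entry.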

The split and join methods of string objects are what we
are mainly using here, of course; plus a few operations on
tuples (the data object itself), lists (data[1] being a list
of lists of strings, which we extend with the append method),
strings (basically just + to concatenate them) and file
objects (just the write method).  Oh, and a list
comprehension in new_data, the zip built-in in add_line,
and a little indexing and slicing.  If any of these Python
constructs and idioms is not fully clear to you, it's possible
to rephrase quite a few of them in other ways (not quite as
concise, and thus, maybe, easier to understand), and anyway,
of course, we're always here for explanations...!
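
For instance, the list comprehension in new_data and the join
call in output_data could be spelled out with plain loops,
roughly as follows.  The names new_data_verbose and join_verbose
are made up here for illustration; they are meant to behave like
the concise versions, not to replace them.

```python
def new_data_verbose(line):
    """Same result as new_data, without a list comprehension."""
    fields = line.split('|')
    name = fields[0]
    fieldlists = []
    # Skip the name (first item) and whatever follows the last '|'.
    for field in fields[1:-1]:
        fieldlists.append([field])   # each column starts as a 1-item list
    return (name, fieldlists)


def join_verbose(fieldlist):
    """Same result as ';'.join(fieldlist), built up step by step."""
    result = ''
    first = True
    for item in fieldlist:
        if not first:
            result = result + ';'    # separator goes between items only
        result = result + item
        first = False
    return result
```
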


Alex





