Which is the better way to parse this file?

Terry Reedy tjreedy at udel.edu
Tue Sep 2 09:57:22 EDT 2003


"Roberto A. F. De Almeida" <roberto at dealmeida.net> wrote in message
news:10c662fe.0309020436.559d513d at posting.google.com...
> I'm interested in parsing a file containing this "structure":
>
> """dataset {
>    int catalog_number;
>    sequence {
>       string experimenter;
>       int32 time;
>       structure {
>          float64 latitude;
>          float64 longitude;
>       } location;
>       sequence {
>          float depth;
>          float temperature;
>       } xbt;
>    } casts;
> } data;"""

I suspect that what you actually want to do is parse structures 'like'
the above, as defined by a grammar not shown ;-)

You did not specify whether you will get such files from an
uncontrollable external source or whether you control the input format.
If the latter, there is no obvious reason for separate dataset,
sequence, and structure productions, since all three result in
dictionaries with no functional difference.

> I want to obtain a dictionary like this:
>
> >>> pprint.pprint(data)
> {'casts': {'experimenter': None,
>            'location': {'latitude': None, 'longitude': None},
>            'time': None,
>            'xbt': {'depth': None, 'temperature': None}},
>  'catalog_number': None}
> The values ('None') will be filled later.

Using None as placeholders either tosses the type information or
requires that it be recorded elsewhere.  Use the int and float type
objects instead.  Note that standard Python cannot differentiate
between float and float64.
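
As a rough sketch of the type-object idea (the fill() helper and the
choice of str for 'string' are my assumptions, not part of the original
question), the placeholder dictionary could look like this:

```python
# Same shape as the dictionary in the question, but each leaf holds a
# type object instead of None. Mapping 'string' to str is an assumption.
template = {
    'catalog_number': int,
    'casts': {
        'experimenter': str,
        'time': int,
        'location': {'latitude': float, 'longitude': float},
        'xbt': {'depth': float, 'temperature': float},
    },
}


def fill(template, raw):
    """Hypothetical helper: convert raw values with the type at each leaf."""
    out = {}
    for key, kind in template.items():
        if isinstance(kind, dict):
            # Nested dictionary: recurse with the matching raw sub-dictionary.
            out[key] = fill(kind, raw[key])
        else:
            # Leaf: the type object doubles as the conversion function.
            out[key] = kind(raw[key])
    return out
```

Because the leaves are callable, the same dictionary both records the
declared type and converts the values when they are filled in later.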

> I tried to do the parsing
> using regular expressions, but things became too complicated.

REs are great for linear repetition but not for indefinite nesting.
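
To illustrate with a cut-down version of the sample (the pattern below
is only my guess at what a first attempt might look like), a single RE
finds every 'type name;' entry but returns them as one flat list:

```python
import re

# A shortened version of the structure from the question.
sample = """dataset {
   int catalog_number;
   sequence {
      float depth;
   } xbt;
} data;"""

# Matches 'type name;' pairs anywhere in the text. The block names after
# a closing brace do not match, and the nesting depth is not recorded.
entries = re.findall(r'(\w+)\s+(\w+);', sample)
print(entries)
```

Nothing in the result says that depth belongs to xbt, which is the
information a nested dictionary needs.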

> I had
> more success using SimpleParse, but I'm interested in more insights on
> different ways of parsing this file.

I know nothing of SimpleParse (and therefore, of what would be
different).  If the grammar is as simple as I infer from the sample -- 
dataset and sequences containing sequences, structures, and types -- I
would reread about recursive-descent parsing and maybe try that.  The
type_entry function would return a (name, typeobject) pair and the
structure, sequence, and dataset functions a (name, dict) pair.
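
A minimal recursive-descent sketch along those lines follows. The
grammar is only inferred from the one sample, all function names and the
TYPES table are hypothetical, and the three block keywords share a
single function because they produce the same kind of result. Error
handling is minimal: truncated input will raise an IndexError or
ValueError rather than a tidy message.

```python
import re

# Assumed mapping from the sample's type names to Python type objects.
TYPES = {'int': int, 'int32': int, 'float': float,
         'float64': float, 'string': str}
BLOCKS = ('dataset', 'sequence', 'structure')


def tokenize(text):
    # Identifiers, braces, and semicolons; whitespace is skipped.
    return re.findall(r'[A-Za-z_]\w*|[{};]', text)


def parse_entry(tokens, pos):
    """Parse one entry starting at tokens[pos].

    Returns ((name, value), next_pos).
    """
    if tokens[pos] in BLOCKS:
        return parse_block(tokens, pos)
    return parse_type_entry(tokens, pos)


def parse_type_entry(tokens, pos):
    # type_entry := TYPE NAME ';'   ->  (name, typeobject)
    typename, name, semi = tokens[pos:pos + 3]
    if typename not in TYPES or semi != ';':
        raise SyntaxError('bad type entry at token %d' % pos)
    return (name, TYPES[typename]), pos + 3


def parse_block(tokens, pos):
    # block := KEYWORD '{' entry* '}' NAME ';'   ->  (name, dict)
    if tokens[pos + 1] != '{':
        raise SyntaxError('expected { at token %d' % (pos + 1))
    pos += 2
    members = {}
    while tokens[pos] != '}':
        (name, value), pos = parse_entry(tokens, pos)
        members[name] = value
    name, semi = tokens[pos + 1], tokens[pos + 2]
    if semi != ';':
        raise SyntaxError('expected ; at token %d' % (pos + 2))
    return (name, members), pos + 3


def parse(text):
    tokens = tokenize(text)
    (name, value), pos = parse_entry(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError('unexpected trailing tokens')
    return name, value
```

Calling parse() on the text from the question should return the name
'data' together with the nested dictionary, with type objects at the
leaves.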

But as hinted above, I would think about simplifying the grammar
before worrying about parsing.  If you only have sequences of
sequences and type entries, parsing is trivial.
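
To show how little is needed in that case, here is a sketch under strong
assumptions of my own: one declaration per line as in the sample, every
block keyword treated alike, and type names stored as plain strings. The
function name parse_lines is hypothetical.

```python
def parse_lines(text):
    """Build nested dictionaries from a one-declaration-per-line layout."""
    root = {}
    stack = [root]              # innermost open block is stack[-1]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith('{'):
            # 'keyword {' opens a new block; the keyword itself is ignored.
            stack.append({})
        elif line.startswith('}'):
            # '} name;' closes the block and files it under its name.
            name = line[1:].rstrip(';').strip()
            block = stack.pop()
            stack[-1][name] = block
        else:
            # 'type name;' is a plain entry in the current block.
            typename, name = line.rstrip(';').split()
            stack[-1][name] = typename
    return root
```

A list used as a stack replaces the recursion, and no tokenizer is
needed because each line is one of only three shapes.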

Terry J. Reedy





