Which is the better way to parse this file?

Roberto A. F. De Almeida roberto at dealmeida.net
Tue Sep 2 13:09:25 EDT 2003


"Terry Reedy" <tjreedy at udel.edu> wrote in message news:<au2dnSnl__hyPMmiU-KYgw at comcast.com>...
> I suspect that what you actually want to do is parse structures 'like'
> the above, as defined by a grammar not shown ;-)

Yes, you're right. :)

The grammar is not complex, but I'm still struggling to process the
result tree.

> You did not specify whether you will get such files from an
> uncontrollable external source or whether you control the input format.
> If the latter, there is no obvious reason for separate database,
> sequence, and structure productions since all three result in
> dictionaries with no functional difference.

This is a Dataset Descriptor for the Data Access Protocol
(http://www.unidata.ucar.edu/packages/dods/design/dap-rfc-html/), an
API to access remote datasets. DAP servers describe their datasets
using this grammar, and I'm developing a module to access DAP servers.

> > I want to obtain a dictionary like this:
> >
> > >>> pprint.pprint(data)
> > {'casts': {'experimenter': None,
> >            'location': {'latitude': None, 'longitude': None},
> >            'time': None,
> >            'xbt': {'depth': None, 'temperature': None}},
> >  'catalog_number': None}
> > The values ('None') will be filled later.
> 
> Using None as placeholders either tosses the type information or
> requires that it be recorded elsewhere.  Use the int and float type
> objects instead.  Note that standard Python cannot differentiate
> between float and float64.

Ok. One of the strong points of DAP is that data is retrieved only for
your region/period of interest. I created a class and redefined
__getitem__ so that data is only retrieved from the server when the
object is sliced.

>>> data = file("http://dods.gso.uri.edu/cgi-bin/nph-nc/data/fnoc1.nc")
>>> print data.variables['lat'].shape
(17,)
>>> print data.variables['lat'][1:4]   # only this subset is retrieved
[ 47.5  45.   42.5  40. ]
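
The idea of deferring retrieval until the object is sliced can be sketched
roughly as follows. This is only an illustration, not the actual module:
the class name LazyVariable, the fetch callback and the sample values are
all hypothetical, and the "server" is a local list so that the example runs
without a network connection.

```python
class LazyVariable:
    """Hypothetical stand-in for a remote variable; holds no data itself."""

    def __init__(self, name, shape, fetch):
        self.name = name
        self.shape = shape
        # Assumed callback: fetch(name, start, stop, step) -> list of values.
        self._fetch = fetch

    def __getitem__(self, key):
        length = self.shape[0]
        if isinstance(key, slice):
            # Turn the slice into concrete bounds and request only that range.
            start, stop, step = key.indices(length)
            return self._fetch(self.name, start, stop, step)
        if isinstance(key, int):
            index = key + length if key < 0 else key
            if not 0 <= index < length:
                raise IndexError("index out of range")
            return self._fetch(self.name, index, index + 1, 1)[0]
        raise TypeError("indices must be integers or slices")


# Made-up values standing in for data held by a server.
REMOTE = [10.0, 20.0, 30.0, 40.0, 50.0]
requests = []  # records every simulated request


def fake_fetch(name, start, stop, step):
    requests.append((name, start, stop, step))
    return [REMOTE[i] for i in range(start, stop, step)]


lat = LazyVariable("lat", (len(REMOTE),), fake_fetch)
subset = lat[1:4]  # triggers exactly one simulated request
print(subset)
print(requests)
```

Creating the object costs nothing; each slice turns into a single request
whose bounds could be translated into a constraint on the server URL.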

> I know nothing of SimpleParse (and therefore, of what would be
> different).  If the grammar is as simple as I infer from the sample -- 
> dataset and sequences containing sequences, structures, and types -- I
> would reread about recursive-descent parsing and maybe try that.  The
> type_entry function would return a (name, typeobject) pair and the
> structure, sequence, and database functions a (name, dict) pair.

Yes, it's very simple. As you see, even a structure is identical to a
sequence. The declarations are basically "types" or declarations
containing "types". Do you think it can be done without 3rd party
modules?
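
For illustration, the recursive-descent approach suggested above can be
sketched with nothing but the standard library. The grammar here is an
assumption reconstructed from the dictionary shown earlier (base types, plus
Dataset, Sequence and Structure blocks in braces followed by a name); the
sample text, the type mapping and all function names are hypothetical. Array
dimensions and names containing dots are not handled, and characters the
tokenizer does not recognise are silently skipped.

```python
import re
from pprint import pprint

# Assumption: base-type names mapped to Python type objects.
BASE_TYPES = {
    "Byte": int, "Int16": int, "UInt16": int, "Int32": int, "UInt32": int,
    "Float32": float, "Float64": float, "String": str, "Url": str,
}
CONSTRUCTORS = {"Dataset", "Sequence", "Structure"}
TOKEN = re.compile(r"[A-Za-z_]\w*|[{};]")


def token_at(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    return tokens[pos]


def expect(tokens, pos, wanted):
    found = token_at(tokens, pos)
    if found != wanted:
        raise SyntaxError("expected %r, found %r" % (wanted, found))
    return pos + 1


def read_name(tokens, pos):
    """Read 'name ;' and return (name, position after the semicolon)."""
    name = token_at(tokens, pos)
    if name in {"{", "}", ";"}:
        raise SyntaxError("expected a name, found %r" % name)
    return name, expect(tokens, pos + 1, ";")


def parse_declaration(tokens, pos):
    """Parse one declaration; return ((name, value), next position)."""
    kind = token_at(tokens, pos)
    if kind in BASE_TYPES:
        # type_entry: (name, typeobject)
        name, pos = read_name(tokens, pos + 1)
        return (name, BASE_TYPES[kind]), pos
    if kind in CONSTRUCTORS:
        # dataset / sequence / structure: (name, dict), parsed recursively
        pos = expect(tokens, pos + 1, "{")
        members = {}
        while token_at(tokens, pos) != "}":
            (name, value), pos = parse_declaration(tokens, pos)
            members[name] = value
        name, pos = read_name(tokens, pos + 1)
        return (name, members), pos
    raise SyntaxError("unknown type %r" % kind)


def parse_dds(text):
    tokens = TOKEN.findall(text)
    (name, tree), pos = parse_declaration(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing input after the dataset declaration")
    return name, tree


# Assumed input; the field types are guesses made for this example.
SAMPLE = """
Dataset {
    Int32 catalog_number;
    Sequence {
        String experimenter;
        Int32 time;
        Structure { Float64 latitude; Float64 longitude; } location;
        Sequence { Float64 depth; Float64 temperature; } xbt;
    } casts;
} data;
"""

name, tree = parse_dds(SAMPLE)
pprint(tree)
```

Since all three block types produce a plain dictionary, one branch of
parse_declaration covers them; only the base types need their own case.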

> But as hinted above, I would think about simplifying the grammar
> before worrying about parsing.  If you only have sequences of
> sequences and type entries, parsing is trivial.

I'll take a look at that. Thanks very much for the insights.

Regards,

Roberto



