Parsing a file based on differing delimiters

Bengt Richter bokr at oz.net
Wed Oct 22 07:02:29 EDT 2003


On 21 Oct 2003 15:21:13 -0700, kylotan at hotmail.com (Kylotan) wrote:

>I have a text file where the fields are delimited in various different
>ways. For example, strings are terminated with a tilde, numbers are
>terminated with whitespace, and some identifiers are terminated with a
>newline. This means I can't effectively use split() except on a small
>scale. For most of the file I can just call one of several functions I
>wrote that read in just as much data as is required from the input
>string, and return the value and modified string. Much of the code
>therefore looks like this:
>
>filedata = file('whatever').read()
>firstWord, filedata = GetWord(filedata)
>nextNumber, filedata = GetNumber(filedata)
>
>This works, but is obviously ugly. Is there a cleaner alternative that
>can avoid me having to re-assign data all the time that will 'consume'
>the value from the stream? I'm a bit unclear on the whole passing by
>value/reference thing. I'm guessing that while GetWord gets a
>reference to the 'filedata' string, assigning to that will just reseat
>the reference and not change the original string.
>
>The other problem is that parts of the format are potentially repeated
>an arbitrary number of times and therefore a degree of lookahead is
>required. If I've already extracted a token and then find out I need
A generator can look ahead by holding put-back info in its own state,
not yielding a result until it has decided what to do. It can read
input line by line, scan the lines for patterns, and store ambiguous info
for re-analysis if backup is needed. You can go character by character,
or whip through lines of comments in bigger chunks, and recognize alternative
patterns with regular expressions. There are lots of options.

>it, putting it back is awkward. Yet there is nowhere near enough
A generator wouldn't have to put it back, but if that is a convenient way to
go, you can define one with a put-back stack or queue by including a mutable
object for that purpose as one of the arguments in the initial generator call.
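For instance, a minimal sketch (the names here are made up) of a generator
that takes a mutable put-back list as one of its initial arguments:

```python
def tokens(text, pushback):
    """Yield whitespace-separated tokens, serving `pushback` first.

    `pushback` is a caller-supplied list; anything the caller appends
    to it gets yielded before the generator advances in the input.
    """
    it = iter(text.split())
    while True:
        if pushback:
            yield pushback.pop()      # serve put-back tokens first
        else:
            try:
                yield next(it)
            except StopIteration:
                return

pushback = []
gen = tokens('alpha beta gamma', pushback)
first = next(gen)         # 'alpha'
pushback.append(first)    # decide we weren't ready for it yet
print(next(gen))          # 'alpha' again
print(next(gen))          # 'beta'
```

Because the caller and the generator share the same list object, the caller
can push a token back at any time without the generator losing its place.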

>complexity or repetition in the file format to justify a formal
>grammar or anything like that.

Communicating clearly and precisely should be more than enough justification IMO ;-)

What you've said above sounds like approximately:

    kylotan_file: ( string_text '~' | number WS | some_identifiers NL )*

If it's not that complicated, why not complete the picture? I'd bet you'll get several
versions of tokenizers/parsers for it, and questions as to what you want to do with the
pieces. Maybe a tokenizer as a generator that gives you a sequence of (token_type, token_data)
tuples would work. If you have nested structures, you can define start-of-nest and end-of-nest
tokens as operator tokens like ( OP, '(' ) and ( OP, ')' ).
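A rough sketch of such a tokenizer for the grammar above (the exact patterns
are guesses at your format, so adjust to taste):

```python
import re

# One named group per token type; the first alternative that matches wins,
# so numbers and identifiers are tried before the catch-all string rule.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)\s     # digits terminated by whitespace
  | (?P<IDENT>\w+)\n                # identifier terminated by newline
  | (?P<STRING>[^~]+)~              # anything up to a tilde
""", re.VERBOSE)

def tokenize(data):
    """Generate (token_type, token_data) tuples from the raw file data."""
    pos = 0
    while pos < len(data):
        m = TOKEN_RE.match(data, pos)
        if not m:
            raise ValueError('unrecognized input at %r' % data[pos:pos + 20])
        yield m.lastgroup, m.group(m.lastgroup)
        pos = m.end()
```

Then `list(tokenize("12 name\nhello world~"))` gives you
`[('NUMBER', '12'), ('IDENT', 'name'), ('STRING', 'hello world')]`,
and the terminating delimiters never reach the consumer.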

Look at Andrew Dalke's recent post for a number of ideas and code you might snip and adapt
to your problem (I think this shortened url will get you there):

    http://groups.google.com/groups?q=rpn.compile+group:comp.lang.python.*&hl=en&lr=&ie=UTF-8

>
>All in all, in the basic parsing code I am doing a lot more operations
>on the input data than I would like. I can see how I'd encapsulate
>this behind functions if I was willing to iterate through the data
>character by character like I would in C++. But I am hoping that
>Python can, as usual, save me from the majority of this drudgery
>somehow.
I suspect you could recognize bigger chunks with regular expressions, or at
least split them apart by splitting on a regex of delimiters (which you can
preserve in the split list by enclosing in parens).
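E.g., a quick sketch of that delimiter-preserving split (the delimiter set
here is just a guess at your format):

```python
import re

# The parens around the pattern make re.split keep the delimiters
# themselves in the result list, interleaved with the fields.
parts = re.split(r'([~\s])', 'hello~42 name\n')
parts = [p for p in parts if p]   # drop the empty strings re.split leaves
print(parts)   # ['hello', '~', '42', ' ', 'name', '\n']
```

From there you can walk the list and let each delimiter tell you what
kind of field preceded it.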

>
>Any help appreciated.
>
HTH

Regards,
Bengt Richter



