Parsing a file based on differing delimiters

Wed Oct 22 04:11:12 EDT 2003

Kylotan wrote:

> I have a text file where the fields are delimited in various different
> ways. For example, strings are terminated with a tilde, numbers are
> terminated with whitespace, and some identifiers are terminated with a
> newline. This means I can't effectively use split() except on a small
> scale. For most of the file I can just call one of several functions I
> wrote that read in just as much data as is required from the input
> string, and return the value and modified string. Much of the code
> therefore looks like this:
> 
> filedata = file('whatever').read()
> firstWord, filedata = GetWord(filedata)
> nextNumber, filedata = GetNumber(filedata)
> 
> This works, but is obviously ugly. Is there a cleaner alternative that
> can avoid me having to re-assign data all the time (one that will 'consume'
> the value from the stream)? I'm a bit unclear on the whole passing by
> value/reference thing. I'm guessing that while GetWord gets a
> reference to the 'filedata' string, assigning to that will just reseat
> the reference and not change the original string.

Right: strings are immutable, and assigning to the parameter inside GetWord
only rebinds the local name. To get the effect of rebinding, wrap the data in
a mutable object and pass that object around instead of the string itself.
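
A minimal sketch of that idea, written for current Python 3. The names Cursor
and get_word are made up for illustration and are not part of your code; the
point is only that the function changes the object it was handed, so the
caller has nothing to re-assign.

```python
class Cursor:
    """Hypothetical mutable wrapper: holds the text plus a read position."""

    def __init__(self, text):
        self.text = text
        self.pos = 0


def get_word(cur):
    # Read up to the next space and advance the shared cursor in place.
    end = cur.text.find(" ", cur.pos)
    if end < 0:
        end = len(cur.text)
    word = cur.text[cur.pos:end]
    cur.pos = end + 1
    return word


cur = Cursor("hello 42 rest")
first = get_word(cur)    # no re-assignment of the data needed
second = get_word(cur)   # continues where the previous call stopped
print(first, second, cur.text[cur.pos:])
```

The string itself is never modified or copied; only the position stored in
the wrapper moves.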

> The other problem is that parts of the format are potentially repeated
> an arbitrary number of times and therefore a degree of lookahead is
> required. If I've already extracted a token and then find out I need
> it, putting it back is awkward. Yet there is nowhere near enough
> complexity or repetition in the file format to justify a formal
> grammar or anything like that.
> 
> All in all, in the basic parsing code I am doing a lot more operations
> on the input data than I would like. I can see how I'd encapsulate
> this behind functions if I was willing to iterate through the data
> character by character like I would in C++. But I am hoping that
> Python can, as usual, save me from the majority of this drudgery
> somehow.

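I've made a little Reader class that should do what you want. It leaves the
data untouched and keeps a stack of read positions, so unget() only has to pop
the last position to put a token back. Of course the actual parsing routines
will differ, depending on your file format.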

<code>
class EndOfData(Exception):
    pass

class Reader:
    def __init__(self, data):
        self.data = data
        self.positions = [0]

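    def _getChunk(self, delim):
        # Return the text from the current position up to (not including)
        # the next delim, and push the position just past that delim.
        start = self.positions[-1]
        if start >= len(self.data):
            raise EndOfData
        end = self.data.find(delim, start)
        if end < 0:
            # No delimiter left: the chunk runs to the end of the data.
            end = len(self.data)
        self.positions.append(end+1)
        return self.data[start:end]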

    def rest(self):
        return self.data[self.positions[-1]:]
    def rewind(self):
        self.positions = [0]
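    def unget(self):
        # Drop the most recent position, but never the initial 0,
        # so an extra unget() cannot leave the stack empty.
        if len(self.positions) > 1:
            self.positions.pop()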
    def getString(self):
        return self._getChunk("~")
    def getInteger(self):
        chunk = self._getChunk(" ")
        try:
            return int(chunk)
        except ValueError:
            self.unget()
            raise

#example usage:

sample = "abc~123 456 rst"
r = Reader(sample)

commands = {
    "i": r.getInteger,
    "s": r.getString,
    "u": lambda: r.unget() or "#unget " + r.rest(),
}

for key in "ssuiisuuisi":
    try:
        print(commands[key]())
    except ValueError:
        print("#error")
    except EndOfData:
        # The last "i" runs past the end of the sample.
        print("#end of data")
        break
</code>
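
For the repeated parts that need lookahead, the same get/unget mechanism can
be put in a loop. The following is only a sketch for current Python 3:
MiniReader is a condensed stand-in for the Reader above so the snippet runs on
its own, and get_integers is a hypothetical helper, not something your format
requires. It relies on the integer getter putting the token back when it is
not an integer.

```python
class EndOfData(Exception):
    pass


class MiniReader:
    # Condensed stand-in for the Reader class above (same position stack).
    def __init__(self, data):
        self.data = data
        self.positions = [0]

    def _chunk(self, delim):
        start = self.positions[-1]
        if start >= len(self.data):
            raise EndOfData
        end = self.data.find(delim, start)
        if end < 0:
            end = len(self.data)
        self.positions.append(end + 1)
        return self.data[start:end]

    def get_string(self):
        return self._chunk("~")

    def get_integer(self):
        chunk = self._chunk(" ")
        try:
            return int(chunk)
        except ValueError:
            self.positions.pop()  # put the non-integer token back
            raise


def get_integers(reader):
    # Hypothetical helper: collect integers until the next token is not one.
    values = []
    while True:
        try:
            values.append(reader.get_integer())
        except (ValueError, EndOfData):
            return values


r = MiniReader("1 2 3 name~7 8 ")
numbers = get_integers(r)   # stops at "name~7", which is put back
label = r.get_string()      # reads "name" up to the tilde
more = get_integers(r)      # reads the remaining integers
print(numbers, label, more)
```

The failed read costs nothing beyond one pop, so trying a token and backing
off is cheap enough to use as the lookahead itself.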

Peter