Parsing a file based on differing delimiters
Peter Otten
__peter__ at web.de
Wed Oct 22 04:11:12 EDT 2003
Kylotan wrote:
> I have a text file where the fields are delimited in various different
> ways. For example, strings are terminated with a tilde, numbers are
> terminated with whitespace, and some identifiers are terminated with a
> newline. This means I can't effectively use split() except on a small
> scale. For most of the file I can just call one of several functions I
> wrote that read in just as much data as is required from the input
> string, and return the value and modified string. Much of the code
> therefore looks like this:
>
> filedata = file('whatever').read()
> firstWord, filedata = GetWord(filedata)
> nextNumber, filedata = GetNumber(filedata)
>
> This works, but is obviously ugly. Is there a cleaner alternative that
> can avoid me having to re-assign data all the time (one that will 'consume'
> the value from the stream)? I'm a bit unclear on the whole passing by
> value/reference thing. I'm guessing that while GetWord gets a
> reference to the 'filedata' string, assigning to that will just reseat
> the reference and not change the original string.
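That's right: strings are immutable, and assigning to a parameter inside a
function only rebinds the local name. The usual way around this is to wrap
the data in a mutable object and pass that object around instead of the
string itself.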
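A minimal sketch of the difference, assuming a made-up Buffer class and
made-up helpers get_word_rebinding() and get_word() (none of these names come
from your code):

```python
# Illustrative only: Buffer, get_word_rebinding and get_word are
# hypothetical names, not part of the original poster's code.

def get_word_rebinding(data):
    # Assigning to 'data' only rebinds the local name;
    # the caller's string is left untouched.
    word, _, data = data.partition(" ")
    return word


class Buffer:
    """Mutable wrapper holding the text that has not been read yet."""
    def __init__(self, text):
        self.text = text


def get_word(buf):
    # Changing an attribute of the shared object is visible to the caller.
    word, _, buf.text = buf.text.partition(" ")
    return word


filedata = "hello 42 rest"
first = get_word_rebinding(filedata)
# filedata is unchanged here: the function could not 'consume' anything.

buf = Buffer(filedata)
first_word = get_word(buf)
second_word = get_word(buf)
# buf.text now holds only the unread remainder.
```

The caller no longer has to re-assign anything, because every function works
on the same Buffer instance.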
> The other problem is that parts of the format are potentially repeated
> an arbitrary number of times and therefore a degree of lookahead is
> required. If I've already extracted a token and then find out I need
> it, putting it back is awkward. Yet there is nowhere near enough
> complexity or repetition in the file format to justify a formal
> grammar or anything like that.
>
> All in all, in the basic parsing code I am doing a lot more operations
> on the input data than I would like. I can see how I'd encapsulate
> this behind functions if I was willing to iterate through the data
> character by character like I would in C++. But I am hoping that
> Python can, as usual, save me from the majority of this drudgery
> somehow.
I've made a little Reader class that should do what you want. Of course the
actual parsing routines will differ, depending on your file format.
<code>
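class EndOfData(Exception):
    pass

class Reader:
    def __init__(self, data):
        self.data = data
        # Stack of read positions; the last entry is the current one.
        self.positions = [0]

    def _getChunk(self, delim):
        # Return the text up to the next delim and remember where we were.
        start = self.positions[-1]
        if start >= len(self.data):
            raise EndOfData
        end = self.data.find(delim, start)
        if end < 0:
            # No further delimiter: take everything that is left.
            end = len(self.data)
        self.positions.append(end+1)
        return self.data[start:end]

    def rest(self):
        return self.data[self.positions[-1]:]

    def rewind(self):
        self.positions = [0]

    def unget(self):
        # Step back to the position before the most recent read.
        self.positions.pop()

    def getString(self):
        return self._getChunk("~")

    def getInteger(self):
        chunk = self._getChunk(" ")
        try:
            return int(chunk)
        except ValueError:
            # Not a number: put the chunk back before reporting the error.
            self.unget()
            raise

#example usage:
sample = "abc~123 456 rst"
r = Reader(sample)
commands = {
    "i": r.getInteger,
    "s": r.getString,
    # unget() returns None, so the "or" falls through to the message
    "u": lambda: r.unget() or "#unget " + r.rest(),
}
for key in "ssuiisuuisi":
    try:
        print(commands[key]())
    except ValueError:
        print("#error")
    except EndOfData:
        # the last "i" runs past the end of the sample
        print("#end of data")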
</code>
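For the "repeated an arbitrary number of times" part, here is a rough sketch
of how the lookahead could work with this approach. It relies on getInteger()
putting the chunk back when it is not a number. The helper read_integers()
and the sample data are my own inventions for illustration; the Reader is a
condensed copy so the snippet runs on its own.

```python
class EndOfData(Exception):
    pass


class Reader:
    # Condensed copy of the Reader from the post, repeated here
    # only so that this sketch is self-contained.
    def __init__(self, data):
        self.data = data
        self.positions = [0]

    def _getChunk(self, delim):
        start = self.positions[-1]
        if start >= len(self.data):
            raise EndOfData
        end = self.data.find(delim, start)
        if end < 0:
            end = len(self.data)
        self.positions.append(end + 1)
        return self.data[start:end]

    def unget(self):
        self.positions.pop()

    def getString(self):
        return self._getChunk("~")

    def getInteger(self):
        chunk = self._getChunk(" ")
        try:
            return int(chunk)
        except ValueError:
            self.unget()
            raise


def read_integers(reader):
    # Hypothetical helper: collect integers until the next chunk is not
    # an integer (already put back by getInteger) or the data runs out.
    numbers = []
    while True:
        try:
            numbers.append(reader.getInteger())
        except (ValueError, EndOfData):
            break
    return numbers


r = Reader("1 2 3 name~7 8")   # made-up sample data
first_run = read_integers(r)   # stops in front of "name~..."
label = r.getString()          # the rejected chunk is still available
second_run = read_integers(r)  # stops at the end of the data
```

Because the failed getInteger() call has already undone itself, the caller
can simply try the next kind of field without any explicit putting back.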
Peter