How to read from a file to an arbitrary delimiter efficiently?

Chris Angelico rosuav at gmail.com
Sat Feb 27 07:17:36 EST 2016


On Sat, Feb 27, 2016 at 8:49 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Thu, 25 Feb 2016 06:30 pm, Chris Angelico wrote:
>
>> On Thu, Feb 25, 2016 at 5:50 PM, Steven D'Aprano
>> <steve+comp.lang.python at pearwood.info> wrote:
>>>
>>> # Read a chunk of bytes/characters from an open file.
>>> def chunkiter(f, delim):
>>>     buffer = []
>>>     b = f.read(1)
>>>     while b:
>>>         buffer.append(b)
>>>         if b in delim:
>>>             yield ''.join(buffer)
>>>             buffer = []
>>>         b = f.read(1)
>>>     if buffer:
>>>         yield ''.join(buffer)
>>
>> How bad is it if you over-read?
>
> Pretty bad :-)
>
> Ideally, I'd rather not over-read at all. I'd like the user to be able to
> swap from "read N bytes" to "read to the next delimiter" (and possibly
> even "read the next line") without losing anything.

If those are the *only* two operations, you should be able to maintain
your own buffer. Something like this:

import re

class ChunkIter:
    def __init__(self, f, delim):
        self.f = f
        # Escape delim so characters like "]" or "^" stay literal in the class
        self.delim = re.compile("[" + re.escape(delim) + "]")
        self.buffer = ""
    def read_to_delim(self):
        """Return characters up to the next delim, or remaining chars,
        or "" if at EOF"""
        while "delimiter not found":
            *parts, self.buffer = self.delim.split(self.buffer, 1)
            if parts: return parts[0]
            b = self.f.read(256)
            if not b:
                # EOF: return the leftovers once and empty the buffer
                ret, self.buffer = self.buffer, ""
                return ret
            self.buffer += b
    def read(self, nbytes):
        need = nbytes - len(self.buffer)
        if need > 0: self.buffer += self.f.read(need)
        # Slice by nbytes, not need: need is <= 0 when the buffer suffices
        ret, self.buffer = self.buffer[:nbytes], self.buffer[nbytes:]
        return ret

It still might over-read from the underlying file, but those extra
chars stay in self.buffer and will be available to the read(N)
function, as long as every read goes through the same ChunkIter object.
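
To show the two calls interleaving, here is a minimal self-contained
sketch. It assumes an io.StringIO stands in for the real file, and the
names ChunkIterSketch and blocksize are hypothetical, used only for this
illustration. It slices by nbytes in read(), escapes the delimiters, and
empties the buffer at EOF.

```python
import io
import re


class ChunkIterSketch:
    """Hypothetical variant of the buffering idea, for illustration only."""

    def __init__(self, f, delim, blocksize=256):
        self.f = f
        # re.escape keeps characters such as "]" or "^" literal in the class.
        self.delim = re.compile("[" + re.escape(delim) + "]")
        self.blocksize = blocksize  # assumed chunk size for refills
        self.buffer = ""

    def read_to_delim(self):
        """Return the text before the next delimiter (delimiter dropped)."""
        while True:
            *parts, self.buffer = self.delim.split(self.buffer, 1)
            if parts:
                return parts[0]
            b = self.f.read(self.blocksize)
            if not b:
                # EOF: hand back the leftovers once, then "" afterwards.
                ret, self.buffer = self.buffer, ""
                return ret
            self.buffer += b

    def read(self, nbytes):
        """Return up to nbytes characters, draining the buffer first."""
        need = nbytes - len(self.buffer)
        if need > 0:
            self.buffer += self.f.read(need)
        ret, self.buffer = self.buffer[:nbytes], self.buffer[nbytes:]
        return ret


src = io.StringIO("alpha,beta;gamma delta")
reader = ChunkIterSketch(src, ",;")
first = reader.read_to_delim()   # over-reads the whole (short) file
middle = reader.read(2)          # served from the buffer, not the file
second = reader.read_to_delim()  # continues where read(2) stopped
rest = reader.read(100)          # whatever is left
```

The first read_to_delim() pulls the entire 22-character input into the
buffer, yet the later read(2) still sees the characters immediately
after the first delimiter, so nothing is lost to the caller. Anything
that reads from src directly, bypassing reader, would miss the buffered
text.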

ChrisA

