Record separator

ChasBrown cbrown at cbrownsystems.com
Sat Aug 27 14:40:09 EDT 2011


On Aug 27, 10:45 am, Roy Smith <r... at panix.com> wrote:
> In article <4e592852$0$29965$c3e8da3$54964... at news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.pyt... at pearwood.info> wrote:
>
> > open("file.txt")   # opens the file
> >  .read()           # reads the contents of the file
> >  .split("\n\n")    # splits the text on double-newlines.
>
> The biggest problem with this code is that read() slurps the entire file
> into a string.  That's fine for moderately sized files, but will fail
> (or at least be grossly inefficient) for very large files.
>
> It's always annoyed me a little that while it's easy to iterate over the
> lines of a file, it's more complicated to iterate over a file character
> by character.  You could write your own generator to do that:
>
> for c in getchar(open("file.txt")):
>    whatever
>
> def getchar(f):
>    for line in f:
>       for c in line:
>          yield c
>
> but that's annoyingly verbose (and probably not hugely efficient).

read() takes an optional size parameter, so f.read(1) is another
option...
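
(Relatedly, the two-argument form of the iter() builtin gives a compact
way to loop over a file character by character without writing your own
generator. Just a sketch, using io.StringIO as a stand-in for a real
file:)

```python
import io

# iter(callable, sentinel) keeps calling f.read(1) until it returns
# the sentinel '' (end of file), yielding one character at a time.
f = io.StringIO("abc\ndef")
chars = list(iter(lambda: f.read(1), ''))
```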

>
> Of course, the next problem for the specific problem at hand is that
> even with an iterator over the characters of a file, split() only works
> on strings.  It would be nice to have a version of split which took an
> iterable and returned an iterator over the split components.  Maybe
> there is such a thing and I'm just missing it?

I don't know if there is such a thing; but for the OP's problem you
could read the file in chunks, e.g.:

def readgroup(f, delim, buffsize=8192):
    """Yield delimiter-separated groups from f, reading in chunks."""
    tail = ''
    while True:
        s = f.read(buffsize)
        if not s:
            # End of file: whatever is left over is the final group.
            yield tail
            break
        # A delimiter may straddle a chunk boundary, so keep the last
        # (possibly incomplete) piece as the tail for the next pass.
        groups = (tail + s).split(delim)
        tail = groups[-1]
        for group in groups[:-1]:
            yield group

for group in readgroup(open('file.txt'), '\n\n'):
    # do something
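
As a quick sanity check that the chunking handles a delimiter split
across buffer boundaries, here's a self-contained run (readgroup
repeated so the snippet stands alone; io.StringIO plays the part of the
file, and the tiny buffsize is just to force the boundary case):

```python
import io

def readgroup(f, delim, buffsize=8192):
    tail = ''
    while True:
        s = f.read(buffsize)
        if not s:
            yield tail
            break
        groups = (tail + s).split(delim)
        tail = groups[-1]
        for group in groups[:-1]:
            yield group

# buffsize=3 guarantees the '\n\n' delimiter straddles chunk boundaries
data = io.StringIO("one\n\ntwo\n\nthree")
groups = list(readgroup(data, '\n\n', buffsize=3))
# groups is now ['one', 'two', 'three']
```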

Cheers - Chas
