Record separator
ChasBrown
cbrown at cbrownsystems.com
Sat Aug 27 14:40:09 EDT 2011
On Aug 27, 10:45 am, Roy Smith <r... at panix.com> wrote:
> In article <4e592852$0$29965$c3e8da3$54964... at news.astraweb.com>,
> Steven D'Aprano <steve+comp.lang.pyt... at pearwood.info> wrote:
>
> > open("file.txt") # opens the file
> > .read() # reads the contents of the file
> > .split("\n\n") # splits the text on double-newlines.
>
> The biggest problem with this code is that read() slurps the entire file
> into a string. That's fine for moderately sized files, but will fail
> (or at least be grossly inefficient) for very large files.
>
> It's always annoyed me a little that while it's easy to iterate over the
> lines of a file, it's more complicated to iterate over a file character
> by character. You could write your own generator to do that:
>
> for c in getchar(open("file.txt")):
>     whatever
>
> def getchar(f):
>     for line in f:
>         for c in line:
>             yield c
>
> but that's annoyingly verbose (and probably not hugely efficient).
read() takes an optional size parameter; so f.read(1) is another
option...
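A convenient spelling of that is the two-argument form of the built-in iter(), which keeps calling a zero-argument callable until it returns the sentinel — here, the empty string read() returns at end of file. A small sketch (using an in-memory file so it runs standalone):

```python
import io

# iter(callable, sentinel) calls f.read(1) repeatedly,
# stopping when it returns '' at end of file.
f = io.StringIO("abc")
chars = list(iter(lambda: f.read(1), ''))
# chars == ['a', 'b', 'c']
```

This avoids writing an explicit generator, though per-character read() calls are still not fast for large files.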
>
> Of course, the next problem for the specific problem at hand is that
> even with an iterator over the characters of a file, split() only works
> on strings. It would be nice to have a version of split which took an
> iterable and returned an iterator over the split components. Maybe
> there is such a thing and I'm just missing it?
I don't know if there is such a thing; but for the OP's problem you
could read the file in chunks, e.g.:
def readgroup(f, delim, buffsize=8192):
    tail = ''
    while True:
        s = f.read(buffsize)
        if not s:
            yield tail
            break
        groups = (tail + s).split(delim)
        tail = groups[-1]
        for group in groups[:-1]:
            yield group

for group in readgroup(open('file.txt'), '\n\n'):
    # do something
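As a quick sanity check, readgroup can be driven with an in-memory file and a deliberately tiny buffer, which forces the '\n\n' delimiter to straddle read boundaries — the tail-carrying logic handles that case:

```python
import io

def readgroup(f, delim, buffsize=8192):
    # Same generator as above, repeated here so the snippet runs standalone.
    tail = ''
    while True:
        s = f.read(buffsize)
        if not s:
            yield tail
            break
        groups = (tail + s).split(delim)
        tail = groups[-1]
        for group in groups[:-1]:
            yield group

text = "first record\n\nsecond record\n\nthird"
# buffsize=4 is far smaller than any real buffer; it just exercises
# the case where the delimiter is split across two reads.
parts = list(readgroup(io.StringIO(text), "\n\n", buffsize=4))
# parts == ["first record", "second record", "third"]
```

This is also, in effect, the iterable-aware split Roy asked about, restricted to file-like sources.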
Cheers - Chas