[Pythonmac-SIG] Reading big text files

Oliver Steele steele@cs.brandeis.edu
Fri, 11 Jun 1999 19:09:08 -0400


Pieter Claerhout <chill@mediaport.org> writes:

> Everything works fine, but with big files (more than 5 MB, and most of my
> PostScript files are bigger than that), I run into memory problems.

and:

> I also have some problems with
> the different line-ending style from the different platforms. How should I
> handle this??

I've run into exactly these problems, and I found myself needing to go a bit
beyond what the other respondents have so helpfully suggested.

If you know in advance that the file uses UNIX or DOS line separators, you
can open the file with 'rb' instead of 'r' and then trim the string, as
Joseph J. Strout suggested.  Unfortunately, a MacOS file separates lines with
'\r' alone, so file.readline() on a MacOS file opened with 'rb' never finds a
'\n' and reads the whole file.  You've fixed the problem for UNIX/DOS files,
but you're back where you started for MacOS files.
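
To make the 'rb' behaviour concrete, here is a small sketch in present-day
Python syntax.  The helper name first_line_length and the sample data are
made up for illustration; the only point is how much a single readline()
call returns for each line-ending style.

```python
import os
import tempfile

def first_line_length(data):
    """Write data to a temporary file and return the length of the first
    readline() result when the file is opened in 'rb' mode.
    (Hypothetical helper, for illustration only.)"""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as out:
            out.write(data)
        with open(path, 'rb') as fp:
            return len(fp.readline())
    finally:
        os.remove(path)

# UNIX: readline() stops at the first '\n'.
unix_len = first_line_length(b"one\ntwo\nthree\n")
# DOS: readline() still stops at '\n'; the '\r' before it is left to trim.
dos_len = first_line_length(b"one\r\ntwo\r\nthree\r\n")
# MacOS: there is no '\n' at all, so readline() returns the entire file.
mac_len = first_line_length(b"one\rtwo\rthree\r")
```

With a 5 MB MacOS file, that last case is the memory problem all over again.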

The textopen module, at
http://www.cs.brandeis.edu/~steele/sources/textopen.py, creates a file-like
object that recognizes all line-ending styles.  Add:
  from textopen import textopen
and replace
  fp = open(pathname, 'r')
with
  fp = textopen(pathname, 'r')
and your code will work regardless of the line-ending style.
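
For readers curious how a file-like object of this kind might work, the
following is a rough sketch in present-day Python.  It is not the textopen
source: the class name AnyNewlineReader, the chunk_size parameter, and the
choice to normalize every line ending to '\n' are all assumptions made for
illustration.  The idea is to read the file in fixed-size chunks and split
on whichever of '\r', '\n', or '\r\n' comes first, so memory use is bounded
by the chunk size plus the longest line.

```python
import io

class AnyNewlineReader:
    """Hypothetical stand-in, NOT the real textopen.  Wraps a binary file
    and returns lines ending in b'\n' whatever the original separator was."""

    def __init__(self, fp, chunk_size=8192):
        self.fp = fp
        self.chunk_size = chunk_size
        self.buffer = b""
        self.eof = False

    def readline(self):
        while True:
            # Find the earliest '\r' or '\n' in the buffered data.
            hits = [i for i in (self.buffer.find(b"\r"),
                                self.buffer.find(b"\n")) if i >= 0]
            if hits:
                i = min(hits)
                # A '\r' at the very end of the buffer may be the first half
                # of a '\r\n' split across chunks, so read more before deciding.
                ends_on_cr = (self.buffer[i:i + 1] == b"\r"
                              and i == len(self.buffer) - 1)
                if not ends_on_cr or self.eof:
                    end = i + 2 if self.buffer[i:i + 2] == b"\r\n" else i + 1
                    line = self.buffer[:i] + b"\n"
                    self.buffer = self.buffer[end:]
                    return line
            if self.eof:
                # Last line without a terminator, or b"" once the file is done.
                line, self.buffer = self.buffer, b""
                return line
            chunk = self.fp.read(self.chunk_size)
            if chunk:
                self.buffer += chunk
            else:
                self.eof = True

# Mixed line endings, with a tiny chunk size to exercise the split '\r\n' case.
reader = AnyNewlineReader(io.BytesIO(b"a\rb\r\nc\nd"), chunk_size=4)
lines = []
while True:
    line = reader.readline()
    if not line:
        break
    lines.append(line)
```

Only readline() is sketched here; a real replacement for open() would also
need read(), readlines(), close(), and so on.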

If you use textopen, though, you can't use the fileinput module to read one
line at a time as Corran Webster suggested.  (fileinput.input() expects a
filename or a list of filenames, whereas textopen() returns a file-like
object.)  Instead, you can use the textlines module, at
http://www.cs.brandeis.edu/~steele/sources/textlines.py:
  from textopen import textopen
  from textlines import textlines
  ...
  fp = textopen(pathname)
  for x in textlines(fp):
    ...
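
As a rough idea of what a line-at-a-time wrapper of this sort has to do,
here is a sketch in present-day Python.  It is not the textlines source:
the function name iter_lines is made up, and it assumes only that the
object passed in has a readline() method that returns an empty string at
end of file.

```python
import io

def iter_lines(fp):
    """Hypothetical stand-in for textlines: yield one line at a time from
    any object with a readline() method, without loading the whole file."""
    while True:
        line = fp.readline()
        if not line:
            # An empty result from readline() signals end of file.
            return
        yield line

# io.StringIO stands in for the file-like object returned by textopen().
sample = io.StringIO("first\nsecond\nthird\n")
collected = [line.rstrip("\n") for line in iter_lines(sample)]
```

Because only one line is held at a time, the size of the file doesn't
matter, as long as the underlying readline() itself reads incrementally.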