[Tutor] What is the best way to count the number of lines in a huge file?

dman dsh8290@rit.edu
Thu, 6 Sep 2001 17:33:18 -0400


On Thu, Sep 06, 2001 at 05:26:00PM -0400, Ignacio Vazquez-Abrams wrote:
| On Thu, 6 Sep 2001, dman wrote:
| > On Thu, Sep 06, 2001 at 09:07:08AM -0400, Ignacio Vazquez-Abrams wrote:
| > | Fair enough:
| > |
| > | ---
| > | a=None
| > | n=0
| > | while a != '':
| > |   a=file.read(262144)
| > |   n+=a.count(os.linesep)
| > | ---
| >
| > The only problem with this is it only (truly properly) counts the
| > lines of files that were created with the same OS as the one
| > counting.
| >
| > You'd probably want to use a regex to search for "\r\n|\r|\n", but it
| > all depends on the source(s) of the files you want to count.
| >
| > Make your script "good enough", not "perfect according to some misty
| > definition".  :-).
| 
| Even with that RE (and, in fact, with the given code on Windows) the problem
| is that the \r\n may straddle a 256k boundary, losing one line in the code
| and gaining one with the RE.

Heh.  I didn't even know Windows had a 256k boundary.  I always thought
that putting an extra character in the line terminator was a bad idea
... (it also breaks ftell() and fseek()).  Ken Thompson and Dennis
Ritchie had more foresight than most ;-).

A more robust technique would be to scout the file's contents first
and determine whether it is *nix, 'doze or mac format and then search
only for that particular line terminator.  My regex would allow for a
combination of line separators in a single file, which isn't a good
idea.
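
A rough sketch of that scouting approach, written in present-day Python 3
rather than the Python of this thread.  The names CHUNK_SIZE,
detect_terminator and count_lines are made up for illustration, and it
assumes the file is opened in binary mode and is seekable.  It counts
terminators, so a last line with no terminator is not counted, same as
the loop quoted above.

```python
import io

CHUNK_SIZE = 262144  # same 256k read size as the quoted loop


def detect_terminator(stream, chunk_size=CHUNK_SIZE):
    """Return the first terminator seen (b'\\r\\n', b'\\r' or b'\\n'), or None."""
    pending_cr = False
    while True:
        chunk = stream.read(chunk_size)
        if pending_cr:
            # The previous chunk ended in \r; the next byte decides the format.
            return b"\r\n" if chunk.startswith(b"\n") else b"\r"
        if not chunk:
            return None  # reached EOF without seeing any terminator
        cr = chunk.find(b"\r")
        lf = chunk.find(b"\n")
        if cr == -1 and lf == -1:
            continue
        if cr == -1 or (lf != -1 and lf < cr):
            return b"\n"
        if cr + 1 < len(chunk):
            return b"\r\n" if chunk[cr + 1:cr + 2] == b"\n" else b"\r"
        pending_cr = True  # \r is the last byte of this chunk


def count_lines(stream, chunk_size=CHUNK_SIZE):
    """Count occurrences of the file's own terminator, chunk by chunk."""
    term = detect_terminator(stream, chunk_size)
    if term is None:
        return 0
    stream.seek(0)
    count = 0
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk
        count += data.count(term)
        # Carry a trailing \r forward so a \r\n split across two reads
        # is still seen as one terminator.
        tail = b"\r" if term == b"\r\n" and data.endswith(b"\r") else b""
    return count
```

With a real file it would be called as count_lines(open(path, "rb")),
where path is whatever file you want to count.  Carrying the trailing \r
over to the next read is what avoids the lost line at a chunk boundary.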

-D