[Patches] fileinput.py argument handling (and suggestion)

Greg Ward gward@mems-exchange.org
Tue, 11 Apr 2000 12:52:37 -0400


On 11 April 2000, Moshe Zadka said:
> I really like the text_file.py. Can't it just be in the standard library
> and have a __getitem__ method (or is there one already? I only looked at
> the doc you sent). This will at least solve the FAQ about reading a file
> line-by-line in a nice way.

No, there's no __getitem__.  I guess I could add one along the lines of
FileInput's; I didn't know about that class when I wrote my TextFile
class, and the idea of a strictly sequential __getitem__ didn't occur to
me independently.
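
For readers unfamiliar with the idea, here is a minimal sketch of a
strictly sequential __getitem__, written for modern Python 3 rather than
1.5.2. The class name SequentialLines and its error messages are
hypothetical; this is not the code from FileInput or TextFile, only an
illustration of how raising IndexError at end-of-file lets a plain
for-loop walk the lines.

```python
import io


class SequentialLines:
    """Hypothetical wrapper: lets a for-loop walk a file via __getitem__."""

    def __init__(self, fileobj):
        self.file = fileobj
        self.next_index = 0

    def __getitem__(self, index):
        # Only strictly sequential access is supported: 0, 1, 2, ...
        if index != self.next_index:
            raise RuntimeError("lines must be read in order")
        line = self.file.readline()
        if not line:
            # IndexError tells the for-loop that the sequence has ended.
            raise IndexError("end of file")
        self.next_index += 1
        return line


# Usage: the for-loop protocol calls __getitem__(0), __getitem__(1), ...
lines = SequentialLines(io.StringIO("first\nsecond\n"))
collected = [line for line in lines]
```

The object only pretends to be a sequence: asking for any index other
than the next one is an error, which is what "strictly sequential" means
here.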

The other problem, which I blithely assume is a problem with both the
standard FileInput and my TextFile, is performance.  Sometimes, you
just want to loop over all lines in a text file as quickly as possible,
doing something with each line.  Personally, I do those tasks using a
well-known and popular scripting language that was carefully optimized
for that very task.  Others in the Python community have expressed
distaste for that particular language, but the fact remains that P**l's
facilities for looping-over-all-lines-in-a-text-file-quickly are much
better than Python's.  You can write lovely high-level classes that
provide the same functionality as Perl's bag o' tricks, but the
performance will, presumably, stink.

[...considerable time passes...]

OK, I got curious: how bad *is* the performance of Python's regular file
input, the FileInput class, and my TextFile class?  My test data is the
entire Python 1.5.2 standard library, i.e. all 344 *.py files under
$prefix/lib/python1.5 (hmm, I guess that includes whatever I happen to
have in site-packages right now -- whatever).  The basic tasks are:

  1. read every line and discard it
  2. read all lines in one fell swoop and discard them
  3. read every line, stripping comments and whitespace and ignoring
     blanks... and then discard what's left (i.e. we only exercise
     detecting blank lines)
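
To make the three tasks concrete, here is a rough sketch of each in
modern Python 3. These are not the actual readfile*.py scripts; the
function names and the sample text are assumptions made for
illustration, and the comment stripping is deliberately naive (it treats
any '#' as the start of a comment, even inside a string literal).

```python
import io


def read_each_line(f):
    """Task 1: read every line with readline() and discard it."""
    count = 0
    while True:
        line = f.readline()
        if not line:          # empty string means end of file
            break
        count += 1
    return count


def read_all_lines(f):
    """Task 2: read all lines in one call and discard them."""
    return len(f.readlines())


def read_stripped_lines(f):
    """Task 3: strip comments and whitespace, ignore blank lines."""
    kept = 0
    for line in f:
        pos = line.find("#")  # naive: ignores '#' inside string literals
        if pos != -1:
            line = line[:pos]
        line = line.strip()
        if line:              # only non-blank lines are counted
            kept += 1
    return kept


sample = "import os\n\n# a comment\nx = 1  # trailing\n   \n"
print(read_each_line(io.StringIO(sample)))       # 5
print(read_all_lines(io.StringIO(sample)))       # 5
print(read_stripped_lines(io.StringIO(sample)))  # 2
```

Each function returns a count only so there is something to check; the
benchmark tasks themselves throw the lines away.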

I implemented each basic task in the most obvious and straightforward
way with Perl, and implemented tasks 1 and 3 in a couple of different
ways with Python.  Here are the results:

Script          LoC   CPU time   Elapsed   Method
readfile1.pl      4       0.27      0.44   "1 while <F>"
readfile1.py      6       3.72      4.37   "while 1: readline()"
readfile1a.py     4       7.34      8.24   FileInput (one per file)
readfile1b.py     3       7.20      8.07   FileInput (one for whole list)
readfile1c.py     5       8.50      9.53   TextFile (all options off)

readfile2.pl      4       0.45      0.63   "@lines = <F>"
readfile2a.pl     1       0.44      0.73   "@all_lines = <>"
readfile2.py      4       0.51      1.11   "lines = readlines()"

readfile3.pl      8       1.01      1.28   chomp, regex tricks to trim
readfile3.py     13       7.30      8.10   string.find, string.rstrip
readfile3a.py     6      11.09     12.24   TextFile (default options)

(Lines-of-code excludes comments, blanks, curly braces, and imports.)

Conclusion: FileInput and TextFile are indeed dogs, but TextFile isn't
enormously worse than FileInput when it's not doing any input-munging.
TextFile is considerably slower than rolling your own munge-every-line
code every time.  The fastest way to read a file in Python is with
".readlines()"; surprisingly, in Perl it's to read lines individually.
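
Anyone who wants to repeat the line-at-a-time versus ".readlines()"
comparison on their own machine could use something like the following
sketch. It is written for modern Python 3, it is not the harness that
produced the numbers above, and the names time_reader,
count_with_readline and count_with_readlines are hypothetical. It
reports CPU and elapsed time separately, as the table does.

```python
import os
import tempfile
import time


def count_with_readline(path):
    """Read one line at a time; return the number of lines."""
    n = 0
    with open(path) as f:
        while True:
            if not f.readline():
                break
            n += 1
    return n


def count_with_readlines(path):
    """Read the whole file in one call; return the number of lines."""
    with open(path) as f:
        return len(f.readlines())


def time_reader(reader, paths):
    """Return (total_lines, cpu_seconds, elapsed_seconds) for one reader."""
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    total = sum(reader(p) for p in paths)
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return total, cpu, elapsed


# Usage: a throwaway 1000-line file stands in for the real test data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as f:
        f.write("line\n" * 1000)
    results = {
        reader.__name__: time_reader(reader, [path])
        for reader in (count_with_readline, count_with_readlines)
    }

for name, (total, cpu, elapsed) in results.items():
    print(f"{name:22s} {total:6d} lines  cpu={cpu:.4f}  elapsed={elapsed:.4f}")
```

A file this small will not show a meaningful difference; pointing
time_reader at a large list of real files is what makes the numbers
worth reading.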

Does anybody care?  I do -- the reason I wrote TextFile was to process a
file that is currently ~450k (~20,000 lines); it currently takes ~30 sec
to process using CPython.  After this little benchmarking session, I'm
starting to wonder if there's a better way.  (What?!?  Write a C
extension to read a text file??!  ;-)

        Greg
-- 
Greg Ward - software developer                gward@mems-exchange.org
MEMS Exchange / CNRI                           voice: +1-703-262-5376
Reston, Virginia, USA                            fax: +1-703-262-5367