[Python-Dev] Better text processing support in py2k?

Skip Montanaro skip@mojam.com (Skip Montanaro)
Tue, 28 Dec 1999 16:26:53 -0600 (CST)


    Andrew> True, but note that you can compile Python with WITHOUT_COMPLEX
    Andrew> defined to remove complex numbers.

That's true, but that wasn't my point.  I'm not arguing for or against space
efficiency, just that the rather timeworn argument about not doing
anything special to support text processing because Python is a general
purpose language is a red herring.

    >> 1. When using something like the simple file i/o idiom
    >> for line in f.readlines():
    >>   dofunstuff(line)
    >> the programmer should not have to care how big the file is.

    Andrew> What about 'for line in fileinput.input()', which already
    Andrew> exists?  (Hmmm... if you have an already open file object, I
    Andrew> don't think you can pass it to fileinput.input(); maybe that
    Andrew> should be fixed.)

Well, a couple reasons jump to mind:

   1. fileinput.FileInput isn't particularly efficient.  At its heart, its
      __getitem__ method makes a simple readline() call for every line
      instead of buffering a chunk of lines obtained with
      readlines(sizehint).  This can be fixed, but I'm not sure what would
      happen to its semantics.

   2. As you pointed out, it's not all that general.

My point, not at all well stated, is that the programmer shouldn't have to
worry (much?) about the conditions under which he does file i/o.   Right
now, if I know the file is small(ish), I can do

    for line in f.readlines():
        dofunstuff(line)

but I have to know that the file won't be big, because readlines() will
behave badly (perhaps even generate a MemoryError exception) if the file is
large.  In that case, I have to fall back to the safer (and slower)

    line = f.readline()
    while line:
        dofunstuff(line)
        line = f.readline()

or the more efficient, but more cumbersome

    lines = f.readlines(sizehint)
    while lines:
        for line in lines:
            dofunstuff(line)
        lines = f.readlines(sizehint)

That's three separate idioms the programmer has to be aware of when writing
code to read a text file based upon the perceived need for speed, memory
usage and desired clarity:

    fast/memory-intensive/clear
    slow/memory-conserving/not-as-clear
    fast/memory-conserving/fairly-muddy

Any particular reason that the readlines method can't return an iterator that
supports __getitem__ and buffers input?  (Again, remember this is for py2k,
so the potential breakage such a change might cause is a consideration, but
not a showstopper.)
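
A rough sketch of what such an object might look like follows.  The
BufferedLines name and the default sizehint are made up for illustration
and are not an existing API.  It assumes only that the file object
supports readlines(sizehint) with a positive sizehint, and it serves lines
in order through __getitem__, refilling its buffer one chunk at a time:

```python
import io


class BufferedLines:
    """Hypothetical wrapper: sequential line access with chunked reads."""

    def __init__(self, f, sizehint=64 * 1024):
        self.f = f
        self.sizehint = sizehint  # must be positive to limit chunk size
        self.lines = []           # lines in the current chunk
        self.start = 0            # overall index of self.lines[0]

    def __getitem__(self, i):
        offset = i - self.start
        if offset < 0:
            # Earlier chunks have been discarded to save memory.
            raise ValueError("lines can only be read in order")
        while offset >= len(self.lines):
            # Current chunk is exhausted: drop it and read the next one.
            self.start += len(self.lines)
            self.lines = self.f.readlines(self.sizehint)
            if not self.lines:
                # End of file; IndexError is what ends a for loop.
                raise IndexError(i)
            offset = i - self.start
        return self.lines[offset]


def demo():
    # A tiny sizehint forces several refills even on a short input.
    f = io.StringIO("one\ntwo\nthree\n")
    return [line.rstrip("\n") for line in BufferedLines(f, sizehint=4)]
```

With something like this the loop reads the same as the first idiom,

    for line in BufferedLines(f):
        dofunstuff(line)

while holding only about one chunk of lines in memory at a time.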

    Andrew> On a vaguely related note, since there are many things like
    Andrew> parser generators and XML stuff and mxTextTools, I've been
    Andrew> speculating about a text processing topic guide.  If you know of
    Andrew> Python packages related to text processing, please send me a
    Andrew> private e-mail with a link.

This sounds like a good idea to me.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...