[Python-Dev] Better text processing support in py2k?

Skip Montanaro skip@mojam.com (Skip Montanaro)
Tue, 28 Dec 1999 15:41:33 -0600


It just occurred to me as I was replying to a request on the main list that
Python's text handling capabilities could be a bit better than they are.
This will probably not come as a revelation to many of you, but I finally
put it together with the standard argument against beefing things up

    One fix would be to add regular expressions to the language core and
    have special syntax for them, as Perl has done. However, I don't like
    this solution because Python is a general-purpose language, and regular
    expressions are used for the single application domain of text
    processing. For other application domains, regular expressions may be of
    no interest, and you might want to remove them to save memory and code
    size.

and the observation that Python already supports some builtin objects and
syntax that are fairly specific to application domains much more restricted
than text processing.

I stole the above quote from Andrew Kuchling's Python Warts page, which I
also happened to read earlier today.

What AMK says makes perfect sense until you examine some of the other things
already in the language, like the Ellipsis object and complex numbers.  If
I recall correctly, both were added as a result of the NumPy package
development.

I have nothing against ellipses or complex numbers.  They are fine
first-class objects that should remain in the language.  But I have never
used either one in my day-to-day work.  On the other hand, I read files and
manipulate them with regular expressions all the time.  I rather suspect
that more people use Python for some sort of text processing than for any
other single application domain.  Python should be good at it.

While I don't want to turn Python into Perl, I would like to see it do a
better job of what most people probably use the language for.  Here is a
very short list of things I think need attention:

    1. When using something like the simple file i/o idiom

       for line in f.readlines():
           dofunstuff(line)

       the programmer should not have to care how big the file is.  It
       should just work in a reasonably efficient manner without gobbling up
       all of memory.  I realize this may require some change to the syntax
       of the common idiom.  (See the file-reading sketch after this list
       for the sort of workaround required today.)

    2. The re module needs to be sped up, if not to catch up with Perl, then
       at least to catch up with the deprecated regex module.  Depending on
       how far people want to go, adding some language syntax to support
       regular expressions might be in order, though I don't find that as
       compelling as adding complex numbers was.  Another possibility, now
       that Barry Warsaw has opened the floodgates with string methods, is
       to add regular expression methods to strings (see the sketch of
       string regex methods after this list).

    3. I've not yet used it, but I am told the pattern matching in
       Marc-Andre Lemburg's mxTextTools
       (http://starship.python.net/crew/lemburg/) is both powerful and
       efficient (though it certainly appears complex).  Perhaps it deserves
       consideration for incorporation into the core Python distribution.
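
To make the first point concrete, here is a minimal sketch of the kind of
workaround you have to write today to keep memory bounded: pull lines out
one at a time with readline(), or in fixed-size batches using the optional
sizehint argument to readlines().  The dofunstuff name is just the
placeholder from the idiom above.

    def process(f, dofunstuff):
        # Read one line at a time; readline() returns '' only at EOF,
        # so memory use stays bounded no matter how large the file is.
        while 1:
            line = f.readline()
            if not line:
                break
            dofunstuff(line)

    def process_chunked(f, dofunstuff):
        # Or grab roughly 64K worth of lines per call via the sizehint
        # argument to readlines(), which amortizes the per-call overhead.
        while 1:
            lines = f.readlines(65536)
            if not lines:
                break
            for line in lines:
                dofunstuff(line)

Neither version is as readable as the two-line readlines() idiom above,
which is exactly the problem.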
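
As for regular expression methods on strings, here is a hedged sketch of
what that might look like.  The method names below (re_sub and re_search)
are purely hypothetical, invented here for illustration; the first function
shows how the same operation is spelled today with the re module.

    import re

    def reformat_dates(line):
        # Today: rewrite mm/dd/yyyy dates as yyyy-mm-dd with the re module.
        return re.sub(r"(\d+)/(\d+)/(\d+)", r"\3-\1-\2", line)

    # Hypothetical spelling if strings grew regular expression methods
    # (again, these method names are invented for illustration only):
    #
    #     line.re_sub(r"(\d+)/(\d+)/(\d+)", r"\3-\1-\2")
    #     if line.re_search(r"^\s*#"):
    #         pass    # skip comment lines

Whether the convenience justifies widening the string interface that much
is, of course, a separate question.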

I'm sure other people will come up with other suggestions.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...