[Python-Dev] Better text processing support in py2k?
Skip Montanaro
skip@mojam.com (Skip Montanaro)
Tue, 28 Dec 1999 15:41:33 -0600
It just occurred to me, as I was replying to a request on the main list, that
Python's text handling capabilities could be a bit better than they are.
This will probably not come as a revelation to many of you, but I finally
put it together with the standard argument against beefing things up
    One fix would be to add regular expressions to the language core and
    have special syntax for them, as Perl has done. However, I don't like
    this solution because Python is a general-purpose language, and regular
    expressions are used for the single application domain of text
    processing. For other application domains, regular expressions may be of
    no interest, and you might want to remove them to save memory and code
    size.
and the observation that Python does support some builtin objects and syntax
that are fairly specific to some much more restricted application domains
than text processing.
I stole the above quote from Andrew Kuchling's Python Warts page, which I
also happened to read earlier today.
What AMK says makes perfect sense until you examine some of the other things
that are in the language, like the Ellipsis object and complex numbers. If
I recall correctly, both were added as a result of the NumPy package
development.
I have nothing against ellipses or complex numbers. They are fine first
class objects that should remain in the language. But I have never used
either one in my day-to-day work. On the other hand, I read files and
manipulate them with regular expressions all the time. I rather suspect
that more people use Python for some sort of text processing than any other
single application domain. Python should be good at it.
While I don't want to turn Python into Perl, I would like to see it do a
better job of what most people probably use the language for. Here is a
very short list of things I think need attention:
1. When using something like the simple file i/o idiom
        for line in f.readlines():
            dofunstuff(line)
the programmer should not have to care how big the file is. It
should just work in a reasonably efficient manner without gobbling up
all of memory. I realize this may require some change to the syntax
of the common idiom.
2. The re module needs to be sped up, if not to catch up with Perl, then
to catch up with the deprecated regex module. Depending on how far
people want to go with things, adding some language syntax to support
regular expressions might be in order. I don't see that as being as
compelling as adding complex numbers, however. Another possibility,
now that Barry Warsaw has opened the floodgates, is to add regular
expression methods to strings.
3. I've not yet used it, but I am told the pattern matching in
Marc-Andre Lemburg's mxTextTools
(http://starship.python.net/crew/lemburg/) is both powerful and
efficient (though it certainly appears complex). Perhaps it deserves
consideration for incorporation into the core Python distribution.
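
To make the first item concrete, here is a minimal sketch of a loop that
holds only one line in memory at a time, using an explicit readline() call
instead of readlines(). It is an illustration, not a proposal for what the
new idiom should look like. The name process_lines is hypothetical,
dofunstuff stands in for whatever per-line work is needed, and the in-memory
file object is only there so the example runs on a current interpreter
without touching disk.

```python
import io

def process_lines(f, handler):
    """Call handler on each line of f, reading one line at a time."""
    count = 0
    while True:
        line = f.readline()      # returns an empty string only at end of file
        if not line:
            break
        handler(line)
        count += 1
    return count

# Hypothetical per-line worker: collect each line without its newline.
seen = []
def dofunstuff(line):
    seen.append(line.rstrip("\n"))

# Stand-in for a real file opened with open().
f = io.StringIO("first\nsecond\nthird\n")
n = process_lines(f, dofunstuff)
```

The cost is the extra bookkeeping in the loop body, which is exactly the
clutter the simple idiom avoids.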
I'm sure other people will come up with other suggestions.
Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098 | Python: Programming the way Guido indented...