Tim Peters wrote:
[Skip Montanaro, wants nicer text facilities]
While I don't want to turn Python into Perl, I would like to see it do a better job of what most people probably use the language for. Here is a very short list of things I think need attention:
1. [*A* clear way to do memory- and time-efficient textfile input]
The Python QIO extension module is much easier to port but less compatible (it doesn't use stdio, so QIO-opened files don't play well with others) and slower (although that's likely repairable -- he's got two passes over the buffer where one hairier pass should suffice).
What is QIO?
Depending on how far people want to go with things, adding some language syntax to support regular expressions might be in order. ... 3. I've not yet used it, but I am told the pattern matching in Marc-Andre Lemburg's mxTextTools (http://starship.python.net/crew/lemburg/) is both powerful and efficient (though it certainly appears complex). Perhaps it deserves consideration for incorporation into the core Python distribution.
It's not complex, it's complicated -- and *that's* what makes it un-Pythonic <wink>. Tony Ibbs has written a friendly wrapper around mxTextTools that suppresses much of the non-essential complication. OTOH, if you go into this with a regexp mindset, it will run much slower than a real regexp package, because the bulk of the latter is devoted to doing optimization; mxTextTools is WYSIWYG (it screams if you code to its strengths, but crawls if you e.g. try to implement naive backtracking).
All true. mxTextTools provides the tools, not the magic. But this is also its strength: you can optimize the hell out of your particular parsing requirement without having to think about how the RE optimizer works.
You should go to the REBOL site and look at the description of REBOL's PARSE verb in the FAQ ... mumble, mumble ... at
Here's an example pulled from that page (this is a REBOL code fragment):
    digit: charset "0123456789"
    expr: [term ["+" | "-"] expr | term]
    term: [factor ["*" | "/"] term | factor]
    factor: [primary "**" factor | primary]
    primary: [value | "(" expr ")"]
    value: [digit value | digit]
    parse "1 + 2 ** 9" expr
There hasn't been a pattern scheme this clean, convenient or powerful since SNOBOL4. It exploits REBOL's Forth-like (lack of!) syntax, and Smalltalk-like penchant for passing around thunks (anonymous closures -- "[...]" in REBOL builds a lexically-scoped entity called "a block", which can be treated as code (executed) or data (manipulated like a Python list) at will).
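For readers who don't read REBOL, the grammar in that fragment can be approximated in plain Python as a hand-written recognizer. This is only a sketch under stated assumptions: it is not how REBOL's PARSE works internally, the function names simply mirror the rule names, `literal()` and `matches()` are helpers invented here, and whitespace is stripped up front as a rough stand-in for PARSE's handling of whitespace in string input.

```python
DIGITS = frozenset("0123456789")

def literal(s, i, tok):
    # Hypothetical helper: match the fixed string tok at position i.
    return i + len(tok) if s.startswith(tok, i) else None

def value(s, i):
    # value: [digit value | digit]  -- one or more digits
    j = i
    while j < len(s) and s[j] in DIGITS:
        j += 1
    return j if j > i else None

def primary(s, i):
    # primary: [value | "(" expr ")"]
    j = value(s, i)
    if j is not None:
        return j
    j = literal(s, i, "(")
    if j is None:
        return None
    j = expr(s, j)
    if j is None:
        return None
    return literal(s, j, ")")

def factor(s, i):
    # factor: [primary "**" factor | primary]
    j = primary(s, i)
    if j is None:
        return None
    k = literal(s, j, "**")
    if k is not None:
        k = factor(s, k)
        if k is not None:
            return k
    return j  # fall back to the second alternative

def term(s, i):
    # term: [factor ["*" | "/"] term | factor]
    j = factor(s, i)
    if j is None:
        return None
    for op in ("*", "/"):
        k = literal(s, j, op)
        if k is not None:
            k = term(s, k)
            if k is not None:
                return k
    return j

def expr(s, i):
    # expr: [term ["+" | "-"] expr | term]
    j = term(s, i)
    if j is None:
        return None
    for op in ("+", "-"):
        k = literal(s, j, op)
        if k is not None:
            k = expr(s, k)
            if k is not None:
                return k
    return j

def matches(text):
    # Assumption: drop all whitespace before matching.
    s = "".join(text.split())
    return expr(s, 0) == len(s)
```

Each rule takes a string and a position and returns the position after the match, or None on failure. The seven lines of REBOL turn into roughly sixty lines of Python, which is much of the point being made about PARSE.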
Looks nice indeed, but how does executable code fit into that definition? (mxTextTools allows you to write your own parsing elements in Python, BTW; it should be possible to use those mechanisms to achieve a similar integration.)
BTW, the mxTextTools engine could be used to get blazing implementations of the primary Searcher methods (it excels at simple analysis). OTOH, making lots of calls to analyze short strings is slow.
That's why mxTextTools converts these search idioms into byte codes which it executes at C level. Some future version will even "precompile" the tuple input and then omit the type checks during the search... that should give another noticeable speedup. Note that recursion etc. can be done at C level too -- Python function calls are not needed.
The only clean solutions to that are Perl's and Icon's (build everything into one language so the compiler can optimize stuff away), and REBOL's (make no distinction between code and data, so that code can be analyzed & optimized at runtime -- and build the entire implementation around making closures and calls supernaturally fast).
Just for kicks, here is the mysplit() function using mxTextTools:
    from mx.TextTools import *

    table = (
        # Match all whitespace
        (None,AllInSet,whitespace_set,+1),
        # Match and tag all non-whitespace
        ('text',AllInSet + AppendMatch,nonwhitespace_set,+1),
        # Loop until EOF
        (None,EOF,Here,-2),
        )
The timings:

    mysplit:      5.84 sec.
    string.split: 3.62 sec.
Note that you can customize the above to split text at any character set you like, not just whitespace... without compiling or writing C code. The function mx.TextTools.setsplit() provides this functionality as a pure C function.
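To make the control flow of the tag table concrete without needing mxTextTools installed, here is a rough pure-Python equivalent of its three steps: skip characters in the separator set, collect a run of characters outside it, and loop until the end of the text. The name `split_on_set` is made up for this illustration; it is neither part of mxTextTools nor a stand-in for setsplit()'s actual signature, and it will be far slower than the C-level engine.

```python
def split_on_set(text, separators):
    """Split text at runs of any characters found in separators.

    Hypothetical helper that mimics the three-entry tag table in pure Python.
    """
    seps = frozenset(separators)
    pieces = []
    i, n = 0, len(text)
    while i < n:                      # "Loop until EOF"
        # "Match all whitespace": skip a run of separator characters.
        while i < n and text[i] in seps:
            i += 1
        # "Match and tag all non-whitespace": collect a run of other characters.
        start = i
        while i < n and text[i] not in seps:
            i += 1
        if i > start:
            pieces.append(text[start:i])
    return pieces
```

For example, `split_on_set("a,b;;c", ",;")` splits at commas and semicolons, and passing a whitespace set gives the same pieces as the built-in whitespace split on the same text.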