Re: [Python-Dev] Better text processing support in py2k?

Dec. 30, 1999

Tim Peters wrote:
...
[Skip Montanaro, wants nicer text facilities]
...
While I don't want to turn Python into Perl, I would like to see
it do a better job of what most people probably use the language
for.  Here is a very short list of things I think need attention:
1. [*A* clear way to do memory- and time-efficient textfile
        input]
...
The Python QIO extension module is much easier to port but less compatible
(it doesn't use stdio, so QIO-opened files don't play well with others) and
slower (although that's likely repairable -- he's got two passes over the
buffer where one hairier pass should suffice).
What is QIO ?
...
...
Depending how far people want to go with things, adding some
language syntax to support regular expressions might be in order.
...
    3. I've not yet used it, but I am told the pattern matching in
       Marc-Andre Lemburg's mxTextTools
       (http://starship.python.net/crew/lemburg/)
       is both powerful and efficient (though it certainly appears
       complex).  Perhaps it deserves consideration for
       incorporation into the core Python distribution.
It's not complex, it's complicated -- and *that's* what makes it un-Pythonic
<wink>.  Tony Ibbs has written a friendly wrapper around mxTextTools that
suppresses much of the non-essential complication.  OTOH, if you go into
this with a regexp mindset, it will run much slower than a real regexp
package, because the bulk of the latter is devoted to doing optimization;
mxTextTools is WYSIWYG (it screams if you code to its strengths, but crawls
if you e.g. try to implement naive backtracking).
All true. mxTextTools provides the tools, not the magic. But this
is also its strength: you can optimize the hell out of your particular
parsing requirement without having to think about how the RE optimizer
works.
...
You should go to the REBOL site and look at the description of REBOL's PARSE
verb in the FAQ ... mumble, mumble ... at
http://www.rebol.com/faq.html#11550948
Here's an example pulled from that page (this is a REBOL code fragment):
    digit: charset "0123456789"
    expr: [term ["+" | "-"] expr | term]
    term: [factor ["*" | "/"] term | factor]
    factor: [primary "**" factor | primary]
    primary: [value | "(" expr ")"]
    value: [digit value | digit]
    parse "1 + 2 ** 9" expr
There hasn't been a pattern scheme this clean, convenient or powerful since
SNOBOL4.  It exploits REBOL's Forth-like (lack of!) syntax, and
Smalltalk-like penchant for passing around thunks (anonymous closures --
"[...]" in REBOL builds a lexically-scoped entity called "a block", which
can be treated as code (executed) or data (manipulated like a Python list)
at will).
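
For readers who don't know REBOL, the grammar in that fragment can be
approximated by a small recursive-descent recognizer in plain Python. This is
only a sketch: the function names are made up for illustration, it merely
accepts or rejects input (it builds no parse tree), and it assumes whitespace
is insignificant, which is how the sample input "1 + 2 ** 9" is treated here.

```python
# Hypothetical Python analogue of the REBOL rules; none of these names
# come from REBOL or from any library.
DIGITS = set("0123456789")

def parse_value(s, i):
    """value: [digit value | digit] -- one or more digits."""
    j = i
    while j < len(s) and s[j] in DIGITS:
        j += 1
    return j if j > i else None

def parse_primary(s, i):
    """primary: [value | "(" expr ")"]"""
    j = parse_value(s, i)
    if j is not None:
        return j
    if i < len(s) and s[i] == "(":
        k = parse_expr(s, i + 1)
        if k is not None and k < len(s) and s[k] == ")":
            return k + 1
    return None

def parse_factor(s, i):
    """factor: [primary "**" factor | primary]"""
    j = parse_primary(s, i)
    if j is None:
        return None
    if s.startswith("**", j):
        k = parse_factor(s, j + 2)
        if k is not None:
            return k
    return j  # fall back to the second alternative

def parse_term(s, i):
    """term: [factor ["*" | "/"] term | factor]"""
    j = parse_factor(s, i)
    if j is None:
        return None
    if j < len(s) and s[j] in "*/":
        k = parse_term(s, j + 1)
        if k is not None:
            return k
    return j

def parse_expr(s, i):
    """expr: [term ["+" | "-"] expr | term]"""
    j = parse_term(s, i)
    if j is None:
        return None
    if j < len(s) and s[j] in "+-":
        k = parse_expr(s, j + 1)
        if k is not None:
            return k
    return j

def matches(text):
    """Return True if the whole text is accepted by the expr rule."""
    s = "".join(text.split())  # assumption: whitespace carries no meaning
    return parse_expr(s, 0) == len(s)
```

Each Python function returns the index just past what it matched, or None on
failure. That return-an-index convention is what the REBOL block syntax hides,
and it is a large part of why the REBOL version reads so much more cleanly.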
Looks nice indeed, but how does executable code fit into
that definition ? (mxTextTools allows you to write your own
parsing elements in Python, BTW; it should be possible to
use those mechanisms to achieve a similar integration.)
...
...
BTW, the mxTextTools engine could be used to get blazing implementations of
the primary Searcher methods (it excels at simple analysis).  OTOH, making
lots of calls to analyze short strings is slow.
That's why mxTextTools converts these search idioms into byte codes
which it executes at C level. Some future version will even "precompile"
the tuple input and then omit the type checks during the search...
that should give another noticeable speedup. Note that recursion
etc. can be done at C level too -- Python function calls are not
needed.
...
The only clean solutions to
that are Perl's and Icon's (build everything into one language so the
compiler can optimize stuff away), and REBOL's (make no distinction between
code and data, so that code can be analyzed & optimized at runtime -- and
build the entire implementation around making closures and calls
supernaturally fast).
Just for kicks, here is the mysplit() function using mxTextTools:

from mx.TextTools import *

table = (
    # Match all whitespace
    (None,AllInSet,whitespace_set,+1),
    # Match and tag all non-whitespace
    ('text',AllInSet + AppendMatch,nonwhitespace_set,+1),
    # Loop until EOF
    (None,EOF,Here,-2),
    )

def mysplit(text):
    # tag() returns (success, taglist, nextindex); keep only the tag list
    return tag(text,table)[1]

The timings:
 mysplit: 5.84 sec.
 string.split: 3.62 sec.

Note that you can customize the above to split text at any
character set you like, not just whitespace... without
compiling or writing C code. The function mx.TextTools.setsplit()
provides this functionality as a pure C function.
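
To make the "any character set" point concrete without requiring mxTextTools,
here is a rough pure-Python sketch of the same two-state loop the table
describes: skip a run of separator characters, then collect a run of
non-separator characters, until the end of the text. The name charset_split
and its parameters are hypothetical; this is not the mxTextTools API and it
will be far slower than either the tagging engine or setsplit().

```python
import string

def charset_split(text, separators=string.whitespace):
    """Split text at runs of characters drawn from `separators`.

    Hypothetical helper for illustration only; it mirrors the logic of the
    tagging table but is not part of mxTextTools.
    """
    seps = set(separators)
    words = []
    i, n = 0, len(text)
    while i < n:
        # Skip a run of separator characters (first table entry).
        while i < n and text[i] in seps:
            i += 1
        # Collect a run of non-separator characters (second table entry).
        start = i
        while i < n and text[i] not in seps:
            i += 1
        if i > start:
            words.append(text[start:i])
    return words
```

With the default separators this behaves like a plain whitespace split on
ASCII text; passing e.g. ",;" as the second argument splits on commas and
semicolons instead.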

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                            Get ready to party !
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
