[Python-Dev] Better text processing support in py2k?

M.-A. Lemburg mal@lemburg.com
Fri, 31 Dec 1999 12:18:57 +0100


Tim Peters wrote:
> 
> [M.-A. Lemburg]
> > What is QIO ?
> 
> See DejaNews (I don't save URLs).  "Quick" line-oriented text input adapted
> from INN.  Someone rewrote that as a Python extension module.

Ok, thanks.
 
> >>     http://www.rebol.com/faq.html#11550948
> 
> > Looks nice indeed, but how does executable code fit into
> > that definition ?
> 
> See the URL above I didn't save <wink>.  PARSE's "pattern" argument is a
> block.  Blocks can be (& often are) nested.  Whether any given block is code
> or data is all the same to REBOL, so passing nested code blocks in PARSE's
> pattern argument is easy.  Because blocks are lexically scoped, assignments
> (etc) inside a block are (well, can be) visible to its context; etc.  It's a
> very Lispish approach.  REBOL is essentially Scheme under the covers, but
> with syntax much more like Forth's (whitespace-separated strings of
> arbitrary non-whitespace characters, with few pre-assigned meanings or
> restrictions -- in fact, it's impossible for a compiler to determine where a
> REBOL function call begins or ends!  can't be known until runtime).

If I understand the concept correctly, I think Python could do
pretty much the same thing. The bummer is of course the need
for new keywords and byte codes (although these could be
split out into a separate text scanning engine). Using Python
function calls would slow down things to an extent that would
render the added functionality useless, well IMHO anyways ;-)

> > (mxTextTools allows you to write your own parsing elements
> > in Python, BTW; it should be possible to use those mechanisms
> > to achieve a similar intergration.)
> 
> It can't capture the flavor -- although I don't know that it needs to
> <wink>.  There's no distinction between "the pattern language" and "the
> computational language" in REBOL or Icon, and it's hard to explain what a
> maddening distinction that can be once you've lived without it.  mxTextTools
> embedding would feel more like Icon, where the matching engine is fully
> exposed to the programmer (REBOL hides it, allowing only "approved"
> interactions).

Of course its hard for a Turing Machine to capture the flavor
of any high level language :-) When you're programming
the mxTextTools Tagging Engine directly you feel like writing
assembler... but things are moving in the right direction:
Tony Ibbs has a nice meta-language and M.C. Fletcher his
SimpleParse to cover up these insufficiencies.
 
> >> OTOH, making lots of calls to analyze short strings is slow.
> 
> > That's why mxTextTools converts these search idioms into byte
> > codes which it executes at C level. Some future version will
> > even "precompile" the tuple input and then omit the type checks
> > during the search...that should give another noticeable speedup.
> > Note that recursion etc. can be done at C level too -- Python
> > function calls are not needed.
> 
> That's also the curse of having distinct languages; e.g., Python already had
> recursion, but you needed to reimplement it in a different way with
> different syntax and different rules in your pattern language.  In Icon etc,
> there's no difference between a recursive pattern and a recursive function,
> except in *what* it computes.  The machinery is all the same, and both more
> powerful and easier to learn because of that.

Agreed.
 
> > ...
> > Just for kicks, here is the mysplit() function using mxTextTools:
> >
> > from mx.TextTools import *
> >
> > table = (
> >     # Match all whitespace
> >     (None,AllInSet,whitespace_set,+1),
> >     # Match and tag all non-whitespace
> >     ('text',AllInSet + AppendMatch,nonwhitespace_set,+1),
> >     # Loop until EOF
> >     (None,EOF,Here,-2),
> >     )
> >
> > def mysplit(text):
> >
> >     return tag(text,table)[1]
> >
> > The timings:
> >  mysplit: 5.84 sec.
> >  string.split: 3.62 sec.
> >
> > Note that you can customize the above to split text at any
> > character set you like, not just whitespace... without
> > compiling or writing C code.
> 
> That's equally true of the example I posted <wink>.  Now what if I wanted to
> stop splitting right after I find a keyword, recognized as such because it's
> a key in some passed-in dictionary?  In my example, I make an obvious local
> code change, from
> 
>     while s.notmany(white):  # consume non-whitespace
>         result.append(s.get_match())
>         s.many(white)
> 
> to
> 
>     while s.notmany(white):  # consume non-whitespace
>         word = s.get_match()
>         result.append(word)
>         if dictionary.has_key(word):
>             break
>         s.many(white)
> 
> What does it do to your example? 

You'd replace the 'text' tagobj with a callable object and
write AllInSet + CallTag as command. The Tagging Engine will
then call the object with arguments (taglist,text,l,r,subtags)
and let it decide what to do.

In your example it would check the dictionary and raise an
exception in case a keyword is found to stop any further
scanning. If it's not a keyword, it would simply append
the found string to the taglist and return None.

Here's the code:

from mx.TextTools import *

import exceptions

stoplist = {'abc':1, 'def':1}

class KeywordFound(exceptions.StandardError):
    def __init__(self, taglist):
        self.taglist = taglist

def callable(taglist,text,l,r,subtags):

    taglist.append(text[l:r])
    if stoplist.has_key(text[l:r]):
        raise KeywordFound(taglist)

table = (
    # Match all whitespace
    (None,AllInSet,whitespace_set,+1),
    # Match and tag all non-whitespace
    (callable,AllInSet + CallTag,nonwhitespace_set,+1),
    # Loop until EOF
    (None,EOF,Here,-2),
    )

def mysplitex(text):

    try:
        return tag(text,table)[1]
    except KeywordFound,data:
        return data.taglist

> Or what if the target string isn't "a
> string" (the code I posted only assumes the "str" object responds to
> indexing and slicing -- any buffer object is fine -- so my example doesn't
> change at all)? 

The current version only handles string objects, but I am
already beginning to convert all the APIs in mxTextTools to
"s#" or "t#" style (can't decide which to use... "s#" is great
for processing raw data, while "t#" more closely refers to
text processing).

> Or what if you need to pass the tokens on as they're found,
> pipeline style?  Etc.  This is why I do complex string processing in Icon
> <0.9 wink>.

You can have all that extra magic via callable tag objects
or callable matching functions. It's not exactly nice to
write, but I'm sure that a meta-language could do the 
conversions for you.
 
> OTOH, at what it does well, mxTextTools runs quicker than Icon.  Its biggest
> problem has always been that e.g. nobody knows what the hell
> 
>      (None,EOF,Here,-2),
> 
> *means* at first glance -- or third <wink>.

The structure of those tag tables is very simple:

(tagobject, command, argument[, jump offset in case of failure
			     [, jump offset in case of success]])
                               
Please remember that this is byte code, not some higher level
abstraction. The design is very much inverted from what you'd
usually do: design a nice language and then try to find suitable
set of byte codes to make it work as intended.

Anyway, I'll keep focussing on the speed aspect of mxTextTools;
others can focus on abstractions, so that eventually everybody
will be happy :-)

Happy New Year,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                            Get ready to party !
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/