
Tim Peters wrote:
[M.-A. Lemburg]
What is QIO ?
See DejaNews (I don't save URLs). "Quick" line-oriented text input adapted from INN. Someone rewrote that as a Python extension module.
Ok, thanks.
Looks nice indeed, but how does executable code fit into that definition ?
See the URL above I didn't save <wink>. PARSE's "pattern" argument is a block. Blocks can be (& often are) nested. Whether any given block is code or data is all the same to REBOL, so passing nested code blocks in PARSE's pattern argument is easy. Because blocks are lexically scoped, assignments (etc) inside a block are (well, can be) visible to its context; etc. It's a very Lispish approach. REBOL is essentially Scheme under the covers, but with syntax much more like Forth's (whitespace-separated strings of arbitrary non-whitespace characters, with few pre-assigned meanings or restrictions -- in fact, it's impossible for a compiler to determine where a REBOL function call begins or ends! can't be known until runtime).
If I understand the concept correctly, I think Python could do pretty much the same thing. The bummer is of course the need for new keywords and byte codes (although these could be split out into a separate text scanning engine). Using Python function calls would slow down things to an extent that would render the added functionality useless, well IMHO anyways ;-)
(mxTextTools allows you to write your own parsing elements in Python, BTW; it should be possible to use those mechanisms to achieve a similar intergration.)
It can't capture the flavor -- although I don't know that it needs to <wink>. There's no distinction between "the pattern language" and "the computational language" in REBOL or Icon, and it's hard to explain what a maddening distinction that can be once you've lived without it. mxTextTools embedding would feel more like Icon, where the matching engine is fully exposed to the programmer (REBOL hides it, allowing only "approved" interactions).
Of course its hard for a Turing Machine to capture the flavor of any high level language :-) When you're programming the mxTextTools Tagging Engine directly you feel like writing assembler... but things are moving in the right direction: Tony Ibbs has a nice meta-language and M.C. Fletcher his SimpleParse to cover up these insufficiencies.
OTOH, making lots of calls to analyze short strings is slow.
That's why mxTextTools converts these search idioms into byte codes which it executes at C level. Some future version will even "precompile" the tuple input and then omit the type checks during the search...that should give another noticeable speedup. Note that recursion etc. can be done at C level too -- Python function calls are not needed.
That's also the curse of having distinct languages; e.g., Python already had recursion, but you needed to reimplement it in a different way with different syntax and different rules in your pattern language. In Icon etc, there's no difference between a recursive pattern and a recursive function, except in *what* it computes. The machinery is all the same, and both more powerful and easier to learn because of that.
Agreed.
... Just for kicks, here is the mysplit() function using mxTextTools:
from mx.TextTools import *
table = ( # Match all whitespace (None,AllInSet,whitespace_set,+1), # Match and tag all non-whitespace ('text',AllInSet + AppendMatch,nonwhitespace_set,+1), # Loop until EOF (None,EOF,Here,-2), )
def mysplit(text):
return tag(text,table)[1]
The timings: mysplit: 5.84 sec. string.split: 3.62 sec.
Note that you can customize the above to split text at any character set you like, not just whitespace... without compiling or writing C code.
That's equally true of the example I posted <wink>. Now what if I wanted to stop splitting right after I find a keyword, recognized as such because it's a key in some passed-in dictionary? In my example, I make an obvious local code change, from
while s.notmany(white): # consume non-whitespace result.append(s.get_match()) s.many(white)
to
while s.notmany(white): # consume non-whitespace word = s.get_match() result.append(word) if dictionary.has_key(word): break s.many(white)
What does it do to your example?
You'd replace the 'text' tagobj with a callable object and write AllInSet + CallTag as command. The Tagging Engine will then call the object with arguments (taglist,text,l,r,subtags) and let it decide what to do. In your example it would check the dictionary and raise an exception in case a keyword is found to stop any further scanning. If it's not a keyword, it would simply append the found string to the taglist and return None. Here's the code: from mx.TextTools import * import exceptions stoplist = {'abc':1, 'def':1} class KeywordFound(exceptions.StandardError): def __init__(self, taglist): self.taglist = taglist def callable(taglist,text,l,r,subtags): taglist.append(text[l:r]) if stoplist.has_key(text[l:r]): raise KeywordFound(taglist) table = ( # Match all whitespace (None,AllInSet,whitespace_set,+1), # Match and tag all non-whitespace (callable,AllInSet + CallTag,nonwhitespace_set,+1), # Loop until EOF (None,EOF,Here,-2), ) def mysplitex(text): try: return tag(text,table)[1] except KeywordFound,data: return data.taglist
Or what if the target string isn't "a string" (the code I posted only assumes the "str" object responds to indexing and slicing -- any buffer object is fine -- so my example doesn't change at all)?
The current version only handles string objects, but I am already beginning to convert all the APIs in mxTextTools to "s#" or "t#" style (can't decide which to use... "s#" is great for processing raw data, while "t#" more closely refers to text processing).
Or what if you need to pass the tokens on as they're found, pipeline style? Etc. This is why I do complex string processing in Icon <0.9 wink>.
You can have all that extra magic via callable tag objects or callable matching functions. It's not exactly nice to write, but I'm sure that a meta-language could do the conversions for you.
OTOH, at what it does well, mxTextTools runs quicker than Icon. Its biggest problem has always been that e.g. nobody knows what the hell
(None,EOF,Here,-2),
*means* at first glance -- or third <wink>.
The structure of those tag tables is very simple: (tagobject, command, argument[, jump offset in case of failure [, jump offset in case of success]]) Please remember that this is byte code, not some higher level abstraction. The design is very much inverted from what you'd usually do: design a nice language and then try to find suitable set of byte codes to make it work as intended. Anyway, I'll keep focussing on the speed aspect of mxTextTools; others can focus on abstractions, so that eventually everybody will be happy :-) Happy New Year, -- Marc-Andre Lemburg ______________________________________________________________________ Y2000: Get ready to party ! Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/