
Tim Peters wrote:
[Gordon McMillan]
mxTextTools lets (encourages?) you to break all the rules about lex -> parse. If you can (& want to) put a good deal of the "parse" stuff into the scanning rules, you can get a speed advantage. You're also not constrained by the rules of BNF, if you choose to see that as an advantage :-).
My one successful use of mxTextTools came after using SPARK to figure out what I actually needed in my AST, and realizing that the ambiguities in the grammar didn't matter in practice, so I could produce an almost-AST directly.
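To give a feel for the "almost-AST straight from the scanner" idea, here is a rough sketch of the kind of tag table Gordon is describing -- written from memory against the mx.TextTools 2.x API, so treat the exact names and tuple layouts as approximate. It scans simple "key = value" assignments and hands back a nested taglist in one pass, with no separate token stream in between:

    from mx.TextTools import tag, AllIn, Is, EOF, Here, Table
    from mx.TextTools import a2z, A2Z, number, whitespace

    alpha = a2z + A2Z

    # One "key = value" assignment, with structure captured as it is scanned.
    assignment = (
        ('key',   AllIn, alpha),            # identifier on the left
        (None,    AllIn, whitespace, +1),   # optional spaces
        (None,    Is,    '='),              # literal '='
        (None,    AllIn, whitespace, +1),   # optional spaces
        ('value', AllIn, alpha + number),   # bare word or number on the right
    )

    # The whole input: assignments separated by whitespace/newlines.
    config = (
        (None,         AllIn, whitespace, +1, +1),   # skip blanks (optional)
        ('assignment', Table, assignment,  +1, -1),   # one assignment, then loop
        (None,         EOF,   Here),                  # succeed only at end of text
    )

    ok, taglist, end = tag("spam = 42\nham = eggs\n", config)
    # taglist now holds ('assignment', left, right, [('key', ...), ('value', ...)])
    # entries, i.e. the scanner already built the tree shape a parser normally would.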
I don't expect anyone will have much luck writing a fast lexer using mxTextTools *or* Python's regexp package unless they know quite a bit about how each works under the covers, and about how fast lexing is accomplished by DFAs. If you know both, you can build a DFA by hand and painfully instruct mxTextTools in the details of its construction, and get a very fast tokenizer (compared to what's possible with re), regardless of the number of token classes or the complexity of their definitions. Writing to mxTextTools directly is a lot like writing in an assembly language for a character-matching machine, with all the pains and potential joys that implies. If I were Eric, I'd use Flex <wink>.
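For anyone who hasn't hand-built a DFA before, the point is that a DFA looks at each input character exactly once and does a constant amount of work on it, no matter how many token classes there are or how hairy their definitions get -- there is no backtracking and no trying of alternatives in sequence, which is where naive regexp-based lexers lose time. Here is the shape of the machine in plain Python (purely illustrative -- mxTextTools would run the equivalent matching loop in C):

    # A tiny hand-built DFA with three accepting token classes.  Each character
    # is examined once; the per-character cost is a couple of dict lookups,
    # independent of the number of token classes.

    def char_class(ch):
        if ch.isalpha() or ch == '_':
            return 'letter'
        if ch.isdigit():
            return 'digit'
        if ch in ' \t\r\n':
            return 'space'
        return 'other'

    # state -> character class -> next state
    TRANSITIONS = {
        'start':   {'letter': 'in_name', 'digit': 'in_num', 'space': 'in_ws'},
        'in_name': {'letter': 'in_name', 'digit': 'in_name'},
        'in_num':  {'digit': 'in_num'},
        'in_ws':   {'space': 'in_ws'},
    }
    ACCEPTING = {'in_name': 'NAME', 'in_num': 'NUMBER', 'in_ws': 'WHITESPACE'}

    def tokenize(text):
        tokens = []
        i, n = 0, len(text)
        while i < n:
            state, start = 'start', i
            while i < n:
                next_state = TRANSITIONS[state].get(char_class(text[i]))
                if next_state is None:
                    break
                state, i = next_state, i + 1
            if state not in ACCEPTING:
                raise ValueError('bad character %r at %d' % (text[i], i))
            tokens.append((ACCEPTING[state], text[start:i]))
        return tokens

    print(tokenize('if eric owes 42 winks'))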
FYI, there are a few meta-languages that make life with mxTextTools easier, e.g. Mike Fletcher's SimpleParse (see the short sketch below the signature). The upcoming version 2.1 will also support Unicode and allow text jump targets, which boosts the readability of the tag tables a lot and makes hand-writing them much easier. The beta of 2.1 is available to subscribers of the egenix-users mailing list.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime, mxODBC, ...
Python Consulting:  http://www.egenix.com/
Python Software:    http://www.egenix.com/files/python/
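A taste of what SimpleParse buys you -- sketched from memory of the SimpleParse 2.x API, so names and grammar details may be slightly off. You write an EBNF-style declaration as plain text and SimpleParse compiles it into mxTextTools tag tables for you:

    from simpleparse.parser import Parser

    # EBNF-style declaration; SimpleParse turns this into tag tables.
    declaration = r'''
    ws     := [ \t\n]+
    name   := [a-zA-Z_]+
    number := [0-9]+
    token  := name / number / ws
    tokens := token+
    '''

    parser = Parser(declaration, 'tokens')
    success, children, next_char = parser.parse('spam 42 eggs')
    # children is a nested taglist much like the one mxTextTools' tag() returns.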