
Lexers are painful in Python. They hit the language in a weak spot created by the immutability of strings. I've found this an obstacle more than once, but then I'm a battle-scarred old compiler jock who attacks *everything* with lexers and parsers.
I think you're exaggerating the problem, or at least underestimating the re module. The re module is pretty fast! Reading a file line-by-line is very fast in Python 2.3 with the new "for line in open(filename)" idiom. I just scanned nearly a megabyte of ugly data (a Linux kernel) in 0.6 seconds using the regex '\w+', finding 177,000 words. The regex '(?:\d+|[a-zA-Z_]+)' took 1 second, finding 190,000 words. I expect that the list creation (one hit at a time) took more time than the matching.

--Guido van Rossum (home page: http://www.python.org/~guido/)
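Guido doesn't show his test script, but a minimal sketch of the kind of timing he describes might look like the following. The filename 'kernel.txt' and the harness itself are my assumptions, not his code:

    import re
    import time

    def count_matches(filename, pattern):
        # Read the whole file up front so only the scan is timed.
        data = open(filename).read()
        start = time.time()
        # findall builds the full result list, one hit at a time,
        # which is the list-creation cost Guido refers to.
        words = re.findall(pattern, data)
        return len(words), time.time() - start

    for pattern in (r'\w+', r'(?:\d+|[a-zA-Z_]+)'):
        count, seconds = count_matches('kernel.txt', pattern)  # hypothetical input file
        print('%-22s %d matches in %.2f seconds' % (pattern, count, seconds))

If the list creation really dominates, re.finditer (available since Python 2.2) yields one match object at a time without building a list, and should narrow the gap between the two patterns.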