Code that ought to run fast, but can't due to Python limitations.
Paul Rubin
Sat Jul 4 19:48:19 EDT 2009
John Nagle <nagle at animats.com> writes:
> A dictionary lookup (actually, several of them) for every
> input character is rather expensive. Tokenizers usually index into
> a table of character classes, then use the character class index in
> a switch statement.
Maybe you could use a regexp (and then have -two- problems...) to
find the token boundaries, then a dict to identify the actual token.
Tables of character classes seem a bit less attractive in the Unicode
era than in the old days.
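
A rough sketch of that idea, for illustration only: one compiled regexp finds the token boundaries, then plain dicts identify keywords and operators. The token kinds, the KEYWORDS and OPERATORS tables, and the tokenize() helper are all made up for this example and are not from any real tokenizer. It assumes Python 3, where \w and \d in a str pattern already match Unicode letters and digits, so no hand-built character class table is needed.

```python
import re

# Hypothetical lookup tables: token text -> token kind.
KEYWORDS = {"if": "IF", "else": "ELSE", "while": "WHILE"}
OPERATORS = {"==": "EQ", "=": "ASSIGN", "+": "PLUS", "-": "MINUS",
             "*": "STAR", "/": "SLASH", "(": "LPAREN", ")": "RPAREN"}

# One regexp finds the token boundaries. Each alternative is a named
# group, so m.lastgroup reports which coarse class matched.
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)              # whitespace, skipped below
  | (?P<number>\d+)          # run of digits
  | (?P<name>[^\W\d]\w*)     # letter or underscore, then word characters
  | (?P<op>==|[-+*/=()])     # two-character operator is tried first
""", re.VERBOSE)

def tokenize(text):
    """Return a list of (kind, value) pairs for the given text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(
                f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind, value = m.lastgroup, m.group()
        if kind == "ws":
            continue
        if kind == "name":
            # One dict lookup per token, not per input character.
            kind = KEYWORDS.get(value, "NAME")
        elif kind == "op":
            kind = OPERATORS[value]
        else:
            kind = "NUMBER"
        tokens.append((kind, value))
    return tokens
```

Calling tokenize("if x == 42") gives IF, NAME, EQ and NUMBER tokens in that order. The per-character scanning happens inside the regexp engine, and the dict is consulted only once per token.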