Code that ought to run fast, but can't due to Python limitations.

Paul Rubin http
Sun Jul 5 01:48:19 CEST 2009


John Nagle <nagle at animats.com> writes:
>     A dictionary lookup (actually, several of them) for every
> input character is rather expensive. Tokenizers usually index into
> a table of character classes, then use the character class index in
> a switch statement.

Maybe you could use a regexp (and then have -two- problems...) to
find the token boundaries, then a dict to identify the actual token.
Tables of character classes seem a bit less attractive in the Unicode
era than in the old days: a flat table indexed by code point would need
over a million entries instead of 128 or 256.
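
Something like the following is roughly what I have in mind. It is only a
sketch under assumptions: the token set, the names KEYWORDS, OPERATORS,
TOKEN_RE and tokenize are all made up for illustration, and it is written
for Python 3, where \w and \d in a str pattern already match Unicode
letters and digits. The regexp finds where each token ends; the dicts
then decide what kind of token it is.

```python
import re

# Hypothetical token tables (illustration only): map token text to a kind.
KEYWORDS = {"if": "IF", "else": "ELSE", "while": "WHILE"}
OPERATORS = {"+": "PLUS", "-": "MINUS", "*": "STAR", "/": "SLASH",
             "=": "ASSIGN", "==": "EQ", "(": "LPAREN", ")": "RPAREN"}

# One alternation of named groups; the regex engine scans the characters,
# so there is no per-character dictionary lookup in Python code.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)          # run of digits
  | (?P<NAME>[^\W\d]\w*)     # letter or underscore, then word characters
  | (?P<OP>==|[-+*/=()])     # longest operator listed first
  | (?P<SPACE>\s+)           # whitespace, skipped below
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, lexeme) pairs; raise ValueError on an unknown character."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(
                "unexpected character %r at offset %d" % (text[pos], pos))
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SPACE":
            continue
        if kind == "NAME":
            # One dict lookup per token, not per character.
            kind = KEYWORDS.get(lexeme, "NAME")
        elif kind == "OP":
            kind = OPERATORS[lexeme]
        yield kind, lexeme
```

Calling list(tokenize("if x == 10")) walks the input one token at a time.
Whether this actually beats a hand-written character loop would have to
be measured on real input.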


