Code that ought to run fast, but can't due to Python limitations.
steve at REMOVE-THIS-cybersource.com.au
Sun Jul 5 04:28:01 EDT 2009
On Sat, 04 Jul 2009 20:15:11 -0700, John Nagle wrote:
> Paul Rubin wrote:
>> John Nagle <nagle at animats.com> writes:
>>> A dictionary lookup (actually, several of them) for every
>>> input character is rather expensive. Tokenizers usually index into a
>>> table of character classes, then use the character class index in a
>>> switch statement.
>> Maybe you could use a regexp (and then have -two- problems...) to find
>> the token boundaries, then a dict to identify the actual token. Tables
>> of character classes seem a bit less attractive in the Unicode era than
>> in the old days.
> I want to see a regular expression that expresses the HTML 5 token
> parsing rules, including all the explicitly specified error handling.
Obviously the regex can't do the error handling. Nor should you expect a
single regex to parse an entire HTML document. But you could (perhaps)
use regexes to parse pieces of the document, as needed.
Have you investigated the pyparsing module? Unless you have some reason
for avoiding it, for any complicated parsing job I'd turn to that before
trying to roll your own.
> Here's some actual code, from "tokenizer.py". This is called once
> for each character in an HTML document, when in "data" state (outside a
> tag). It's straightforward code, but look at all those dictionary
Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't
see how a case statement would help you here: you're not dispatching on a
value, but running through a series of tests until one passes. There are
languages where you can write something like:
x > 0: process_positive(x)
x < 0: process_negative(x)
x == 0: process_zero(x)
but that's generally just syntactic sugar for the obvious if...elif...
block. (Although clever compilers might recognise that it's the same x in
each expression, and do something clever to optimize the code.)
Nor do I see why you were complaining about Python not having true
constants. I don't see how that would help you... most of your explicit
dict lookups are against string literals e.g. contentModelFlags["RCDATA"].
So while I feel your pain, I'm not sure I understand why you're blaming
this on *Python*.
More information about the Python-list