Code that ought to run fast, but can't due to Python limitations.

Sat Jul 4 14:03:52 EDT 2009

On Sat, Jul 4, 2009 at 1:33 PM, John Nagle<nagle at animats.com> wrote:
>   As an example of code that really needs to run fast, but is
> speed-limited by Python's limitations, see "tokenizer.py" in
>
>        http://code.google.com/p/html5lib/
>
> This is a parser for HTML 5, a piece of code that will be needed
> in many places and will process large amounts of data. It's written
> entirely in Python.  Take a look at how much work has to be performed
> per character.
>
> This is a good test for Python implementation bottlenecks.  Run
> that tokenizer on HTML, and see where the time goes.
>
> ("It should be written in C" is not an acceptable answer.)
>
> Python doesn't have a "switch" or "case" statement, and when
> you need a state machine with many states, that makes for painful,
> slow code.  There's a comment in the code that it would be useful
> to run a few billion lines of HTML through an instrumented version
> of the parser to decide in which order the IF statements should be
> executed.  You shouldn't have to do that.
>

If you're cases are hashable, just use a dict instead of the if chain.
Then you get the constant time access to it.

def func_a() :
   ...
def func_b():
   ...
def func_c():
   ....

case = {"a":func_a, "b":func_b, "c":func_c}

case[value]()

> Yes, I've read PEP 3103.  The big problem is the difficulty of figuring
> out what's a constant and what might change.  If all the cases are
> constants,
> case statements are easy.  But in Python, the compiler can't tell.
>
> Parsers have many named compile-time constants.  Python doesn't support
> named compile-time constants, and this is one of the places where we
> have to pay the bill for that limitation.
>
> Something to think about when you need three more racks of servers
> because the HTML parser is slow.
>
>                                John Nagle
> --
> http://mail.python.org/mailman/listinfo/python-list
>