Code that ought to run fast, but can't due to Python limitations.

Sun Jul 5 07:41:27 EDT 2009

On Jul 5, 3:12 am, "Hendrik van Rooyen" <m... at microcorp.co.za> wrote:
>
> Use a dispatch dict, and have each state return the next state.
> Then you can use strings representing state names, and
> everybody will be able to understand the code.
>
> toy example, not tested, nor completed:
>
> protocol = {"start":initialiser,"hunt":hunter,"classify":classifier,....other
> states}
>
> def state_machine():
>     next_step = protocol["start"]()
>     while True:
>         next_step = protocol[next_step]()
>

I've just spent about an hour looking over this code, with a few
comments to inject to the thread here:

- To all those suggesting the OP convert to a dispatch table, be
assured that this code is well aware of this idiom.  It is used
HEAVILY at a macro level, picking through the various HTML states
(starting a tag, reading attributes, reading body, etc.).  There still
are a number of cascading if-elif's within some of these states, and
some of them *may* be candidates for further optimization.

- There is an underlying HTMLInputStream that seems to be doing some
unnecessary position bookkeeping (positionLine and positionCol).
Commenting this out increases my test speed by about 13%.  In my
ignorance, I may be removing some important behavior, but this does
not seem to be critical as I tested against a few megs of HTML
source.  Before blaming the tokenizer for everything, there may be
more performance to be wrung from the input stream processor.  For
that matter, I would guess that about 90% of all HTML files that this
code would process would easily fit in memory - in that case, the
stream processing (and all of the attendant "if I'm not at the end of
the current chunk" code) could be skipped/removed entirely.

- The HTMLInputStream's charsUntil code is an already-identified
bottleneck, and some re enhancements have been applied here to help
out.

- Run-time construction of tuple literals where the tuple members are
constants can be lifted out.  emitCurrentToken rebuilds this tuple
every time it is called (which is a lot!):

    if (token["type"] in (tokenTypes["StartTag"], tokenTypes
["EndTag"], tokenTypes["EmptyTag"])):

Move this tuple literal into a class constant (or if you can tolerate
it, a default method argument to gain LOAD_FAST benefits - sometimes
optimization isn't pretty).

- These kinds of optimizations are pretty small, and only make sense
if they are called frequently.  Tallying which states are called in my
test gives the following list in decreasing frequency.  Such a list
would help guide your further tuning efforts:

tagNameState	194848
dataState	182179
attributeNameState	116507
attributeValueDoubleQuotedState	114931
tagOpenState	105556
beforeAttributeNameState	58612
beforeAttributeValueState	58216
afterAttributeValueState	58083
closeTagOpenState	50547
entityDataState	1673
attributeValueSingleQuotedState	1098
commentEndDashState	372
markupDeclarationOpenState	370
commentEndState	364
commentStartState	362
commentState	362
selfClosingStartTagState	359
doctypePublicIdentifierDoubleQuotedState	291
doctypeSystemIdentifierDoubleQuotedState	247
attributeValueUnQuotedState	191
doctypeNameState	32
beforeDoctypePublicIdentifierState	16
afterDoctypePublicIdentifierState	14
afterDoctypeNameState	9
doctypeState	8
beforeDoctypeNameState	8
afterDoctypeSystemIdentifierState	6
afterAttributeNameState	5
commentStartDashState	2
bogusCommentState	2

For instance, I wouldn't bother doing much tuning of the
bogusCommentState.  Anything called fewer than 50,000 times in this
test doesn't look like it would be worth the trouble.

-- Paul

(Thanks to those who suggested pyparsing as an alternative, but I
think this code is already beyond pyparsing in a few respects.  For
one thing, this code works with an input stream, in order to process
large HTML files; pyparsing *only* works with an in-memory string.
This code can also take advantage of some performance short cuts,
knowing that it is parsing HTML; pyparsing's generic classes can't do
that.)