Code that ought to run fast, but can't due to Python limitations.

David M. Cooke david.m.cooke at gmail.com
Mon Jul 6 04:16:44 EDT 2009


Martin v. Löwis <martin <at> v.loewis.de> writes:

> > This is a good test for Python implementation bottlenecks.  Run
> > that tokenizer on HTML, and see where the time goes.
> 
> I looked at it with cProfile, and the top function that comes up
> for a larger document (52k) is
> ...validator.HTMLConformanceChecker.__iter__.
[...]
> With this simple optimization, I get a 20% speedup on my test
> case. In my document, there are no attributes - the same changes
> should be made to attribute validation routines.
> 
> I don't think this has anything to do with the case statement.

I agree. I ran cProfile over just the tokenizer step; essentially:

import html5lib.tokenizer

# htmldata is the raw HTML document, read into memory beforehand
tokenizer = html5lib.tokenizer.HTMLStream(htmldata)
for tok in tokenizer:
    pass
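
For completeness, the profile run itself can be as simple as the
following sketch (my wording, not part of the patch; it just wraps the
loop above in cProfile):

import cProfile
import pstats

# Profile a fresh tokenizer pass (the iterator above is already spent).
tokenizer = html5lib.tokenizer.HTMLStream(htmldata)

profiler = cProfile.Profile()
profiler.enable()
for tok in tokenizer:
    pass
profiler.disable()

# Show the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)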

Most of the time *isn't* spent in tokenizer.py; it's spent in
inputstream.py. (There is one exception:
tokenizer.py:HTMLStream.__init__ constructs a dictionary of states on
every instantiation -- this is unnecessary; replace expressions like
self.states["attributeName"] with self.attributeNameState.)
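
A minimal sketch of that change, with hypothetical class and state
names rather than the real html5lib code:

# Before: every instance builds its own dict of bound methods, and
# every state change pays for a string lookup.
class TokenizerBefore:
    def __init__(self):
        self.states = {"data": self.dataState,
                       "attributeName": self.attributeNameState}
        self.state = self.states["data"]

    def dataState(self):
        self.state = self.states["attributeName"]

    def attributeNameState(self):
        pass

# After: refer to the bound methods directly -- no per-instance dict
# and no string lookup on each transition.
class TokenizerAfter:
    def __init__(self):
        self.state = self.dataState

    def dataState(self):
        self.state = self.attributeNameState

    def attributeNameState(self):
        pass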

I've done several optimisations -- I'll upload the patch to the
html5lib issue tracker. In particular,

* The .position property of EncodingBytes is used a lot. Every
self.position += 1 calls getPosition() and setPosition(), and another
getPosition() call is made in the self.currentByte property. Most of
these can be optimised away by using methods that move the position
and return the current byte.

* In HTMLInputStream, the current line number and column are updated
every time a new character is read with .char(). The current position
is *only* used in error reporting, so I reworked it to calculate the
position only when .position() is called, by keeping track of the
number of lines in previously read chunks and computing the number of
lines up to the current offset in the current chunk. (Sketches of both
ideas follow below.)
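
Here are rough sketches of both ideas, using hypothetical class names
and much-simplified interfaces rather than the real html5lib classes:

# Sketch of the first idea: instead of a .position property whose
# getter/setter run on every "position += 1" and every currentByte
# access, expose one method that advances the cursor and returns the
# byte in a single step.
class ByteCursor:
    def __init__(self, data):
        self._data = data
        self._pos = 0

    def next_byte(self):
        b = self._data[self._pos:self._pos + 1]   # empty at end of data
        self._pos += 1
        return b

# Sketch of the second idea: don't update the line number on every
# char() read; count the newlines in fully consumed chunks and compute
# the current line only when position() is asked for (error reporting).
# Column tracking would work the same way and is omitted here.
class LazyPositionStream:
    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._chunk = ""
        self._offset = 0         # offset into the current chunk
        self._lines_before = 0   # newlines in fully consumed chunks

    def char(self):
        if self._offset >= len(self._chunk):
            self._lines_before += self._chunk.count("\n")
            self._chunk = next(self._chunks, "")
            self._offset = 0
            if not self._chunk:
                return ""        # end of input
        c = self._chunk[self._offset]
        self._offset += 1
        return c

    def position(self):
        # 1-based line number, computed only on demand.
        return 1 + self._lines_before + self._chunk[:self._offset].count("\n")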

Together these two optimisations give me about a 20% speedup.

This just illustrates that the first step in optimisation is profiling :D

As other posters have said, slurping the whole document into memory
and using a regexp-based parser (such as pyparsing) would likely give
you the largest speedups. If you want to keep the chunk-based
approach, you can still use regexps, but you'd have to think about
matching on chunk boundaries. One way would be to guarantee a minimum
number of characters available, say 10 or 50 (unless end-of-file, of
course) -- long enough such that any *constant* string you'd want to
match like <![CDATA[ would fit inside that guaranteed length. Any
arbitrary-length tokens (such as attribute names and values) would be
matched not with regexps like [a-z]+ but with [a-z]{1,10} (match
[a-z] from 1 to 10 times), with the individual matches joined together
to make one token.
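
A rough sketch of that bounded-match idea, using a plain in-memory
buffer in place of a real chunked stream (the helper name and shape
are mine, not html5lib's):

import re

# Match at most 10 lowercase letters per attempt; a shorter-than-10
# match means the run has ended, a full 10-character match means the
# token may continue (in the real thing: into the next chunk).
_NAME_PIECE = re.compile(r"[a-z]{1,10}")

def read_name(buffer, pos):
    """Return (token, new_pos) for a run of [a-z] starting at pos."""
    pieces = []
    while True:
        m = _NAME_PIECE.match(buffer, pos)
        if m is None:
            break
        pieces.append(m.group())
        pos = m.end()
        if len(m.group()) < 10:
            break
        # In the chunked version, this is where you'd make sure the
        # next chunk has been appended to the buffer before rematching.
    return "".join(pieces), pos

For constant strings like <![CDATA[ the guaranteed minimum lookahead
means a plain buffer.startswith("<![CDATA[", pos) check is enough.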

Since html5lib has several implementations for several languages, it
may actually be worth it to generate lexers for each language from one
specification file.

Take care,
David M. Cooke <david.m.cooke at gmail.com>



