[Python-Dev] Changes in html.parser may cause breakage in client code
Guido van Rossum
guido at python.org
Thu Apr 26 21:21:49 CEST 2012
On Thu, Apr 26, 2012 at 12:10 PM, Vinay Sajip <vinay_sajip at yahoo.co.uk> wrote:
> Following recent changes in html.parser, the Python 3 port of Django I'm working
> on has started failing while parsing HTML.
> The reason appears to be that Django uses some module-level data in html.parser,
> for example tagfind, which is a regular expression pattern. This has changed
> recently (Ezio changed it in ba4baaddac8d).
> Now tagfind (and other such patterns) are not marked as private (though not
> documented), but should they be? The following script (tagfind.py):
> import html.parser as Parser
> data = '<select name="stuff">'
> m = Parser.tagfind.match(data, 1)
> print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
> gives different results on 3.2 and 3.3:
> $ python3.2 tagfind.py
> '[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
> $ python3.3 tagfind.py
> '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
> The trailing space later causes a mismatch with the end tag, and leads to the
> errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in
> an overridden parse_starttag method.
> Do we need to indicate more strongly that data like tagfind are private? Or has
> the change introduced inadvertent breakage, requiring a fix in Python?
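
The difference reported above can be reproduced on any Python version by compiling the two quoted patterns directly, rather than importing them from html.parser. This is only a sketch: the names OLD_TAGFIND, NEW_TAGFIND and matched_text are made up for the illustration and do not exist in the standard library.

```python
import re

# Patterns copied from the 3.2 and 3.3 output quoted above.
OLD_TAGFIND = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')
NEW_TAGFIND = re.compile(r'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\s|/(?!>))*')

data = '<select name="stuff">'


def matched_text(pattern, text, pos=1):
    """Return the text the pattern consumes starting at pos (pos=1 skips '<')."""
    m = pattern.match(text, pos)
    return text[pos:m.end()]


old_name = matched_text(OLD_TAGFIND, data)  # bare tag name
new_name = matched_text(NEW_TAGFIND, data)  # tag name plus trailing whitespace

# With the newer pattern, the bare tag name is available as group 1.
new_group = NEW_TAGFIND.match(data, 1).group(1)

print(repr(old_name), repr(new_name), repr(new_group))
```

Client code that slices up to m.end() picks up the trailing whitespace under the newer pattern, while m.group(1) still yields the bare name.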
I think both. Looks like it wasn't meant to be exported. But it should
have been marked as such. And I think it would behoove us to reduce
random failures in important 3rd party libraries by keeping the old
version around (but mark it as deprecated with an explanatory comment,
and submit a Django fix to stop using it).
Also the module should be updated to use _tagfind internally (and
likewise for other accidental exports).
Traditionally we've been really lax about this stuff. We should strive
to define the exact boundaries of our APIs more clearly.
--Guido van Rossum (python.org/~guido)