[Python-Dev] Changes in html.parser may cause breakage in client code

Georg Brandl g.brandl at gmx.net
Thu Apr 26 21:26:08 CEST 2012

On 26.04.2012 21:10, Vinay Sajip wrote:
> Following recent changes in html.parser, the Python 3 port of Django I'm working
> on has started failing while parsing HTML.
> The reason appears to be that Django uses some module-level data in html.parser,
> for example tagfind, which is a regular expression pattern. This has changed
> recently (Ezio changed it in ba4baaddac8d).
> Now tagfind (and other such patterns) are not marked as private (though not
> documented), but should they be? The following script (tagfind.py):
>     import html.parser as Parser
>     data = '<select name="stuff">'
>     m = Parser.tagfind.match(data, 1)
>     print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
> gives different results on 3.2 and 3.3:
>     $ python3.2 tagfind.py
>     '[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
>     $ python3.3 tagfind.py
>     '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
> The trailing space later causes a mismatch with the end tag, and leads to the
> errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in
> an overridden parse_starttag method.
> Do we need to indicate more strongly that data like tagfind are private? Or has
> the change introduced inadvertent breakage, requiring a fix in Python?

Since it's a module-level constant without a leading underscore, IMO it was
okay for Django to use it, even if not documented.

In this case, especially since we actually have evidence of someone using the
constant, I would keep it as-is and use a new (underscored, this time) name for
the new pattern.
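
A minimal sketch of that approach, assuming the two patterns quoted above; the
underscored name `_tagfind_new` is made up for illustration and is not a name
taken from the actual html.parser source:

```python
import re

# Public name keeps the 3.2 pattern, since client code is known to use it.
tagfind = re.compile('[a-zA-Z][-.a-zA-Z0-9:_]*')

# The new pattern lives under a private name (hypothetical) for the
# parser's own internal use.
_tagfind_new = re.compile(r'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\s|/(?!>))*')

data = '<select name="stuff">'

old = tagfind.match(data, 1)
new = _tagfind_new.match(data, 1)

old_tag = data[1:old.end()]   # slicing on end() is safe with the old pattern
new_tag = new.group(1)        # the new pattern needs group(1) for the bare name

print(repr(old_tag), repr(new_tag))
```

With this split, code that slices on `tagfind.match(...).end()` keeps seeing the
tag name without trailing whitespace, while the parser itself can switch to the
new pattern and read the name from group 1.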

And yes, I think that we do need to indicate private-ness of module-level data.
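
As a rough illustration of what indicating private-ness can look like, the
sketch below uses the two usual conventions: a leading underscore and an
explicit `__all__`. The `demo` module and the `public_names` helper are
hypothetical stand-ins, not part of html.parser:

```python
import types


def public_names(module):
    """Return the names a module advertises as public.

    If the module defines __all__, that list wins; otherwise fall back to
    the convention that names without a leading underscore are public.
    """
    declared = getattr(module, '__all__', None)
    if declared is not None:
        return sorted(declared)
    return sorted(n for n in vars(module) if not n.startswith('_'))


# Hypothetical stand-in module built at runtime for the demonstration.
demo = types.ModuleType('demo')
demo.HTMLParser = object
demo.tagfind = 'no underscore: looks public'
demo._attrfind = 'underscore: private by convention'

names_without_all = public_names(demo)

# Declaring __all__ narrows the advertised interface explicitly.
demo.__all__ = ['HTMLParser']
names_with_all = public_names(demo)

print(names_without_all)
print(names_with_all)
```

Without `__all__`, an un-underscored name such as `tagfind` looks public to
anyone reading the module, which is the situation Django ran into.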

