[Python-Dev] Changes in html.parser may cause breakage in client code
Georg Brandl
g.brandl at gmx.net
Thu Apr 26 21:26:08 CEST 2012
On 26.04.2012 21:10, Vinay Sajip wrote:
> Following recent changes in html.parser, the Python 3 port of Django I'm working
> on has started failing while parsing HTML.
>
> The reason appears to be that Django uses some module-level data in html.parser,
> for example tagfind, which is a regular expression pattern. This has changed
> recently (Ezio changed it in ba4baaddac8d).
>
> Now tagfind (and other such patterns) are not marked as private (though not
> documented), but should they be? The following script (tagfind.py):
>
> import html.parser as Parser
>
> data = '<select name="stuff">'
>
> m = Parser.tagfind.match(data, 1)
> print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
>
> gives different results on 3.2 and 3.3:
>
> $ python3.2 tagfind.py
> '[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
> $ python3.3 tagfind.py
> '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
>
> The trailing space later causes a mismatch with the end tag, and leads to the
> errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in
> an overridden parse_starttag method.
>
> Do we need to indicate more strongly that data like tagfind are private? Or has
> the change introduced inadvertent breakage, requiring a fix in Python?
Since it's a module-level constant without a leading underscore, IMO it was
okay for Django to use it, even though it is not documented.

In this case, especially since we actually have evidence of someone using the
constant, I would keep the old pattern under the existing name and use a new
(underscored, this time) name for the new pattern.

And yes, I think that we do need to indicate the privateness of module-level
data.
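A minimal sketch of that suggestion, using the two patterns quoted above. The
name _tagfind_new is hypothetical, chosen only for illustration; it is not
necessarily the name html.parser uses or will use.

```python
import re

# Public name keeps the 3.2 pattern quoted above, so existing callers
# such as Django still get the bare tag name.
tagfind = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')

# Hypothetical private name (assumption, for illustration only) holding the
# 3.3 pattern quoted above. The leading underscore marks it as internal.
_tagfind_new = re.compile(r'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\s|/(?!>))*')

data = '<select name="stuff">'

old = tagfind.match(data, 1)
new = _tagfind_new.match(data, 1)

old_name = data[1:old.end()]   # slice up to the end of the old match
new_full = data[1:new.end()]   # new match also consumes trailing whitespace
new_name = new.group(1)        # the tag name itself is in group 1

print(repr(old_name), repr(new_full), repr(new_name))
```

With this arrangement, code that slices with tagfind.match(...).end() keeps
its 3.2 behaviour, while the parser internals can read the tag name from
group 1 of the private pattern.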
Georg