[Python-Dev] Changes in html.parser may cause breakage in client code
Guido van Rossum
guido at python.org
Fri Apr 27 16:36:06 CEST 2012
Someone should contact the Django folks. Alex Gaynor?
On Thursday, April 26, 2012, Ezio Melotti wrote:
> On 26/04/2012 22.10, Vinay Sajip wrote:
>> Following recent changes in html.parser, the Python 3 port of Django I'm
>> working on has started failing while parsing HTML.
>> The reason appears to be that Django uses some module-level data in
>> html.parser, for example tagfind, which is a regular expression pattern.
>> This has changed recently (Ezio changed it in ba4baaddac8d).
> html.parser doesn't use any private _name, so I was treating only the
> documented names as part of the public API. Several methods are marked with
> an "# internal" comment, but that's not visible unless you go read the
> source code.
>> Now tagfind (and other such patterns) are not marked as private (though not
>> documented), but should they be? The following script (tagfind.py):
>> import html.parser as Parser
>> data = '<select name="stuff">'
>> m = Parser.tagfind.match(data, 1)
>> print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
>> gives different results on 3.2 and 3.3:
>> $ python3.2 tagfind.py
>> '[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
>> $ python3.3 tagfind.py
>> '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
>> The trailing space later causes a mismatch with the end tag, and leads to
>> errors. Django's use of the tagfind pattern is in a subclass of
>> HTMLParser, in an overridden parse_starttag method.
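The difference can be reproduced without depending on html.parser's module-level names. What follows is a minimal sketch that copies the two patterns quoted above into local variables; the names OLD_TAGFIND and NEW_TAGFIND are made up for illustration and are not part of the standard library. It suggests that slicing up to m.end() picks up the trailing whitespace with the newer pattern, while reading group 1 yields the bare tag name:

```python
import re

# Patterns copied from the script output quoted above.
# The variable names are hypothetical, chosen only for this sketch.
OLD_TAGFIND = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')                  # as printed by 3.2
NEW_TAGFIND = re.compile(r'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\s|/(?!>))*')  # as printed by 3.3

data = '<select name="stuff">'

# Start matching at index 1, just past the opening '<'.
old_match = OLD_TAGFIND.match(data, 1)
new_match = NEW_TAGFIND.match(data, 1)

# Slicing up to the end of the whole match includes whatever the
# pattern consumed after the tag name.
old_slice = data[1:old_match.end()]
new_slice = data[1:new_match.end()]

# The newer pattern captures the tag name alone in group 1.
new_name = new_match.group(1)

print(repr(old_slice))
print(repr(new_slice))
print(repr(new_name))
```

Code that compares the sliced text against an end tag name would therefore see a mismatch with the newer pattern unless it strips the whitespace or uses the captured group.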
> Django shouldn't override parse_starttag (internal and undocumented), but
> just use handle_starttag (public and documented).
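As a rough illustration of the documented route, here is a minimal sketch of a subclass that overrides only handle_starttag. The class name TagCollector and its seen attribute are hypothetical and are not taken from Django's code; whether this covers what Django's override actually needs is an open question.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical subclass that relies only on the documented callback."""

    def __init__(self):
        super().__init__()
        # Each entry is (tag name, dict of attributes) for one start tag.
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name and a list of (name, value)
        # pairs; no module-level regex is touched here.
        self.seen.append((tag, dict(attrs)))


parser = TagCollector()
parser.feed('<select name="stuff">')
parser.close()
print(parser.seen)
```

Because the tag name arrives already parsed, a subclass written this way does not depend on the exact form of tagfind.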
> I see two possible reasons why it's overriding parse_starttag:
> 1) Django is working around an HTMLParser bug. In this case the bug
> could have been fixed (leading to the breakage of the now-useless
> workaround), and you might now be able to use the original parse_starttag
> and have the correct result. If it is indeed working around a bug and the
> bug is still present, you should report it upstream.
> 2) Django is implementing an additional feature. Depending on what
> exactly the code is doing you might want to open a new feature request on
> the bug tracker. For example the original parse_starttag sets a
> self.lasttag attribute with the correct name of the last tag parsed. Note
> however that both parse_starttag and self.lasttag are internal and
> shouldn't be used directly (but lasttag could be exposed and documented if
> people really think that it's useful).
>> Do we need to indicate more strongly that data like tagfind are private?
>> Or has the change introduced inadvertent breakage, requiring a fix in Python?
> I'm not sure that reverting the regex, deprecating all the exposed internal
> names, and adding/using internal _names instead is a good idea at this point.
> This will cause more breakage, and it would require an extensive renaming.
> I can add notes to the documentation/docstrings and specify what's private
> and what's not though.
> OTOH, if this specific fix is not released yet I can still do something to
> limit/avoid the breakage.
> Best Regards,
> Ezio Melotti
>> Vinay Sajip
> Python-Dev mailing list
> Python-Dev at python.org
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/
--Guido van Rossum (python.org/~guido)