[Python-Dev] Changes in html.parser may cause breakage in client code

Guido van Rossum guido at python.org
Fri Apr 27 16:36:06 CEST 2012


Someone should contact the Django folks. Alex Gaynor?

On Thursday, April 26, 2012, Ezio Melotti wrote:

> Hi,
>
> On 26/04/2012 22.10, Vinay Sajip wrote:
>
>> Following recent changes in html.parser, the Python 3 port of Django I'm
>> working on has started failing while parsing HTML.
>>
>> The reason appears to be that Django uses some module-level data in
>> html.parser, for example tagfind, which is a regular expression pattern.
>> This has changed recently (Ezio changed it in ba4baaddac8d).
>>
>
> html.parser doesn't use any private _name, so I was considering only the
> documented names to be part of the public API.  Several methods are marked
> with an "# internal" comment, but that's not visible unless you go read the
> source code.
>
>> Now tagfind (and other such patterns) are not marked as private (though not
>> documented), but should they be? The following script (tagfind.py):
>>
>>     import html.parser as Parser
>>
>>     data = '<select name="stuff">'
>>
>>     m = Parser.tagfind.match(data, 1)
>>     print('%r ->  %r' % (Parser.tagfind.pattern, data[1:m.end()]))
>>
>> gives different results on 3.2 and 3.3:
>>
>>     $ python3.2 tagfind.py
>>     '[a-zA-Z][-.a-zA-Z0-9:_]*' ->  'select'
>>     $ python3.3 tagfind.py
>>     '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' ->  'select '
>>
>> The trailing space later causes a mismatch with the end tag, and leads to
>> the errors. Django's use of the tagfind pattern is in a subclass of
>> HTMLParser, in an overridden parse_starttag method.
>>
>
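The difference can be reproduced on any Python version by compiling the two patterns from the quoted output directly, rather than importing them from html.parser. This is only a sketch: the names tagfind_32 and tagfind_33 are made up for illustration and are not part of the module.

```python
import re

# Patterns copied from the 3.2 and 3.3 output quoted above.
# The names tagfind_32 / tagfind_33 are hypothetical, for illustration only.
tagfind_32 = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')
tagfind_33 = re.compile(r'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\s|/(?!>))*')

data = '<select name="stuff">'

old = tagfind_32.match(data, 1)
new = tagfind_33.match(data, 1)

# Slicing up to m.end() picks up whatever the whole pattern consumed.
old_name = data[1:old.end()]
new_name = data[1:new.end()]

# The newer pattern captures the bare tag name in group 1.
new_group = new.group(1)

print(repr(old_name), repr(new_name), repr(new_group))
```

With the newer pattern the whole match also consumes the whitespace after the tag name, so code that slices up to m.end() picks up the trailing space, while group 1 still holds just the name.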
> Django shouldn't override parse_starttag (internal and undocumented), but
> just use handle_starttag (public and documented).
> I see two possible reasons why it's overriding parse_starttag:
>  1) Django is working around an HTMLParser bug.  In this case the bug
> may since have been fixed (breaking the now-useless workaround), and you
> may now be able to use the original parse_starttag and get the correct
> result.  If it is indeed working around a bug and the bug is still
> present, you should report it upstream.
>  2) Django is implementing an additional feature.  Depending on what
> exactly the code is doing you might want to open a new feature request on
> the bug tracker. For example the original parse_starttag sets a
> self.lasttag attribute with the correct name of the last tag parsed.  Note
> however that both parse_starttag and self.lasttag are internal and
> shouldn't be used directly (but lasttag could be exposed and documented if
> people really think that it's useful).
>
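For reference, a minimal sketch of the documented route: subclass HTMLParser and override handle_starttag, leaving parse_starttag and the module-level patterns alone. The class name TagCollector and the start_tags attribute are invented for this example; it is not what Django does.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical subclass that records start tags via the public hook."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # illustrative attribute, not part of HTMLParser

    def handle_starttag(self, tag, attrs):
        # tag is the lowercased tag name; attrs is a list of (name, value) pairs.
        self.start_tags.append((tag, attrs))


parser = TagCollector()
parser.feed('<select name="stuff"></select>')
parser.close()
print(parser.start_tags)
```

Since the parser hands over the tag name already extracted, the subclass never needs to touch tagfind or to slice the raw data itself.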
>> Do we need to indicate more strongly that data like tagfind are private?
>> Or has the change introduced inadvertent breakage, requiring a fix in Python?
>>
>
> I'm not sure that reverting the regex, deprecating all the exposed internal
> names, and adding/using internal _names instead is a good idea at this point.
>  This would cause more breakage, and it would require extensive renaming.
>  I can add notes to the documentation/docstrings and specify what's private
> and what's not, though.
> OTOH, if this specific fix is not released yet I can still do something to
> limit/avoid the breakage.
>
> Best Regards,
> Ezio Melotti
>
>> Regards,
>>
>> Vinay Sajip
>>
>>
>


-- 
--Guido van Rossum (python.org/~guido)