[New-bugs-announce] [issue26084] HTMLParser mishandles last attribute in self-closing tag
Tom Anderl
report at bugs.python.org
Mon Jan 11 15:48:45 EST 2016
New submission from Tom Anderl:
When the HTMLParser encounters a start tag element that includes:
1. an unquoted attribute as the final attribute
2. an optional '/' character marking the start tag as self-closing
3. no space between the final attribute and the '/' character
the '/' character gets attached to the attribute value and the element is interpreted as not self-closing. This can be illustrated with the following:
===============================================================================
import HTMLParser
# Begin Monkeypatch
#import re
#HTMLParser.attrfind = re.compile(
# r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
# r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^/>\s]*))?(?:\s|/(?!>))*')
# End Monkeypatch
class MyHTMLParser(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('got starttag: {0} with attributes {1}'.format(tag, attrs))
    def handle_endtag(self, tag):
        print('got endtag: {0}'.format(tag))
MyHTMLParser().feed('<img height=1.0 width=2.0/>')
==============================================================================
Running the above code yields the output:
got starttag: img with attributes [('height', '1.0'), ('width', '2.0/')]
Note the trailing '/' on the 'width' attribute. If I uncomment the monkey patch, the script then yields:
got starttag: img with attributes [('height', '1.0'), ('width', '2.0')]
got endtag: img
Note that the trailing '/' is gone, and an endtag event was generated.
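The effect of the changed character class can also be checked with the re module alone, independent of HTMLParser. The following is a minimal sketch under stated assumptions: PATCHED is the pattern from the monkeypatch above, while UNPATCHED is assumed to be the same pattern with '[^>\s]*' in place of '[^/>\s]*' for unquoted values (it is not copied from the library source here). The names PATCHED, UNPATCHED and scan_attrs are hypothetical, and running finditer over the whole tag is a simplification of how the parser applies the pattern.

```python
import re

# Pattern from the monkeypatch in the report: an unquoted value stops at '/'.
PATCHED = re.compile(
    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^/>\s]*))?(?:\s|/(?!>))*')

# Assumption: the unpatched pattern differs only in that an unquoted
# value may contain '/' (character class [^>\s] instead of [^/>\s]).
UNPATCHED = re.compile(
    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^>\s]*))?(?:\s|/(?!>))*')


def scan_attrs(pattern, tag_text):
    """Return (name, value) pairs found by the pattern in tag_text.

    Group 1 is the attribute name, group 3 the raw attribute value.
    """
    return [(m.group(1), m.group(3)) for m in pattern.finditer(tag_text)]


if __name__ == '__main__':
    tag = '<img height=1.0 width=2.0/>'
    print(scan_attrs(UNPATCHED, tag))  # [('height', '1.0'), ('width', '2.0/')]
    print(scan_attrs(PATCHED, tag))    # [('height', '1.0'), ('width', '2.0')]
```

With the patched pattern the match for 'width' ends before the '/', so the tag text that remains is '/>' and the element can be recognised as self-closing.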
----------
components: Library (Lib)
messages: 258013
nosy: Tom Anderl
priority: normal
severity: normal
status: open
title: HTMLParser mishandles last attribute in self-closing tag
type: behavior
versions: Python 2.7
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26084>
_______________________________________