[issue7311] Bug on regexp of HTMLParser

Ezio Melotti report at bugs.python.org
Sun Mar 27 15:57:21 CEST 2011


Ezio Melotti <ezio.melotti at gmail.com> added the comment:

The HTML 4.01 specification says[0]:
"""
In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.
"""

The HTML 5 draft says[1]:
"""
The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.
"""

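So maybe [^>\s] is a little too permissive here: it accepts characters such as ", ', =, < and `, which both specifications exclude from unquoted attribute values.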
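
To make the difference concrete, here is a rough sketch comparing the three character classes for unquoted values. The pattern names and the accepted_by() helper are made up for illustration and are not taken from HTMLParser; the HTML 5 pattern uses \s, which in Python covers slightly more than the spec's "space characters", so treat it as an approximation.

```python
import re

# Hypothetical names, for illustration only (not HTMLParser's own regexes).
# The character class discussed in this issue: anything but '>' and whitespace.
PERMISSIVE = re.compile(r'[^>\s]+')
# HTML 4.01: letters, digits, hyphens, periods, underscores and colons.
HTML4_UNQUOTED = re.compile(r'[A-Za-z0-9._:-]+')
# HTML 5 draft: no spaces, ", ', =, <, > or `, and not empty.
# Assumption: \s stands in for the spec's "space characters".
HTML5_UNQUOTED = re.compile(r'''[^\s"'=<>`]+''')


def accepted_by(value):
    """Return the names of the patterns that match the whole value."""
    patterns = [
        ('permissive', PERMISSIVE),
        ('html4', HTML4_UNQUOTED),
        ('html5', HTML5_UNQUOTED),
    ]
    return [name for name, pat in patterns if pat.fullmatch(value)]


if __name__ == '__main__':
    for value in ['foo-bar.1', '/path/x', 'a=b', 'it"s', 'a<b']:
        print(repr(value), accepted_by(value))
```

Values like a=b or a<b are accepted only by the permissive class, while /path/x is fine for HTML 5 but not for HTML 4.01.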

[0]: http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
[1]: http://dev.w3.org/html5/spec/Overview.html#attributes-0

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7311>
_______________________________________

