[Python-bugs-list] [ python-Bugs-745002 ] <> in attrs in sgmllib not handled
SourceForge.net
noreply@sourceforge.net
Sat, 14 Jun 2003 00:58:38 -0700
Bugs item #745002, was opened at 2003-05-28 18:30
Message generated for change (Comment added) made by loewis
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=745002&group_id=5470
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Samuel Bayer (sambayer)
Assigned to: Nobody/Anonymous (nobody)
Summary: <> in attrs in sgmllib not handled
Initial Comment:
Hi folks -
This bug is noted in the source code for sgmllib.py,
and it finally bit me. If you feed the SGMLParser class
text such as
<tag attr = "<attrtag> bar </attrtag>">foo</tag>
the <attrtag> will be processed as a tag, as well as
being recognized as part of the attribute. This is
because of the way the end index for the opening tag is
computed.
As far as I can tell from the HTML 4.01 specification,
this is legal. The case I encountered was the value of
an "onmouseover" attribute: a JavaScript call that
contained HTML text as one of its arguments.
The problem is in SGMLParser.parse_starttag, which
attempts to compute the end of the opening tag with a
simple regexp [<>], and keeps using this index even when
attribute parsing has already advanced past it. There's
no real need to check this regexp in advance, as far as
I can tell.
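To make the failure concrete, here is a minimal sketch that uses
only the re module, not sgmllib itself. The helper name
naive_tag_end is hypothetical, and the pattern is just the [<>]
regexp described above, so treat it as an illustration of the
symptom rather than the library's actual code.

```python
import re

# The simple pattern described above: first '<' or '>' after the tag opens.
endbracket = re.compile('[<>]')

def naive_tag_end(rawdata, i):
    """Return the index a naive scan takes as the end of the tag, or -1.

    Hypothetical helper: it searches for the first bracket after the
    opening '<' without looking at quoted attribute values.
    """
    match = endbracket.search(rawdata, i + 1)
    if not match:
        return -1
    return match.start(0)

text = '<tag attr = "<attrtag> bar </attrtag>">foo</tag>'
j = naive_tag_end(text, 0)
# The scan stops at the '<' inside the quoted attribute value,
# so the "tag" is cut off in the middle of the attribute.
print(repr(text[:j]))
```

The search stops inside the quoted value, which is why the rest
of the attribute is then treated as markup.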
I've attached my proposed modification of
SGMLParser.parse_starttag; I've tested this change in
2.2.1, but there are no relevant differences between
2.2.1 and the head of the CVS tree for this method. No
guarantees of correctness, but it works on the examples
I've tested it on.
Cheers -
Sam Bayer
================================
w_endbracket = re.compile("\s*[<>]")

class SGMLParser:

    # Internal -- handle starttag, return length or -1
    # if not terminated
    def parse_starttag(self, i):
        self.__starttag_text = None
        start_pos = i
        rawdata = self.rawdata
        if shorttagopen.match(rawdata, i):
            # SGML shorthand: <tag/data/ == <tag>data</tag>
            # XXX Can data contain &... (entity or char refs)?
            # XXX Can data contain < or > (tag characters)?
            # XXX Can there be whitespace before the first /?
            match = shorttag.match(rawdata, i)
            if not match:
                return -1
            tag, data = match.group(1, 2)
            self.__starttag_text = '<%s/' % tag
            tag = tag.lower()
            k = match.end(0)
            self.finish_shorttag(tag, data)
            self.__starttag_text = rawdata[start_pos:match.end(1) + 1]
            return k
        # Now parse the data between i+1 and the end of
        # the tag into a tag and attrs
        attrs = []
        if rawdata[i:i+2] == '<>':
            # SGML shorthand: <> == <last open tag seen>
            k = i + 1
            tag = self.lasttag
        else:
            match = tagfind.match(rawdata, i+1)
            if not match:
                self.error('unexpected call to parse_starttag')
            k = match.end(0)
            tag = rawdata[i+1:k].lower()
            self.lasttag = tag
        while w_endbracket.match(rawdata, k) is None:
            match = attrfind.match(rawdata, k)
            if not match: break
            attrname, rest, attrvalue = match.group(1, 2, 3)
            if not rest:
                attrvalue = attrname
            elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                 attrvalue[:1] == '"' == attrvalue[-1:]:
                attrvalue = attrvalue[1:-1]
            attrs.append((attrname.lower(), attrvalue))
            k = match.end(0)
        match = endbracket.search(rawdata, k)
        if not match:
            return -1
        j = match.start(0)
        if rawdata[j] == '>':
            j = j+1
        self.__starttag_text = rawdata[start_pos:j]
        self.finish_starttag(tag, attrs)
        return j
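
For anyone who wants to try the idea outside sgmllib, the loop
above can be sketched as a standalone function. This is only an
approximation under stated assumptions: scan_starttag is a
hypothetical name, the tagfind and attrfind patterns are
simplified stand-ins rather than the module's real definitions,
and the SGML shorthand forms are left out.

```python
import re

# Simplified stand-ins for sgmllib's module-level patterns (assumptions,
# not the library's exact regexps).
tagfind = re.compile(r'[a-zA-Z][-.a-zA-Z0-9]*')
attrfind = re.compile(
    r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)'            # attribute name
    r'(\s*=\s*(\'[^\']*\'|"[^"]*"|[^\s<>]*))?')  # optional "= value"
w_endbracket = re.compile(r'\s*[<>]')
endbracket = re.compile(r'[<>]')

def scan_starttag(rawdata, i):
    """Return (tag, attrs, end) for the start tag at index i, or None.

    Attributes are consumed first; the closing bracket is only
    searched for after the last attribute has been read.
    """
    match = tagfind.match(rawdata, i + 1)
    if not match:
        return None
    k = match.end(0)
    tag = rawdata[i + 1:k].lower()
    attrs = []
    while w_endbracket.match(rawdata, k) is None:
        match = attrfind.match(rawdata, k)
        if not match:
            break
        name, rest, value = match.group(1, 2, 3)
        if not rest:
            # Attribute without a value, e.g. <input checked>
            value = name
        elif value[:1] in ('"', "'") and value[:1] == value[-1:]:
            # Strip matching quotes around the value
            value = value[1:-1]
        attrs.append((name.lower(), value))
        k = match.end(0)
    match = endbracket.search(rawdata, k)
    if not match:
        return None
    j = match.start(0)
    if rawdata[j] == '>':
        j += 1
    return tag, attrs, j

text = '<tag attr = "<attrtag> bar </attrtag>">foo</tag>'
tag, attrs, end = scan_starttag(text, 0)
print(tag, attrs, repr(text[end:]))
```

With the example from the report, the whole quoted value stays
in the attribute and the scan ends just before "foo".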
----------------------------------------------------------------------
>Comment By: Martin v. Löwis (loewis)
Date: 2003-06-14 09:58
Message:
Logged In: YES
user_id=21627
If this is a known bug, why are you reporting it?
----------------------------------------------------------------------