[Python-bugs-list] [ python-Bugs-436621 ] sgmllib tag/attrib regexpr too strict?
noreply@sourceforge.net
Thu, 05 Jul 2001 11:25:08 -0700
Bugs item #436621, was opened at 2001-06-26 23:39
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=436621&group_id=5470
Category: Python Library
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Dustin Boswell (boswell)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: sgmllib tag/attrib regexpr too strict?
Initial Comment:
1) I've seen tags like
<UNDER_SCORE> blah </UNDER_SCORE>
which the SGMLParser will not find correctly.
I'm guessing it has to do with the reg-expr for
tagfind:
tagfind = re.compile('[a-zA-Z][-.a-zA-Z0-9]*')
Does the spec allow for _ ? Even if it doesn't,
maybe tagfind should be changed...
tagfind ?= re.compile('[a-zA-Z][-.a-zA-Z0-9_]*')
2) I've seen attributes with backquotes ` in them.
<a href=http://blah?key=val```junk``>
where key has the value val```junk``
Currently, attrfind (the regular expression for
such things) is
attrfind = re.compile( ...
r'\s*([a-zA-Z_][-.a-zA-Z_0-9]*) ...
(\s*=\s*'r'(\'[^\']*\'|"[^"]*"| ...
[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*))?')
Would it hurt to add ` to the long list of characters
that are already there? Netscape seems to allow
it.
Thoughts?
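For illustration, point 1 can be reproduced outside sgmllib with the two
tagfind patterns quoted above. This is only a sketch: the helper tag_name
and the variable names are made up for the example and are not part of
sgmllib, and the match is simply started one character past the '<'.

```python
import re

# Pattern quoted in the report as the current sgmllib tagfind.
tagfind_current = re.compile('[a-zA-Z][-.a-zA-Z0-9]*')
# Pattern proposed in the report: the same class plus the underscore.
tagfind_proposed = re.compile('[a-zA-Z][-.a-zA-Z0-9_]*')


def tag_name(pattern, text):
    """Hypothetical helper: match a tag name right after the leading '<'."""
    m = pattern.match(text, 1)  # position 1 skips the '<'
    return m.group(0) if m else None


sample = '<UNDER_SCORE> blah </UNDER_SCORE>'
current_name = tag_name(tagfind_current, sample)
proposed_name = tag_name(tagfind_proposed, sample)

print(current_name)   # UNDER
print(proposed_name)  # UNDER_SCORE
```

With the current pattern the name stops at the underscore, so the tag is
seen as UNDER rather than UNDER_SCORE.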
----------------------------------------------------------------------
>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2001-07-05 11:25
Message:
Fixed in Lib/sgmllib.py revisions 1.32 and 1.30.2.1.
On the attribute issue: These are not legal attributes as
far as SGML is concerned, but Mozilla also allows the quote
characters in the value of an unquoted attribute.
sgmllib now matches that behavior.
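
A rough sketch of the attribute issue, using only the unquoted-value
character class quoted in the initial comment. The variant with the
backquote added is an assumption made for illustration; it is not
claimed to be the exact pattern committed in revision 1.32, and the
variable names are invented for the example.

```python
import re

# Unquoted-value character class as quoted in the initial comment.
unquoted_current = re.compile(r'[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*')
# Hypothetical variant: the same class with the backquote added.
unquoted_with_backquote = re.compile(r'[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~`]*')

value = 'http://blah?key=val```junk``'

current_match = unquoted_current.match(value).group(0)
variant_match = unquoted_with_backquote.match(value).group(0)

print(current_match)  # http://blah?key=val
print(variant_match)  # http://blah?key=val```junk``
```

Without the backquote in the class, the value is cut off at the first
backquote; with it, the whole unquoted value is consumed.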
----------------------------------------------------------------------