[Python-bugs-list] [ python-Bugs-436621 ] sgmllib tag/attrib regexpr too strict?

noreply@sourceforge.net noreply@sourceforge.net
Thu, 05 Jul 2001 11:25:08 -0700


Bugs item #436621, was opened at 2001-06-26 23:39
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=436621&group_id=5470

Category: Python Library
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Dustin Boswell (boswell)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: sgmllib tag/attrib regexpr too strict?

Initial Comment:
1) I've seen tags like
<UNDER_SCORE> blah </UNDER_SCORE>
which the SGMLParser will not find correctly.
I'm guessing it has to do with the reg-expr for
tagfind:
tagfind = re.compile('[a-zA-Z][-.a-zA-Z0-9]*')

Does the spec allow for _ ?  Even if it doesn't,
maybe tagfind should be changed...
tagfind ?= re.compile('[a-zA-Z][-.a-zA-Z0-9_]*')
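
For illustration, a minimal sketch of the difference between the two
patterns quoted above. The variable names tagfind_current and
tagfind_proposed are made up for this example and are not names used in
sgmllib; the sample text is the tag from point 1 with the leading "<"
already consumed.

```python
import re

# Pattern quoted in the report: no underscore in the character class.
tagfind_current = re.compile('[a-zA-Z][-.a-zA-Z0-9]*')
# Variant suggested in the report, with "_" added (hypothetical name).
tagfind_proposed = re.compile('[a-zA-Z][-.a-zA-Z0-9_]*')

# Text as it would appear just after the opening "<" of the tag.
text = 'UNDER_SCORE> blah </UNDER_SCORE>'

current_name = tagfind_current.match(text).group(0)
proposed_name = tagfind_proposed.match(text).group(0)

print(current_name)   # UNDER
print(proposed_name)  # UNDER_SCORE
```

With the current pattern the match stops at the underscore, so the tag
name is cut short; the variant picks up the whole name.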

2) I've seen attributes with backquotes ` in them.
<a href=http://blah?key=val```junk``>
where key has the value val```junk``

Currently, attrfind (the regular expression for
such things) is
attrfind = re.compile( ...
r'\s*([a-zA-Z_][-.a-zA-Z_0-9]*) ...
(\s*=\s*'r'(\'[^\']*\'|"[^"]*"| ...
[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*))?')

Would it hurt to add ` to the long list of characters
that are already there?  Netscape seems to allow
them.
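
A rough sketch of what that would change, using only the unquoted-value
character class from the last line of the attrfind excerpt above rather
than the full pattern. The names unquoted_current and unquoted_proposed
are hypothetical, and this is not necessarily the change that was
eventually made in sgmllib.

```python
import re

# Unquoted-value character class as quoted in the report.
unquoted_current = re.compile(r'[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*')
# Same class with a backquote added (illustrative assumption only).
unquoted_proposed = re.compile(r'[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~`]*')

# The attribute value from the example tag, followed by the closing ">".
value = 'http://blah?key=val```junk``>'

current_value = unquoted_current.match(value).group(0)
proposed_value = unquoted_proposed.match(value).group(0)

print(current_value)   # http://blah?key=val
print(proposed_value)  # http://blah?key=val```junk``
```

The current class stops at the first backquote, so the rest of the value
is left unparsed; with the backquote added, the match runs up to the
closing ">".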

Thoughts?

----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2001-07-05 11:25

Message:
Logged In: YES 
user_id=3066

Fixed in Lib/sgmllib.py revisions 1.32 and 1.30.2.1.

On the attribute issue:  These are not legal attributes as
far as SGML is concerned, but Mozilla also allows the quote
characters in unquoted attribute values.
sgmllib now matches that behavior.

----------------------------------------------------------------------
