[Python-bugs-list] [ python-Bugs-436621 ] sgmllib tag/attrib regexpr too strict?

noreply@sourceforge.net
Tue, 26 Jun 2001 23:39:48 -0700


Bugs item #436621, was opened at 2001-06-26 23:39
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=436621&group_id=5470

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Dustin Boswell (boswell)
Assigned to: Nobody/Anonymous (nobody)
Summary: sgmllib tag/attrib regexpr too strict?

Initial Comment:
1) I've seen tags like
<UNDER_SCORE> blah </UNDER_SCORE>
which SGMLParser does not parse correctly.
I'm guessing this is because of the regular expression
used for tagfind:
tagfind = re.compile('[a-zA-Z][-.a-zA-Z0-9]*')

Does the spec allow for _ ?  Even if it doesn't,
maybe tagfind should be changed to accept it:
tagfind = re.compile('[a-zA-Z][-.a-zA-Z0-9_]*')
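
To illustrate the difference, the two patterns can be compared directly
with the re module. This is only a minimal sketch, not sgmllib itself:
the names tagfind_current, tagfind_proposed and tag_text are made up for
the example, and it assumes the pattern is applied to the text right
after the opening "<".

```python
import re

# Pattern quoted above as the current tagfind.
tagfind_current = re.compile('[a-zA-Z][-.a-zA-Z0-9]*')
# Suggested variant: also accept "_" after the first character.
tagfind_proposed = re.compile('[a-zA-Z][-.a-zA-Z0-9_]*')

# Hypothetical input: the text that follows the opening "<".
tag_text = 'UNDER_SCORE> blah </UNDER_SCORE>'

current_match = tagfind_current.match(tag_text).group(0)
proposed_match = tagfind_proposed.match(tag_text).group(0)

print(current_match)   # UNDER
print(proposed_match)  # UNDER_SCORE
```

With the current pattern the tag name is cut off at the underscore,
which would explain why the tag is not found correctly.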

2) I've seen attributes with backquotes ` in them.
<a href=http://blah?key=val```junk``>
where key has the value val```junk``

Currently, attrfind (the regular expression for
such things) is
attrfind = re.compile( ...
r'\s*([a-zA-Z_][-.a-zA-Z_0-9]*) ...
(\s*=\s*'r'(\'[^\']*\'|"[^"]*"| ...
[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*))?')

Would it hurt to add ` to the long list of characters
that are already there?  Netscape seems to allow
them.
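
As a rough check, here is a minimal sketch of what adding ` would
change. It assumes the "..." in the snippet above are only line
continuations, so the pattern is reassembled from those fragments; the
helper build_attrfind and the variable names are hypothetical and not
part of sgmllib.

```python
import re

# Character class for unquoted values, copied from the fragment above.
VALUE_CHARS = r'-a-zA-Z0-9./:;+*%?!&$\(\)_#=~'

def build_attrfind(value_chars):
    # Reassembled from the quoted fragments (an assumption, see above).
    return re.compile(
        r'\s*([a-zA-Z_][-.a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|[' + value_chars + r']*))?')

attrfind_current = build_attrfind(VALUE_CHARS)
# Suggested variant: the same class with a backquote appended.
attrfind_proposed = build_attrfind(VALUE_CHARS + '`')

# Hypothetical input: the attribute text that follows "<a".
attr_text = ' href=http://blah?key=val```junk``>'

current_value = attrfind_current.match(attr_text).group(3)
proposed_value = attrfind_proposed.match(attr_text).group(3)

print(current_value)   # http://blah?key=val
print(proposed_value)  # http://blah?key=val```junk``
```

In this sketch the current pattern stops the value at the first
backquote, while the variant keeps the whole string up to ">".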

Thoughts?

----------------------------------------------------------------------
