Problem with sgmllib.py parsing attributes
Ronald Hiller
ron at graburn.com
Fri Jan 14 08:17:22 EST 2000
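I've been having problems parsing some (admittedly bad) HTML that fails
to quote attribute values, e.g.:
<IMG SRC=xyx/abc=yes?def=no&more=$cool$>
The attrfind pattern stops matching the SRC value at the first &, since
neither & nor $ is in its character class for unquoted values.
To fix this, I added the & and $ characters to that class.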
Does anyone see a downside to this?
Thanks
Ron
-=-=-=-=-=-=- patch below -=-=-=-=-=-=-=-
*** sgmllib.py Mon Jan 25 16:57:07 1999
--- /tmp/sgmllib.py Fri Jan 14 07:59:29 2000
***************
*** 37,43 ****
  attrfind = re.compile(
      '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
      + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
!     + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:+*%?!\(\)_#=~]*))?')
  # SGML parser base class -- find tags and call handler functions.
--- 37,43 ----
  attrfind = re.compile(
      '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
      + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
!     + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:+*%?!&$\(\)_#=~]*))?')
  # SGML parser base class -- find tags and call handler functions.
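
For anyone who wants to see the effect without patching sgmllib, here is a minimal sketch that builds both versions of the pattern with the standard re module and runs them over the attribute text from the IMG example. It assumes a current Python rather than the 1999 source, and build_attrfind and first_attr are hypothetical helper names, not part of sgmllib.

```python
import re
import string

# Character classes for unquoted attribute values, copied from the patch.
OLD_CHARS = r'-a-zA-Z0-9./:+*%?!\(\)_#=~'
NEW_CHARS = r'-a-zA-Z0-9./:+*%?!&$\(\)_#=~'


def build_attrfind(unquoted_chars):
    """Rebuild the attrfind pattern around a given unquoted-value class.

    Hypothetical helper for illustration; sgmllib defines attrfind inline.
    """
    ws = string.whitespace
    return re.compile(
        '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % ws
        + '([%s]*=[%s]*' % (ws, ws)
        + r'(\'[^\']*\'|"[^"]*"|[%s]*))?' % unquoted_chars)


def first_attr(pattern, text):
    """Return (name, value) for the first attribute matched in text."""
    m = pattern.match(text)
    return m.group(1), m.group(3)


OLD_ATTRFIND = build_attrfind(OLD_CHARS)
NEW_ATTRFIND = build_attrfind(NEW_CHARS)

# The text between the tag name and the closing > in the example tag.
ATTRS = ' SRC=xyx/abc=yes?def=no&more=$cool$'

print(first_attr(OLD_ATTRFIND, ATTRS))  # value is cut off before the &
print(first_attr(NEW_ATTRFIND, ATTRS))  # value runs to the end of the text
```

Note that the patched class still leaves out characters such as , ; and @, so an unquoted value containing one of those would be cut short in the same way.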