Problem with sgmllib.py parsing attributes

Ronald Hiller ron at graburn.com
Fri Jan 14 08:17:22 EST 2000


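I've been having problems parsing some (admittedly bad) HTML that fails
to quote attribute values, e.g.

    <IMG SRC=xyx/abc=yes?def=no&more=$cool$>

The unquoted-value branch of the attrfind pattern does not accept & or $,
so the match for SRC stops at the first &. To fix this, I added the & and
$ characters to that character class.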
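
The difference can be reproduced outside the parser. This is only a
minimal sketch, assuming the standard re module in a current Python;
make_attrfind, ORIGINAL_CHARS and PATCHED_CHARS are hypothetical names
used for illustration and are not part of sgmllib. It rebuilds the
pattern with each character class and shows what the third group (the
attribute value) captures for the tag above.

```python
import re
import string

WS = string.whitespace


def make_attrfind(value_chars):
    """Build an attrfind-style pattern (hypothetical helper, not in sgmllib).

    value_chars is the body of the character class that is allowed
    in an unquoted attribute value.
    """
    return re.compile(
        '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % WS
        + ('([%s]*=[%s]*' % (WS, WS))
        + r'(\'[^\']*\'|"[^"]*"|[' + value_chars + r']*))?')


# Character classes copied from the two sides of the patch.
ORIGINAL_CHARS = r'-a-zA-Z0-9./:+*%?!\(\)_#=~'
PATCHED_CHARS = r'-a-zA-Z0-9./:+*%?!&$\(\)_#=~'

# The attribute portion of the example tag.
attrs = ' SRC=xyx/abc=yes?def=no&more=$cool$'

old = make_attrfind(ORIGINAL_CHARS).match(attrs)
new = make_attrfind(PATCHED_CHARS).match(attrs)

print(old.group(1), repr(old.group(3)))  # value is cut off at the first '&'
print(new.group(1), repr(new.group(3)))  # value runs to the end of the string
```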

Does anyone see a downside to this?

Thanks
Ron

-=-=-=-=-=-=- patch below -=-=-=-=-=-=-=-

*** sgmllib.py  Mon Jan 25 16:57:07 1999
--- /tmp/sgmllib.py Fri Jan 14 07:59:29 2000
***************
*** 37,43 ****
  attrfind = re.compile(
      '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
      + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
!     + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:+*%?!\(\)_#=~]*))?')


  # SGML parser base class -- find tags and call handler functions.
--- 37,43 ----
  attrfind = re.compile(
      '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
      + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
!     + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:+*%?!&$\(\)_#=~]*))?')


  # SGML parser base class -- find tags and call handler functions.




