htmllib: CR in CDATA

Mark Nottingham mnot at pobox.com
Tue Jun 22 13:09:41 CEST 1999


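And here's a go at a patch. Looking at the DTD, practically all attribute
types are CDATA, and those that aren't shouldn't have tabs or newlines in
them anyway, so the patch treats every quoted attribute value the same way:
tabs and newlines become spaces and carriage returns are dropped. It uses
string.translate; how efficient is this?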
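
On the efficiency question, one rough way to get an answer is to time the
translate call against a naive chain of replace calls. The sketch below is
written for current Python 3, where string.maketrans and string.translate
have become str.maketrans and str.translate. The helper names and the sample
value are made up for illustration, and the timings will vary with the
interpreter and the input.

```python
import timeit

# Python 3 spelling of the table: tab and newline become spaces,
# carriage return is deleted (third argument of str.maketrans).
CDATA_TR = str.maketrans("\t\n", "  ", "\r")


def normalize_translate(value):
    """Normalize an attribute value in a single pass with str.translate."""
    return value.translate(CDATA_TR)


def normalize_replace(value):
    """Same result using three separate str.replace passes."""
    return value.replace("\r", "").replace("\t", " ").replace("\n", " ")


def compare(sample, number=1000):
    """Return (translate_seconds, replace_seconds) for `number` calls each."""
    t_translate = timeit.timeit(lambda: normalize_translate(sample), number=number)
    t_replace = timeit.timeit(lambda: normalize_replace(sample), number=number)
    return t_translate, t_replace


# Hypothetical attribute value containing all three characters.
SAMPLE = "a long\r\nattribute\tvalue " * 20

if __name__ == "__main__":
    print(compare(SAMPLE))
```

Both helpers return the same string, so any difference printed by compare()
is purely the cost of one pass versus three.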


*** /opt/local/lib/python1.5/sgmllib.py Thu Apr 15 00:54:11 1999
--- sgmllib.py Tue Jun 22 21:02:05 1999
***************
*** 38,43 ****
--- 38,44 ----
      '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
      + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
      + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:+*%?!\(\)_#=~]*))?')
+ cdata_tr = string.maketrans('\t\n', '  ')


  # SGML parser base class -- find tags and call handler functions.
***************
*** 251,256 ****
--- 252,258 ----
              elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                   attrvalue[:1] == '"' == attrvalue[-1:]:
                  attrvalue = attrvalue[1:-1]
+                 attrvalue = string.translate(attrvalue, cdata_tr, '\r')
              attrs.append((string.lower(attrname), attrvalue))
              k = match.end(0)
          if rawdata[j] == '>':
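
To show the effect of the second hunk in isolation, here is a minimal sketch
of the patched branch as a standalone function. It assumes current Python 3
(str.maketrans and str.translate instead of the string module functions used
in the patch), and clean_attrvalue is a hypothetical name, not something in
sgmllib.

```python
# Tab and newline map to a space; carriage return is dropped.
cdata_tr = str.maketrans("\t\n", "  ", "\r")


def clean_attrvalue(attrvalue):
    """Mimic the patched branch: strip matching quotes, then normalize."""
    if (attrvalue[:1] == "'" == attrvalue[-1:]
            or attrvalue[:1] == '"' == attrvalue[-1:]):
        attrvalue = attrvalue[1:-1]
        # Only quoted values are translated, as in the patch.
        attrvalue = attrvalue.translate(cdata_tr)
    return attrvalue


print(repr(clean_attrvalue('"line one\r\nline\ttwo"')))  # 'line one line two'
```

As in the patch, only quoted values are touched; the attribute regex does not
allow whitespace in unquoted ones.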





