htmllib: CR in CDATA
Mark Nottingham
mnot at pobox.com
Tue Jun 22 07:09:41 EDT 1999
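Here's a go at a patch. Looking at the DTD, practically all attribute
types are CDATA, and those that aren't shouldn't have tabs or newlines in
them anyway, so the patch normalizes every attribute value: tabs and
newlines become spaces and carriage returns are deleted. It does this with
string.translate; how efficient is that call?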
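One rough way to answer the efficiency question is to time the translate-based normalization against chained replace calls. This is only a sketch and it assumes a modern Python 3 interpreter, where str.maketrans and str.translate stand in for the string.maketrans and string.translate calls used in the patch. The helper names and the sample value are hypothetical, and timings depend on the machine, so none are quoted here.

```python
import timeit

# Python 3 stand-in for the patch's table: tab and newline become spaces,
# and carriage return is deleted (third argument of str.maketrans).
CDATA_TR = str.maketrans('\t\n', '  ', '\r')


def normalize_translate(value):
    """Normalize an attribute value with a single translate() pass."""
    return value.translate(CDATA_TR)


def normalize_replace(value):
    """Produce the same result with three chained replace() calls."""
    return value.replace('\r', '').replace('\t', ' ').replace('\n', ' ')


# Hypothetical attribute value containing all three characters.
SAMPLE = 'http://example.com/a\tb\r\nc' * 20

if __name__ == '__main__':
    for func in (normalize_translate, normalize_replace):
        seconds = timeit.timeit(lambda: func(SAMPLE), number=10000)
        print('%-20s %.4f s' % (func.__name__, seconds))
```

Both helpers return the same string, so the comparison measures only the cost of the two approaches, not a difference in behaviour.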
*** /opt/local/lib/python1.5/sgmllib.py Thu Apr 15 00:54:11 1999
--- sgmllib.py Tue Jun 22 21:02:05 1999
***************
*** 38,43 ****
--- 38,44 ----
      '[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
      + ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
      + r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:+*%?!\(\)_#=~]*))?')
+ cdata_tr = string.maketrans('\t\n', '  ')
# SGML parser base class -- find tags and call handler functions.
***************
*** 251,256 ****
--- 252,258 ----
elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
attrvalue[:1] == '"' == attrvalue[-1:]:
attrvalue = attrvalue[1:-1]
+ attrvalue = string.translate(attrvalue, cdata_tr, '\r')
attrs.append((string.lower(attrname), attrvalue))
k = match.end(0)
if rawdata[j] == '>':
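To see what the patched lines do to a single attribute value without running the whole parser, here is a minimal standalone sketch. It assumes Python 3 (str.maketrans and str.translate replace the Python 1.5 string module functions), and clean_attrvalue is a hypothetical helper name, not something in sgmllib. The diff above does not show whether the translation applies only to quoted values, so the sketch applies it to every value; treat that as an assumption.

```python
# Python 3 equivalent of the patch's cdata_tr table combined with the '\r'
# delete argument that the patch passes to string.translate.
CDATA_TR = str.maketrans('\t\n', '  ', '\r')


def clean_attrvalue(attrvalue):
    """Strip matching quotes, then normalize whitespace as the patch does."""
    if attrvalue[:1] == "'" == attrvalue[-1:] or \
       attrvalue[:1] == '"' == attrvalue[-1:]:
        attrvalue = attrvalue[1:-1]
    # Assumption: quoted and unquoted values are translated alike.
    return attrvalue.translate(CDATA_TR)


print(repr(clean_attrvalue('"foo\r\nbar\tbaz"')))  # prints 'foo bar baz'
```

A CRLF pair inside a value therefore collapses to a single space: the carriage return is dropped and the newline is replaced.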