codecs, Swedish characters, and XML...don't mix? (repost)

Fri May 11 16:00:40 EDT 2001

Michael Hammill <mike at pdc.kth.se> writes:
> bad_xml_pi = u'<?xml version="1.0" ?>'
> good_xml_pi = u'<?xml version="1.0" encoding="UTF-8" ?>'
> good_doctype = u'<!DOCTYPE ...... I'll spare you ...>'
> (new_result, n) = re.subn(bad_xml_pi, good_xml_pi + good_doctype, file_string)

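? is a special character in regular expressions (it makes the preceding
item optional, and . matches any character), so bad_xml_pi never matches
its own literal text.  You should either use
file_string.replace(bad_xml_pi, good_xml_pi + good_doctype),
or run bad_xml_pi through re.escape() before passing it to re.subn.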

>>> re.escape
<function escape at 0x8136e14>
>>> re.escape('<?xml version="1.0" ?>')
'\\<\\?xml\\ version\\=\\"1\\.0\\"\\ \\?\\>'
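
Here is a minimal sketch of both fixes side by side, in case it helps.
The DOCTYPE value and the sample file_string are hypothetical stand-ins
(the original DOCTYPE was elided), and the code assumes Python 3 strings
rather than the u'' literals quoted above.

```python
import re

bad_xml_pi = '<?xml version="1.0" ?>'
good_xml_pi = '<?xml version="1.0" encoding="UTF-8" ?>'
# Hypothetical placeholder: the real DOCTYPE was elided in the original post.
good_doctype = '<!DOCTYPE example>'
# Hypothetical sample input containing Swedish characters.
file_string = bad_xml_pi + '<example>sm\u00f6rg\u00e5sbord</example>'

# Unescaped pattern: "?" and "." are treated as regex operators,
# so the literal declaration is never matched.
unescaped_result, unescaped_count = re.subn(
    bad_xml_pi, good_xml_pi + good_doctype, file_string)

# Fix 1: escape the pattern so every character is matched literally.
escaped_result, escaped_count = re.subn(
    re.escape(bad_xml_pi), good_xml_pi + good_doctype, file_string)

# Fix 2: skip regular expressions and use plain string replacement.
replaced = file_string.replace(bad_xml_pi, good_xml_pi + good_doctype)
```

For a fixed literal like this one, the plain replace() call is probably
the simpler choice, since there is nothing to escape.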

Arguably the minidom .toxml() method should provide a way to select an
encoding, though.
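
For what it's worth, later Python releases do let you pass an encoding
to toxml().  A rough sketch, assuming a current Python 3 and a
hypothetical one-element document built only for illustration:

```python
from xml.dom.minidom import Document

# Hypothetical document; the element name and text are illustrative only.
doc = Document()
root = doc.createElement("greeting")
root.appendChild(doc.createTextNode("hej d\u00e5"))
doc.appendChild(root)

# With an encoding argument, toxml() returns bytes and writes the
# encoding into the XML declaration itself.
xml_bytes = doc.toxml(encoding="UTF-8")
```

With that, the declaration does not need to be patched afterwards by
string replacement.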

--amk