codecs, Swedish characters, and XML...don't mix? (repost)

Michael Hammill mike at pdc.kth.se
Fri May 11 17:12:20 CEST 2001


Dear Andrew,

You were right.  By breaking the line up as you suggested, I found that the 
error was in the writing part, not explicitly in the dom.toxml().  I now 
have a UTF-8 output XML file, but I have run into a new problem, which you 
foresaw: .toxml() gives me an XML header of
<?xml version="1.0" ?>, whereas it should be
<?xml version="1.0" encoding="UTF-8" ?>.  Additionally, I would like to put
back the <!DOCTYPE> line that minidom stripped out.  It contains a DTD I'm
validating against.

I thought I could do this by opening the UTF-8 file containing the .toxml() 
output, replacing the <?xml?> line with the proper one, and then adding 
the <!DOCTYPE>.  This is not working: I get no errors or tracebacks, 
but no replacement happens either.  I'm using Python 2.1's re module, which 
I read is Unicode-aware, but I'm obviously doing something wrong.  Here's 
what I'm doing:

import codecs
import re

f = codecs.open('outfromtoxml', 'rb', 'UTF-8')
g = codecs.open('new', 'wb', 'UTF-8')

file_string = f.read()
f.close()
bad_xml_pi = u'<?xml version="1.0" ?>'
good_xml_pi = u'<?xml version="1.0" encoding="UTF-8" ?>'
good_doctype = u'<!DOCTYPE ...... I'll spare you ...>'
(new_result, n) = re.subn(bad_xml_pi, good_xml_pi + good_doctype, file_string)
g.write(file_string)
g.close()

I get the output without the hoped-for change.  I tried using .readlines() 
instead of .read(), but oddly got only an empty list.  From what I read, it 
appears .readlines() probably can't interpret line breaks in Unicode 
files.  Shouldn't .read() and re work?
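For what it's worth, here is a minimal sketch of two likely causes, written for a recent Python 3 rather than the Python 2.1 used above. First, the search string is treated as a regular expression, so '?' and '.' are metacharacters and the unescaped pattern never matches the literal declaration. Second, the snippet writes file_string rather than new_result. The helper name fix_header, the sample document, and the sample DOCTYPE are hypothetical, and the newline between the two header lines is my own choice.

```python
import re

def fix_header(text, doctype):
    """Swap the bare XML declaration for one naming UTF-8 and add a DOCTYPE."""
    bad_xml_pi = '<?xml version="1.0" ?>'
    good_xml_pi = '<?xml version="1.0" encoding="UTF-8" ?>'
    replacement = good_xml_pi + '\n' + doctype
    # re.escape() makes '?' and '.' match literally. Unescaped, '<?' means
    # "an optional '<'", so the pattern cannot match the real declaration.
    # A function as the replacement avoids backslash processing in its text.
    return re.subn(re.escape(bad_xml_pi), lambda m: replacement, text, count=1)

# Hypothetical input and DOCTYPE, for illustration only.
sample = '<?xml version="1.0" ?><doc>\u00e5\u00e4\u00f6</doc>'
new_result, n = fix_header(sample, '<!DOCTYPE doc SYSTEM "doc.dtd">')
print(n)
print(new_result)  # this, not the original string, is what should be written
```

Since the search text is fixed, plain string replacement (text.replace(old, new, 1)) would presumably do the same job without involving re at all.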

Alternatively, I can see how to add comments and processing instructions 
using dom.minidom, but I see no way to do the two replacements above in 
dom.minidom.  Any ideas?
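As a hedged sketch of the DOM-only route: in recent Python 3 versions of xml.dom.minidom, toxml() accepts an encoding argument that adds the encoding to the XML declaration, and the DOM implementation object can create a DocumentType node. These features may not exist in the minidom shipped with Python 2.1, so treat this as an assumption about newer versions. The element name 'doc' and the system identifier 'doc.dtd' are hypothetical.

```python
from xml.dom.minidom import getDOMImplementation, parseString

# Hypothetical document standing in for the real input.
dom = parseString('<doc>\u00e5\u00e4\u00f6</doc>')

# Build a DOCTYPE node (name, public id, system id) and place it
# ahead of the root element.
impl = getDOMImplementation()
doctype = impl.createDocumentType('doc', None, 'doc.dtd')
dom.insertBefore(doctype, dom.documentElement)

# With an encoding given, toxml() returns bytes and the declaration
# carries encoding="UTF-8".
out = dom.toxml(encoding='UTF-8')
print(out.decode('UTF-8'))
```

Because the result is already encoded bytes, it would be written to a file opened in binary mode rather than through an encoding wrapper.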

Thank you for your kind help!
Best regards,
Mike




