xml.dom.minidom character encoding

Peter Otten __peter__ at web.de
Wed Apr 21 19:58:14 CEST 2010


C. Benson Manica wrote:

> I have the following simple script running on 2.5.2 on a machine where
> the default character encoding is "ascii":
> 
> #!/usr/bin/env python
> #coding: utf-8
> 
> import xml.dom.minidom
> import codecs
> 
> str=u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
> doc=xml.dom.minidom.parseString( str )
> xml=doc.toxml( encoding="utf-8" )
> file=codecs.open( "foo.xml", "w", "utf-8" )
> file.write( xml )
> file.close()
> 
> I've specified utf-8 every place I can find that the documentation
> allows me to, and yet this doesn't even come close to working without
> UnicodeEncodeErrors.  What on Earth do I have to do to please the
> character encoding gods?

Verify every step as you proceed?

>>> import xml.dom.minidom
>>> s = u"<?xml version=\"1.0\" encoding=\"utf-8\"?><elements><elem attrib=\"ó\"/></elements>"
>>> doc = xml.dom.minidom.parseString(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/xml/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.5/xml/dom/expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.5/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 62: ordinal not in range(128)
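
The same failure can be reproduced without any XML at all. The following is only a sketch, written for Python 3, where the conversion has to be spelled out by hand; on 2.5 it apparently happens implicitly inside the parser call, going by the traceback above. The names `s` and `error` are just for illustration.

```python
# The same document text as above; "\u00f3" is the o with acute accent.
s = '<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\u00f3"/></elements>'

try:
    # Encoding with the "ascii" codec cannot represent the accented character.
    s.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception records which codec failed and where in the string.
print(error)
```

The reported position should match the one in the traceback, which points at the accented character inside the attribute value.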

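It seems that parseString() doesn't like unicode: the traceback shows the
parser trying to encode the string with the default "ascii" codec, which
fails on the ó. Let's try a UTF-8 encoded byte string then: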

>>> doc = xml.dom.minidom.parseString(s.encode("utf-8"))
>>> xml = doc.toxml(encoding="utf-8")

No complaints -- let's have a look at the result:

>>> xml
'<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\xc3\xb3"/></elements>'
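
Whether toxml() hands back text or bytes depends on the encoding argument, which is worth checking explicitly. A small sketch, assuming Python 3 (where the two types are called str and bytes); the variable names are made up for the example.

```python
import xml.dom.minidom

# Parse from UTF-8 bytes, as in the session above.
data = '<elements><elem attrib="\u00f3"/></elements>'.encode("utf-8")
doc = xml.dom.minidom.parseString(data)

as_text = doc.toxml()                    # no encoding given: text
as_bytes = doc.toxml(encoding="utf-8")   # encoding given: encoded bytes

print(type(as_text).__name__, type(as_bytes).__name__)
print(as_bytes)
```

The bytes variant carries the encoding in its XML declaration and contains the two-byte UTF-8 sequence for ó, so it can go to disk unchanged.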

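That's already a UTF-8 encoded byte string, so there is no need for
codecs.open(), which expects unicode and would try to encode the data a
second time. A plain open() will do: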

>>> f = open("foo.xml", "w")
>>> f.write(xml)
>>> f.close()
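
Put together, the whole round trip might look like the sketch below. It assumes Python 3, opens the file in binary mode because the payload is bytes, and writes into a temporary directory rather than the current one; that directory and the variable names are my own choices, not part of the original script.

```python
import os
import tempfile
import xml.dom.minidom

text = '<?xml version="1.0" encoding="utf-8"?><elements><elem attrib="\u00f3"/></elements>'

# Step 1: give the parser bytes in the encoding the declaration announces.
doc = xml.dom.minidom.parseString(text.encode("utf-8"))

# Step 2: serialize to UTF-8 encoded bytes.
payload = doc.toxml(encoding="utf-8")

# Step 3: write the bytes unchanged; no codecs.open() involved.
path = os.path.join(tempfile.mkdtemp(), "foo.xml")
with open(path, "wb") as f:
    f.write(payload)

# Step 4: verify by reading the file back and parsing it again.
with open(path, "rb") as f:
    written = f.read()
reparsed = xml.dom.minidom.parseString(written)
attrib = reparsed.documentElement.firstChild.getAttribute("attrib")
print(repr(attrib))
```

Reading the file back and parsing it again is the same "verify every step" idea applied to the last step.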

Peter


