minidom xml & non ascii / unicode & files

"Martin v. Löwis" martin at v.loewis.de
Sat Aug 6 20:15:35 CEST 2005


> So what I understood from all this is that once you're using unicode
> objects you're safe!
> At least as long as you don't use statements or operators that will
> implicitly try to convert the unicode object back to a bytestring using
> your default encoding (ascii), which will most certainly result in codec
> errors...

Correct.
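
A minimal sketch of that failure mode. It is written for Python 3 (an
assumption; this thread is about Python 2.3), where the conversion has to
be spelled out, so it does explicitly what Python 2 attempted implicitly
with the default codec. The sample string is made up for illustration.

```python
# Python 3 sketch: spell out the ascii conversion that Python 2 attempted
# implicitly whenever a unicode object met a byte string.
text = "L\u00f6wis"  # sample text containing one non-ASCII character

try:
    text.encode("ascii")   # equivalent of the implicit default-encoding step
    failed = False
except UnicodeEncodeError:
    failed = True          # the non-ASCII character cannot be encoded as ASCII

# Choosing the encoding yourself avoids the problem.
data = text.encode("utf-8")
```

Keeping text as Unicode and encoding only at the point of output, with an
explicitly chosen encoding, sidesteps the error.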

> Also, minidom seems to use unicode objects, which was not really documented
> in the Python 2.3 docs I've read about it.

It might be somewhat hidden:

http://docs.python.org/lib/dom-type-mapping.html

"DOMString defined in the recommendation is mapped to a Python string or
Unicode string. Applications should be able to handle Unicode whenever a
string is returned from the DOM."

http://docs.python.org/lib/minidom-and-dom.html
"The type DOMString maps to Python strings. xml.dom.minidom supports
either byte or Unicode strings, but will normally produce Unicode
strings. Values of type DOMString may also be None where allowed to have
the IDL null value by the DOM specification from the W3C."

In principle, you should fill Unicode strings into DOM trees all the
time, but it will work with byte strings as well as long as they are
ASCII.
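
As a rough illustration of filling a DOM tree with Unicode text, here is a
sketch assuming Python 3 (where every str is Unicode; the thread itself
concerns Python 2.3). The element name "person" and the text value are
hypothetical.

```python
from xml.dom import minidom

# Build a small document and fill it with Unicode text.
doc = minidom.Document()
root = doc.createElement("person")        # hypothetical element name
doc.appendChild(root)
root.appendChild(doc.createTextNode("Martin v. L\u00f6wis"))

# Serialise with an explicit encoding; the result is a byte string.
xml_bytes = doc.toxml(encoding="utf-8")

# Parsing the bytes back yields the original Unicode text.
reparsed = minidom.parseString(xml_bytes)
name = reparsed.documentElement.firstChild.data
```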

> As a matter of fact, using the following sequence will most certainly fail:
> f = codecs.open('utf8codecs.xml', 'w', 'utf-8')
> f.write(dom.toxml(encoding="utf-8"))
> f.close()

Correct. A codecs.StreamWriter expects Unicode objects, whereas toxml
returns byte strings (at least if you pass an encoding - because of a
bug, it might return a Unicode string otherwise).

> Then again, maybe this will work, I just thought of it:
> f = codecs.open('utf8codecs.xml', 'w', 'utf-8')
> f.write(dom.toxml())
> f.close()

Yes, that works: toxml() without an encoding returns Unicode because of
a bug, but for backwards compatibility this cannot be changed. People
should explicitly pass an encoding.
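
One way to follow that advice is to pass the encoding to toxml() and write
the resulting byte string to a file opened in binary mode, so no second
encoding step takes place. This is only a sketch, assuming Python 3; the
sample document and the temporary directory are made up, and the file name
is the one from the quoted snippet.

```python
import os
import tempfile
from xml.dom import minidom

# Sample document with a non-ASCII character (illustrative only).
dom = minidom.parseString("<name>L\u00f6wis</name>".encode("utf-8"))

# With an explicit encoding, toxml() returns an already-encoded byte string.
encoded = dom.toxml(encoding="utf-8")

# Write the bytes unchanged: binary mode, no codecs wrapper around the file.
path = os.path.join(tempfile.mkdtemp(), "utf8codecs.xml")
with open(path, "wb") as f:
    f.write(encoded)

# Read the file back to confirm the bytes on disk are what toxml() produced.
with open(path, "rb") as f:
    on_disk = f.read()
```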

> The next important thing is to make sure to use functions and objects
> that support unicode all the way, like minidom seems to do.

Indeed, there are still many functions in the standard library which
don't work with Unicode strings, but should. Some functions, of course,
are only meaningful for byte strings (such as the networking APIs).

Regards,
Martin


