Fredrik Lundh fredrik at pythonware.com
Mon Jun 20 14:27:17 CEST 2005

Richard Lewis wrote:

> OK, I'm still not getting this unicode business.


> <document>
>     <a>aàáâã</a>
>     <e>eèéêë</e>
>     <i>iìíîï</i>
>     <o>oòóôõ</o>
>     <u>uùúûü</u>
> </document>
> (If testing, make sure you save this as utf-8 encoded.)

why?  that XML snippet doesn't include any UTF-8-encoded characters.


>    file = codecs.open(sys.argv[1], "r", "utf-8")
>    document = parse(file)
>    file.close()

why do you insist on decoding the stream you pass to the XML parser,
when you've already been told that you shouldn't do that?  change this to:

        document = parse(sys.argv[1])

>    print document.toxml(encoding="utf-8")

this converts the document to UTF-8, and prints it to stdout.  if you get
gibberish, your stdout wants some other encoding.  if you get
"capital-A-with-tilde" gibberish, your stdout expects ISO-8859-1.

try changing this to:

    print document.toxml(encoding=sys.stdout.encoding)
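
to see where the gibberish comes from, here's a minimal sketch (not part of
the original exchange; the variable names are made up for illustration).  it
encodes a single a-with-tilde as UTF-8 and then reads the bytes back as
ISO-8859-1, which is roughly what a mismatched stdout does.  it's written
for Python 3, and should also run under a recent Python 2:

```python
# one character: LATIN SMALL LETTER A WITH TILDE (U+00E3)
text = u"\u00e3"

# UTF-8 encodes it as two bytes, 0xC3 0xA3
utf8_bytes = text.encode("utf-8")

# an ISO-8859-1 consumer maps each byte to one character, so the
# two bytes show up as capital-A-with-tilde (U+00C3) followed by
# a pound sign (U+00A3)
misread = utf8_bytes.decode("iso-8859-1")

print(repr(utf8_bytes))
print(repr(misread))
```

every non-ASCII character in the sample document ends up as two characters
this way, and for these particular letters the first one is always
capital-A-with-tilde.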

>    out_str = unicode2charrefs(document.toxml(encoding="utf-8"))

this converts the document to UTF-8, and then translates the *encoded*
data to character references as if the document had been encoded as
ISO-8859-1.  this makes no sense at all, and results in an XML document
full of "capital-A-with-tilde" gibberish.
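
a small sketch of the difference.  to_charrefs below is a hypothetical
stand-in for unicode2charrefs (whose source isn't shown here); it is assumed
to replace every non-ASCII character with a numeric character reference.
feeding it decoded text gives one reference per character, while feeding it
UTF-8 bytes treated as ISO-8859-1 characters gives two bogus references per
character:

```python
def to_charrefs(text):
    # hypothetical helper: replace each non-ASCII character in a
    # unicode string with a numeric character reference
    return u"".join(
        c if ord(c) < 128 else u"&#%d;" % ord(c)
        for c in text
    )

text = u"a\u00e3"  # "a" followed by a-with-tilde

# right: convert the decoded characters
good = to_charrefs(text)

# wrong: encode first, then treat each byte as a character
# (decoding as ISO-8859-1 maps every byte to one code point)
bad = to_charrefs(text.encode("utf-8").decode("iso-8859-1"))

print(good)
print(bad)
```

if character references are what you really want, the standard codecs can
produce them from a unicode string directly, via
text.encode("ascii", "xmlcharrefreplace").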

> i.e., does anyone else get two byte sequences beginning with
> capital-A-with-tilde instead of the expected characters?

since you've requested UTF-8 output, "capital A with tilde" is the expected
result if you're directing output to an ISO-8859-1 stream.

> the output file is still wrong.

well, you're messing it up all by yourself.  getting rid of all the codecs and
unicode2charrefs nonsense will fix this:

        document = parse(sys.argv[1]) # parser decodes

        ... manipulate document ...

        file = open(..., "w")
        file.write(document.toxml(encoding="utf-8")) # writer encodes
        file.close()
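
for reference, a self-contained sketch of the same round trip.  the file
names, the temporary directory and the sample content are assumptions made
up for the example.  it's written for Python 3, where
toxml(encoding=...) returns bytes, so the files are opened in binary mode:

```python
import os
import shutil
import tempfile
from xml.dom.minidom import parse

# sample input (assumed content), stored on disk as UTF-8 bytes
source = (u'<?xml version="1.0" encoding="utf-8"?>'
          u'<document><a>a\u00e0\u00e1</a></document>')

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "in.xml")
out_path = os.path.join(workdir, "out.xml")

with open(in_path, "wb") as f:
    f.write(source.encode("utf-8"))

# parser decodes: hand it the file name, not a decoded stream
document = parse(in_path)

# inside the DOM, text is plain unicode
text = document.getElementsByTagName("a")[0].firstChild.data

# writer encodes: toxml() produces the UTF-8 bytes
with open(out_path, "wb") as f:
    f.write(document.toxml(encoding="utf-8"))

with open(out_path, "rb") as f:
    written = f.read()

shutil.rmtree(workdir)
```

nothing in between needs the codecs module; the only places where an
encoding is involved are the parse call and the toxml call.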


