utf8 and ftplib
fredrik at pythonware.com
Mon Jun 20 14:27:17 CEST 2005
Richard Lewis wrote:
> OK, I'm still not getting this unicode business.
> (If testing, make sure you save this as utf-8 encoded.)
why? that XML snippet doesn't include any UTF-8-encoded characters.
> file = codecs.open(sys.argv[1], "r", "utf-8")
> document = parse(file)
why do you insist on decoding the stream you pass to the XML parser,
when you've already been told that you shouldn't do that? the parser
reads the encoding from the file itself. change this to:

    document = parse(sys.argv[1])
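
to see that the parser does the decoding on its own, here is a minimal sketch (python 3 syntax; the sample document and the variable names are made up for illustration). the input is raw bytes, and the XML declaration tells the parser which encoding to use:

```python
from xml.dom.minidom import parseString

# raw bytes, as they would sit on disk; the declaration names the encoding
raw = u'<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\xe9</doc>'.encode("iso-8859-1")

# no codecs.open, no manual decode: the parser reads the declaration itself
document = parseString(raw)

# the DOM holds decoded text, whatever the encoding on disk was
text = document.documentElement.firstChild.data
```

the same applies to parse() when you hand it a filename.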
> print document.toxml(encoding="utf-8")
this converts the document to UTF-8, and prints it to stdout. if you get
gibberish, your stdout wants some other encoding. if you get "capital-
A-with-tilde" gibberish, your stdout expects ISO-8859-1.
try changing this to:
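
one sketch of such a change, assuming the goal is to encode for whatever the output stream expects (python 3 syntax; the helper name and the utf-8 fallback are assumptions, not something from the original post):

```python
import sys

def toxml_for_stream(document, stream=None, fallback="utf-8"):
    """Serialize a minidom document in the encoding the stream reports."""
    if stream is None:
        stream = sys.stdout
    # .encoding can be missing or None (e.g. when output is piped);
    # falling back to utf-8 is an assumption
    encoding = getattr(stream, "encoding", None) or fallback
    return document.toxml(encoding=encoding), encoding
```

the returned data is already encoded, so it should be written to the stream as bytes rather than decoded and encoded again.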
> out_str = unicode2charrefs(document.toxml(encoding="utf-8"))
this converts the document to UTF-8, and then translates the *encoded*
data to character references as if the document had been encoded as ISO-
8859-1. this makes no sense at all, and results in an XML document full
of "capital-A-with-tilde" gibberish.
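
a small sketch of what goes wrong (python 3 syntax; to_charrefs is a hypothetical stand-in for unicode2charrefs, whose source isn't shown here):

```python
def to_charrefs(text):
    # hypothetical stand-in: replace non-ASCII characters with &#NNN; references
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

text = u"caf\xe9"  # one non-ASCII character: e-with-acute

# wrong: encode to UTF-8 first, then treat each byte as an ISO-8859-1 character
wrong = to_charrefs(text.encode("utf-8").decode("iso-8859-1"))

# right: build character references from the decoded text
right = to_charrefs(text)

print(wrong)  # caf&#195;&#169;  (195 is capital A with tilde)
print(right)  # caf&#233;
```

the two bytes of the UTF-8 sequence each become their own character reference, which is where the "capital-A-with-tilde" comes from.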
> i.e., does anyone else get two byte sequences beginning with
> capital-A-with-tilde instead of the expected characters?
since you've requested UTF-8 output, "capital A with tilde" is the expected
result if you're directing output to an ISO-8859-1 stream.
> the output file is still wrong.
well, you're messing it up all by yourself. getting rid of all the codecs and
unicode2charrefs nonsense will fix this:
    document = parse(sys.argv[1]) # parser decodes
    ... manipulate document ...
    file = open(..., "w")
    file.write(document.toxml(encoding="utf-8")) # writer encodes
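
the same pattern as a self-contained sketch in python 3 syntax (the function name and its parameters are made up for illustration). note that in python 3, toxml(encoding=...) returns bytes, so the output file is opened in binary mode:

```python
from xml.dom.minidom import parse

def rewrite_as_utf8(src_path, dst_path):
    """Parse an XML file in whatever encoding it declares, write it back as UTF-8."""
    document = parse(src_path)  # parser decodes

    # ... manipulate document ...

    with open(dst_path, "wb") as out:  # binary mode: toxml() hands back bytes
        out.write(document.toxml(encoding="utf-8"))  # writer encodes
```

no codecs module and no character-reference helper is involved at any point.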