XML and UnicodeError

Paul Boddie paul at boddie.org.uk
Tue Oct 5 17:55:33 CEST 2004

Just <just at xs4all.nl> wrote in message news:<just-EE3837.12524305102004 at news1.news.xs4all.nl>...
> In article <Xns957977C1D83A7devnulloo at>,
>  Pinke Panke <dev at null.oo> wrote:
> > fill = "bar".encode('utf-8') # lets make it unicode
> That's not making it unicode; you mean
>   fill = unicode("bar", "utf-8")
> (Or "bar".decode("utf-8"), which does the same; I prefer using the 
> unicode builtin.)

So do I - it can be confusing to think of performing a decoding
operation on a string which yields a Unicode object as a result.
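
To make the direction of each operation concrete, here is a minimal
sketch. It assumes an interpreter that accepts the b"" and u"" prefixes
(Python 2.6+ or 3.3+); the prefixes are only there so the example is
unambiguous, and in Python 2 the call unicode(raw, "utf-8") should give
the same result as the decode() call shown.

```python
# Illustrative names only; none of these come from the original script.
raw = b"bar"                  # a plain (byte) string

fill = raw.decode("utf-8")    # bytes -> Unicode object
back = fill.encode("utf-8")   # Unicode object -> bytes again

# decode() produces Unicode; encode() goes the other way.
print(fill == u"bar")         # the decoded value compares equal to u"bar"
print(back == raw)            # the round trip restores the original bytes
```

So encode() is the step to save for output, not the way to "make it
unicode".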

> > But now, I would think the safest way is to transfer all plain strings 
> > in the python script into a second XML file and use them, because 
> > after reading in they would be in Unicode. Right?
> Yes, but there's no need to. Are you perhaps using string literals 
> containing non-ascii chars, yet don't use the 'u' prefix? u"\xff" as 
> opposed to "\xff".

Having non-ASCII characters appear in string literals in the source
code can be somewhat risky, but there's always the encoding
declaration added in Python 2.3 to control the situation. Just
remember to convert any plain strings to Unicode in the program code.
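
A small sketch of why the 'u' prefix matters, using escapes so that the
result does not depend on how the source file itself is saved. The
variable names are made up for illustration, and ISO-8859-1 is only an
assumed input encoding, not something stated in the thread.

```python
text = u"\xff"    # Unicode literal: exactly one character, U+00FF
raw = b"\xff"     # plain string: one byte whose meaning depends on the encoding

# The lone byte 0xFF is not valid UTF-8, so decoding it that way fails.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the encoding the bytes were really written in works.
converted = raw.decode("iso-8859-1")

print(decoded_as_utf8)
print(converted == text)
```

The plain string only becomes usable text once you say which encoding
it is in, which is the job the encoding declaration does for literals
in the source file.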

> > Or saving the python script in utf-8 would make the difference?
> Depends...

...on that encoding declaration amongst other things.

> > >   4. Serialise to your chosen encoding only when preparing
> > >      output.
> > 
> > Every string concatenation in my script is preparing output.
> Do _all_ manipulations using unicode, and convert to utf-8 as late as 
> possible, i.e. when you're passing the result to code that expects 
> non-unicode data. That's basically what he was saying.

Yes, if you introduce a plain string anywhere, convert it to Unicode
as soon as you can, especially since such data always has a habit of
getting into those Unicode-related operations even though you didn't
think that it would. Once the different substitutions and other
processing are done, you may want to convert the Unicode values back to
plain strings, although I often find that this doesn't need doing
before a program's final output is prepared (and in the case of XML
serialisation, you want to leave this to the serialiser anyway, since
it will also write out which encoding it used when serialising).
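
The whole approach might be sketched as follows, using xml.dom.minidom
from the standard library. The element name "greeting" and the sample
text are invented for the example, and minidom is only one possible
serialiser, so treat this as an illustration under those assumptions
rather than the way the original script has to be written.

```python
from xml.dom.minidom import parseString

# Text read from XML arrives as Unicode already.
doc = parseString(b"<greeting>foo</greeting>")
node = doc.documentElement.firstChild

# A plain string gets converted to Unicode as soon as it appears.
fill = b"bar".decode("utf-8")

# All manipulation happens on Unicode values.
node.data = node.data + u" " + fill + u" \xff"

# Encoding is left to the serialiser, which also writes the encoding
# name into the XML declaration.
output = doc.toxml(encoding="utf-8")
print(output)
```

Nothing is encoded by hand here: the only place UTF-8 is mentioned for
output is the call that serialises the document.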

