Unicode/ascii encoding nightmare

Paul Boddie paul at boddie.org.uk
Tue Nov 7 05:57:15 EST 2006


Thomas W wrote:
> Ok, I've cleaned up my code a bit and it seems as if I've
> encoded/decoded myself into a corner ;-).

Yes, you may encounter situations where you have some string, you
"decode" it (i.e. convert it to Unicode) using one character encoding,
but then you later "encode" it (i.e. convert it back to a plain string)
using a different character encoding. This isn't a problem on its own,
but if you then take that plain string and attempt to convert it to
Unicode again, using the same input encoding as before, you'll be
misinterpreting the contents of the string.
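
To make that concrete, here is a minimal sketch of the mistake. It
assumes Python 3, where the "plain string" above corresponds to a bytes
object and the Unicode string to str; the sample text and the choice of
ISO-8859-1 and UTF-8 are only for illustration.

```python
# Python 3 sketch: bytes plays the role of the "plain string" above.
original = b"caf\xe9"                 # "café" as ISO-8859-1 bytes

text = original.decode("iso-8859-1")  # decode with the input encoding
stored = text.encode("utf-8")         # later encoded with a different encoding

# Fine so far: decoding with the encoding actually used recovers the text.
assert stored.decode("utf-8") == text

# The mistake: reusing the original input encoding on the re-encoded bytes.
misread = stored.decode("iso-8859-1")
print(repr(text))     # 'café'
print(repr(misread))  # 'cafÃ©'
```

The two bytes that UTF-8 uses for the accented letter are each read as
a separate ISO-8859-1 character, which is where the stray characters
come from.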

This "round tripping" of character data is typical of Web applications:
you emit a Web page in one encoding, the fields in the forms are
represented in that encoding, and upon form submission you receive this
data. If you then process the form data using a different encoding,
you're misinterpreting what you previously emitted, and when you emit
this data again, you compound the error.
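
A rough simulation of that compounding, again assuming Python 3: the
function round_trip and its parameters are hypothetical names standing
in for "emit the page" and "read the submitted form", not anything from
a real Web framework.

```python
def round_trip(text, emit_encoding="utf-8", read_encoding="iso-8859-1"):
    """Emit text as bytes in one encoding, then read it back with another."""
    page_bytes = text.encode(emit_encoding)   # what the page sends out
    return page_bytes.decode(read_encoding)   # what the handler thinks it got

value = "\u00e9"            # a single accented character, e-acute
lengths = [len(value)]
for _ in range(3):          # three emit/submit cycles
    value = round_trip(value)
    lengths.append(len(value))

print(lengths)              # [1, 2, 4, 8]
```

Each cycle doubles the length of the non-ASCII part of the data, while
pure ASCII text passes through unchanged, which is why the problem often
goes unnoticed until somebody enters an accented character.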

> My understanding of unicode has room for improvement, that's for
> sure. I got some pointers and initial code-cleanup seems to have
> removed some of the strange results I got, which several of you also
> pointed out.

Converting to Unicode for processing is a "best practice" that you seem
to have adopted, but it's vital that you use character encodings
consistently. One trick that can mitigate situations where you have
less control over the encoding of data given to you is to attempt to
convert to Unicode using an encoding that is "conservative" with regard
to acceptable combinations of byte sequences, such as UTF-8; if such a
conversion fails, it's quite possible that another encoding applies,
such as ISO-8859-1, and you can try that. Since ISO-8859-1 is a
"liberal" encoding, in the sense that any byte value or combination of
byte values is acceptable, it should only be used as a last resort.
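
The fallback could be sketched roughly as follows, assuming Python 3
and a bytes input. The function name to_unicode and the default list of
candidate encodings are my own choices for illustration; the order
matters, with the strictest encoding first and ISO-8859-1 last.

```python
def to_unicode(data, encodings=("utf-8", "iso-8859-1")):
    """Try each candidate encoding in order; return (text, encoding used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next candidate
    raise ValueError("none of the candidate encodings applied")

print(to_unicode(b"caf\xc3\xa9"))  # ('café', 'utf-8')
print(to_unicode(b"caf\xe9"))      # ('café', 'iso-8859-1')
```

Note that the ISO-8859-1 attempt can never fail, so a successful result
from it only tells you that the data was not valid UTF-8, not that it
really was ISO-8859-1.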

However, it's best to have a high level of control over character
encodings rather than using tricks to avoid considering representation
issues carefully.

Paul



