[Tutor] Unicode issues

Kent Johnson kent37 at tds.net
Thu Feb 24 13:51:04 CET 2005


Michael Lange wrote:
> I *thought* I would have to convert the user input which might be any encoding back into
> byte string first 

How are you getting the user input? Is it from the console or from a GUI?

I think the best strategy is to try to keep all your strings as Unicode. Unicode is the only 
character set that can represent characters from any locale. (That's the point of Unicode, actually.) 
So I would convert the user input to unicode, not to a byte string.

> (remember, I got heavily confused, because user input was sometimes unicode and
> sometimes byte string), so I can convert it to "standard" unicode (utf-8) later on.

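Careful! I wouldn't call utf-8 "standard unicode". UTF-8 is a standard *encoding* of Unicode. 
Unicode itself is a character set that assigns a number (a code point) to each character; it is 
not a byte format, and it is not limited to 16 bits. A unicode object in Python has no encoding 
at all; only byte strings do.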
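
To make the distinction concrete, here is a small sketch showing one unicode string turned 
into different byte strings by different encodings. The sample character is my own choice; the 
u'' and b'' literals are only there so the snippet runs on both Python 2.6+ and Python 3.3+.

```python
# One unicode string (no encoding attached), here LATIN SMALL LETTER E WITH ACUTE.
text = u'\u00e9'

# Encoding turns the abstract characters into concrete bytes.
utf8_bytes = text.encode('utf-8')        # two bytes:  b'\xc3\xa9'
latin1_bytes = text.encode('iso8859-1')  # one byte:   b'\xe9'

# Decoding with the *matching* encoding recovers the same unicode string.
from_utf8 = utf8_bytes.decode('utf-8')
from_latin1 = latin1_bytes.decode('iso8859-1')
```

The bytes differ, but both decode back to the same unicode string, which is why I would keep 
unicode inside the program and only encode at the edges.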

> I've added this test to the file selection method, where "result" holds the filename the user chose:
> 
>     if isinstance(result, unicode):
>         result = result.encode('iso8859-1')
>     return result

This will fail if result includes characters that are not in the iso8859-1 repertoire.
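
A minimal sketch of that failure, using made-up filenames. encode_filename is a hypothetical 
helper, not something from your program; it just wraps the same .encode() call so the error is 
visible.

```python
def encode_filename(name, encoding='iso8859-1'):
    """Hypothetical helper: return name as bytes, or None if it cannot be encoded."""
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        # Raised when a character has no representation in the target encoding.
        return None

# Accented Latin letters are in iso8859-1, so this works.
ok = encode_filename(u'r\u00e9sum\u00e9.txt')

# The euro sign (U+20AC) is not in iso8859-1, so this fails.
failed = encode_filename(u'\u20ac.txt')
```

Here ok is a byte string and failed is None. Without the try/except the second call would 
raise UnicodeEncodeError in the middle of your file selection method.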

> 
> later on self.nextfile is set to "result" .
> 
> The idea was, if I could catch the user's encoding, I could do something like:
> 
>     if isinstance(result, unicode):
>         result = result.encode(sys.stdin.encoding)
>     result = unicode(result, 'utf-8')

This is broken code that will corrupt your result string. Here is what it does:
if result is a unicode string, convert it to a byte string in the console encoding 
(sys.stdin.encoding). Then, assume that the byte string is in utf-8 encoding and convert it back 
to Unicode. Unless the console encoding happens to be utf-8, the bytes are decoded with the wrong 
encoding. Do you see why that is unlikely to have a good result?
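
Here is a sketch of what goes wrong, assuming for illustration that the console encoding is 
iso8859-1 (on your machine sys.stdin.encoding may be something else). The sample string is mine.

```python
original = u'\u00e9t\u00e9'        # the French word for "summer"
console_encoding = 'iso8859-1'     # assumed stand-in for sys.stdin.encoding

# Step 1 of the quoted code: unicode -> bytes in the console encoding.
as_bytes = original.encode(console_encoding)

# Step 2 of the quoted code: decode those bytes as if they were utf-8.
try:
    roundtrip = as_bytes.decode('utf-8')
except UnicodeDecodeError:
    roundtrip = None               # the bytes are not valid utf-8

# Decoding with the encoding that produced the bytes works fine.
correct = as_bytes.decode(console_encoding)

# The opposite mix-up does not even raise an error, it silently gives garbage.
mojibake = original.encode('utf-8').decode('iso8859-1')
```

With these values roundtrip ends up as None because the decode raises, correct equals the 
original string, and mojibake is a five-character string that no longer matches it.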

If your intent is to create a unicode string, try this:
     if not isinstance(result, unicode):
         result = result.decode(sys.stdin.encoding)
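
If you want that as a reusable function, something like the following might do. This is only a 
sketch: to_unicode is a hypothetical name, and the fallback to utf-8 is my assumption for the 
case where sys.stdin.encoding is None (which can happen when input is redirected), so pick 
whatever default suits your users.

```python
import sys

def to_unicode(value, encoding=None):
    """Hypothetical helper: return value as a unicode string.

    Unicode input is returned unchanged; byte strings are decoded.
    """
    text_type = type(u'')  # unicode on Python 2, str on Python 3
    if isinstance(value, text_type):
        return value
    if encoding is None:
        # Assumed fallback: use utf-8 when the console encoding is unknown.
        encoding = getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return value.decode(encoding)

# Byte string in a known encoding -> unicode.
name = to_unicode(b'r\xe9sum\xe9.txt', 'iso8859-1')

# Already unicode -> returned as is, no second decode.
same = to_unicode(u'r\u00e9sum\u00e9.txt')
```

Both calls give the same unicode filename, so the rest of the program never has to care how 
the input arrived.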

> 
> to avoid problems with unicode objects that have different encodings - or isn't this necessary at all ?
> 
> I'm sorry if this is a dumb question, but I'm afraid I'm a complete encoding-idiot.

This article gives a lot of good background:
http://www.joelonsoftware.com/articles/Unicode.html

I have written an essay about console encoding issues. At the end there is a collection of links to 
more general Python and Unicode articles.
http://www.pycs.net/users/0000323/stories/14.html

Kent

> 
> Thanks and best regards
> 
> Michael