How does unicode() work?

Wed Jan 9 09:33:38 EST 2008

Carsten Haese wrote:

> If that really is the line that barfs, wouldn't it make more sense to
> repr() the unicode object in the second position?
> 
> import sys
> for k in sys.stdin:
>      print '%s -> %s' % (k, repr(k.decode('iso-8859-1')))
> 
> Also, I'm not sure if the OP has told us the truth about his code and/or
> his error message. The implicit str() call done by formatting a unicode
> object with %s would raise a UnicodeEncodeError, not the
> UnicodeDecodeError that the OP is reporting. So either I need more
> coffee or there is something else going on here that hasn't come to
> light yet.

When mixing Unicode with byte strings, Python attempts to decode the 
byte string, not encode the Unicode string.

In this case, Python first inserts the non-ASCII byte string in "%s -> 
%s" and gets a byte string.  It then attempts to insert the non-ASCII 
Unicode string, and realizes that it has to convert the (partially 
built) target string to Unicode for that to work.  That conversion uses 
the default ASCII codec, which cannot decode the non-ASCII bytes already 
in the target string, so the result is a *UnicodeDecodeError*.

 >>> "%s -> %s" % ("åäö", "åäö")
'\x86\x84\x94 -> \x86\x84\x94'

 >>> "%s -> %s" % (u"åäö", u"åäö")
u'\xe5\xe4\xf6 -> \xe5\xe4\xf6'

 >>> "%s -> %s" % ("åäö", u"åäö")
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x86 ...

(The actual implementation differs a bit from the description above, but 
the behaviour is identical.)
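
To make the implicit step visible, here is a minimal sketch that spells 
out the conversion by hand.  The helper names promote() and 
mixed_format() are made up for this illustration and are not part of 
Python.  The sketch only imitates the rule described above: unlike the 
real interpreter, it always converts to Unicode, even when no Unicode 
string is involved.  It is written so that it should run unchanged on 
Python 2.6+ and Python 3.3+.

```python
def promote(value, encoding="ascii"):
    """Decode byte strings; pass Unicode strings through unchanged."""
    if isinstance(value, bytes):          # bytes is an alias for str on Python 2
        return value.decode(encoding)     # may raise UnicodeDecodeError
    return value


def mixed_format(template, *args):
    """Format after converting every byte string to Unicode first."""
    return promote(template) % tuple(promote(arg) for arg in args)


# ASCII-only byte strings convert cleanly, so mixing them is harmless.
ok = mixed_format(b"%s -> %s", b"abc", u"\xe5\xe4\xf6")

# A non-ASCII byte string cannot be decoded as ASCII, so the conversion fails.
try:
    mixed_format(b"%s -> %s", b"\x86\x84\x94", u"\xe5\xe4\xf6")
    failed = False
except UnicodeDecodeError:
    failed = True
```

The usual way out is to decode byte input explicitly, with the encoding 
it was actually written in, before mixing it with Unicode strings, as 
the k.decode('iso-8859-1') call in the quoted code does.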

</F>