How does unicode() work?
fredrik at pythonware.com
Wed Jan 9 15:33:38 CET 2008
Carsten Haese wrote:
> If that really is the line that barfs, wouldn't it make more sense to
> repr() the unicode object in the second position?
> import sys
> for k in sys.stdin:
>     print '%s -> %s' % (k, repr(k.decode('iso-8859-1')))
> Also, I'm not sure if the OP has told us the truth about his code and/or
> his error message. The implicit str() call done by formatting a unicode
> object with %s would raise a UnicodeEncodeError, not the
> UnicodeDecodeError that the OP is reporting. So either I need more
> coffee or there is something else going on here that hasn't come to
> light yet.
When mixing Unicode with byte strings, Python attempts to decode the
byte string, not encode the Unicode string.
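To make the direction of the conversion concrete, here is a small sketch. It is written for Python 3, where nothing is converted implicitly, so the two conversions are spelled out by hand; the helper name error_name is made up for illustration. It only shows which exception each direction raises with the 'ascii' codec, which is what Python 2 used for the implicit step.

```python
# Python 3 sketch; in Python 2 the decode step happened implicitly
# whenever a byte string met a Unicode string.
raw = b"\x86\x84\x94"    # a non-ASCII byte string
text = "\xe5\xe4\xf6"    # a non-ASCII Unicode string


def error_name(operation):
    """Hypothetical helper: call operation, return the name of the
    UnicodeError subclass it raises, or None if it succeeds."""
    try:
        operation()
    except UnicodeError as exc:
        return type(exc).__name__
    return None


# Byte string -> Unicode is a *decode*, so a failure is a decode error.
decode_error = error_name(lambda: raw.decode("ascii"))
# Unicode -> byte string is an *encode*, so a failure is an encode error.
encode_error = error_name(lambda: text.encode("ascii"))

print(decode_error)
print(encode_error)
```

Since mixing the two types goes through the first path, a UnicodeDecodeError is the one to expect.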
In this case, Python first inserts the non-ASCII byte string into "%s ->
%s" and gets a byte string. It then attempts to insert the non-ASCII
Unicode string, and realizes that it has to convert the (partially
built) target string to Unicode for that to work, which results in a
decoding error:
>>> "%s -> %s" % ("åäö", "åäö")
'\x86\x84\x94 -> \x86\x84\x94'
>>> "%s -> %s" % (u"åäö", u"åäö")
u'\xe5\xe4\xf6 -> \xe5\xe4\xf6'
>>> "%s -> %s" % ("åäö", u"åäö")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x86 ...
(the actual implementation differs a bit from the description above, but
the behaviour is identical).
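The two steps described above can be sketched roughly as follows. This is Python 3 code that imitates the Python 2 behaviour by hand, not the real formatting code; format_pair is a made-up helper that only handles the fixed "%s -> %s" template with a byte string first and a Unicode string second.

```python
def format_pair(first, second):
    """Hypothetical emulation of '%s -> %s' % (first, second) in Python 2,
    where first is a byte string and second is a Unicode string."""
    # Step 1: template and first argument are both byte strings,
    # so the partially built result is still a byte string.
    partial = first + b" -> "
    # Step 2: the second argument is Unicode, so the partial result has
    # to be promoted to Unicode. Python 2 used the 'ascii' codec here.
    promoted = partial.decode("ascii")
    return promoted + second


# Pure ASCII bytes survive the promotion.
print(format_pair(b"abc", "\xe5\xe4\xf6"))

# Non-ASCII bytes fail in step 2, before the Unicode argument is used.
try:
    format_pair(b"\x86\x84\x94", "\xe5\xe4\xf6")
except UnicodeDecodeError as exc:
    print(exc)
```

Note that the failure comes from the byte string that was already inserted, not from the Unicode argument, which is why the traceback complains about byte 0x86.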