"Decoding unicode is not supported" in unusual situation

John Nagle nagle at animats.com
Fri Mar 9 13:11:58 EST 2012


On 3/8/2012 2:58 PM, Prasad, Ramit wrote:
>>      Right. The real problem is that Python 2.7 doesn't have distinct
>> "str" and "bytes" types.  type(bytes()) returns <type 'str'>.
>> "str" is assumed to be ASCII 0..127, but that's not enforced.
>> "bytes" and "str" should have been distinct types, but
>> that would have broken much old code.  If they were distinct, then
>> constructors could distinguish between string type conversion
>> (which requires no encoding information) and byte stream decoding.
>>
>>      So it's possible to get junk characters in a "str", and they
>> won't convert to Unicode.  I've had this happen with databases which
>> were supposed to be ASCII, but occasionally a non-ASCII character
>> would slip through.
>
> bytes and str are just aliases for each other.

    That's true in Python 2.7, but not in 3.x.  From 2.6 forward,
"bytes" and "str" were slowly being separated.  See PEP 358.
Some of the problems in Python 2.7 come from this ambiguity.
Logically, "unicode" of "str" should be a simple type conversion
from ASCII to Unicode, while "unicode" of "bytes" should
require an encoding.  But because of the bytes/str ambiguity
in Python 2.6/2.7, the behavior couldn't be type-based.
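
    For comparison, here is a minimal sketch of the type-based behavior
in Python 3.x, where "bytes" and "str" are distinct.  The sample bytes
and the variable names are made up for illustration.

```python
# Python 3 sketch: bytes and str are distinct types, so the
# constructor can tell decoding apart from plain conversion.
raw = b"caf\xc3\xa9"          # illustrative UTF-8 bytes, not from a real database

assert bytes is not str       # in Python 2.7 these were the same type

# Decoding a byte stream takes an encoding.
text = str(raw, "utf-8")      # equivalent to raw.decode("utf-8")

# A str is already text; asking to decode it raises TypeError.
try:
    str(text, "utf-8")
    decode_str_error = None
except TypeError as exc:
    decode_str_error = str(exc)

# Non-ASCII bytes in supposedly ASCII data fail on decode
# instead of slipping through as junk characters.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

    The TypeError here is the 3.x counterpart of the "decoding Unicode
is not supported" message from 2.7: in both cases the constructor was
asked to decode something that is already text.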

				John Nagle


