"Decoding unicode is not supported" in unusual situation

Thu Mar 8 17:23:50 EST 2012

On 3/7/2012 6:18 PM, Ben Finney wrote:
> Steven D'Aprano<steve+comp.lang.python at pearwood.info>  writes:
>
>> On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:
>>> I think that's a Python bug. If the latter succeeds as a no-op, the
>>> former should also succeed as a no-op. Neither should ever get any
>>> errors when ‘s’ is a ‘unicode’ object already.
>>
>> No. The semantics of the unicode function (technically: a type
>> constructor) are well-defined, and there are two distinct behaviours:

    Right. The real problem is that Python 2.7 doesn't have distinct
"str" and "bytes" types.  type(bytes()) returns <type 'str'>.
"str" is assumed to be ASCII 0..127, but that's not enforced.
"bytes" and "str" should have been distinct types, but
that would have broken much old code.  If they were distinct, then
constructors could distinguish between string type conversion
(which requires no encoding information) and byte stream decoding.

    So it's possible to get non-ASCII bytes in a "str", and they
won't convert to Unicode with the default ASCII codec; the conversion
raises UnicodeDecodeError.  I've had this happen with databases which
were supposed to be ASCII, but occasionally a non-ASCII character
would slip through.
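
    As a rough illustration of that failure, here is a minimal sketch
written for Python 3, where "bytes" stands in for the Python 2 "str".
The function name find_non_ascii and the sample rows are made up for
this example; they are not from any real database layer.

```python
# Sketch only: in Python 3, bytes play the role Python 2 "str" played here.
def find_non_ascii(rows):
    """Return (row index, byte offset, byte value) for each non-ASCII row."""
    problems = []
    for index, raw in enumerate(rows):
        try:
            raw.decode("ascii")          # strict decode; fails on bytes > 127
        except UnicodeDecodeError as err:
            # err.start is the offset of the first offending byte
            problems.append((index, err.start, raw[err.start]))
    return problems


# Hypothetical values from a column that was supposed to be pure ASCII.
rows = [b"plain ascii", b"caf\xe9", b"also fine"]
bad = find_non_ascii(rows)
print(bad)

# Lossy fallback: each undecodable byte becomes U+FFFD instead of raising.
cleaned = rows[1].decode("ascii", errors="replace")
print(repr(cleaned))
```

    Whether to reject such rows, replace the bad bytes, or guess at
another encoding such as Latin-1 depends on where the data came from.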

    This is all different in Python 3.x, where "str" is Unicode and
"bytes" really are a distinct type.

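    A short sketch of that distinction, assuming Python 3.  The
sample strings are made up for illustration.  Here the constructor
can tell type conversion from decoding, because the argument's type
says which one is meant.

```python
# Python 3 only: "str" is text, "bytes" is a separate type.
assert bytes is not str

# Text-to-text conversion needs no encoding information.
text = str("already text")

# Decoding a byte stream requires saying which encoding to use.
decoded = str(b"caf\xc3\xa9", "utf-8")

# Supplying an encoding for something that is already text is rejected.
raised = False
try:
    str("already text", "utf-8")
except TypeError as err:
    raised = True
    print("TypeError:", err)

print(text, decoded, raised)
```

    The TypeError in the last case is the Python 3 counterpart of the
"decoding Unicode is not supported" message in the subject line.
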
				John Nagle