"Decoding unicode is not supported" in unusual situation
steve+comp.lang.python at pearwood.info
Sat Mar 10 01:57:50 CET 2012
On Fri, 09 Mar 2012 10:11:58 -0800, John Nagle wrote:
> On 3/8/2012 2:58 PM, Prasad, Ramit wrote:
>>> Right. The real problem is that Python 2.7 doesn't have distinct
>>> "str" and "bytes" types. type(bytes() returns<type 'str'> "str" is
>>> assumed to be ASCII 0..127, but that's not enforced. "bytes" and "str"
>>> should have been distinct types, but that would have broken much old
>>> code. If they were distinct, then constructors could distinguish
>>> between string type conversion (which requires no encoding
>>> information) and byte stream decoding.
>>> So it's possible to get junk characters in a "str", and they
>>> won't convert to Unicode. I've had this happen with databases which
>>> were supposed to be ASCII, but occasionally a non-ASCII character
>>> would slip through.
>> bytes and str are just aliases for each other.
> That's true in Python 2.7, but not in 3.x. From 2.6 forward,
> "bytes" and "str" were slowly being separated. See PEP 358. Some of the
> problems in Python 2.7 come from this ambiguity. Logically, "unicode" of
> "str" should be a simple type conversion from ASCII to Unicode, while
> "unicode" of "bytes" should require an encoding. But because of the
> bytes/str ambiguity in Python 2.6/2.7, the behavior couldn't be
This demonstrates a gross confusion about both Unicode and Python. John,
I honestly don't mean to be rude here, but if you actually believe that
(rather than merely expressing yourself poorly), then it seems to me that
you are desperately misinformed about Unicode and are working on the
basis of some serious misapprehensions about the nature of strings.
I recommend you start with this:
In Python 2.6/2.7, there is no ambiguity between str/bytes. The two names
are aliases for each other. The older name, "str", is a misnomer, since
it *actually* refers to bytes (and always has, all the way back to the
earliest days of Python). At best, it could be read as "byte string" or
"8-bit string", but the emphasis should always be on the *bytes*.
str is NOT "assumed to be ASCII 0..127", and it never has been. Python's
str prior to version 3.0 has *always* been bytes, it just never used that
name. For example, in Python 2.4, help(chr) explicitly documents support
for characters with ordinal 0...255:
Help on built-in function chr in module __builtin__:
chr(i) -> character
Return a string of one character with ordinal i; 0 <= i < 256.
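The same 0..255 range survives in Python 3's bytes type. A minimal sketch,
assuming Python 3 (the variable names are purely illustrative):

```python
# Python 3 sketch: a bytes object holds integers 0..255, the same range
# that chr() documented for Python 2's str.
all_values = bytes(range(256))   # one byte for every legal value

high_byte = bytes([0xCA])        # 202 is outside ASCII, yet perfectly valid

# Only values outside 0..255 are rejected; the 7-bit ASCII limit of 127
# is not the boundary.
try:
    bytes([256])
    rejected_256 = False
except ValueError:
    rejected_256 = True

print(len(all_values), high_byte, rejected_256)
```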
I can go all the way back to Python 0.9, which was so primitive it didn't
even accept "" as string delimiters, and the str type was still based on
bytes, with explicit support for non-ASCII values:
steve at runes:~/Downloads/python-0.9.1$ ./python0.9.1
>>> print 'This is *not* ASCII \xCA see the non-ASCII byte.'
This is *not* ASCII � see the non-ASCII byte.
Any conversion from bytes (including Python 2 strings) to Unicode is
ALWAYS a decoding operation. It can't possibly be anything else. If you
think that it can be, you don't understand the relationship between
strings, Unicode and bytes.
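To make that concrete, here is a small sketch, assuming Python 3 and using
the same \xCA byte as above. The codec names are just two examples chosen
for illustration; nothing in the bytes themselves says which one is right:

```python
# Python 3 sketch: the same bytes become different text depending on the
# encoding used to decode them.
raw = b'This is *not* ASCII \xca see the non-ASCII byte.'

as_latin1 = raw.decode('latin-1')   # 0xCA -> U+00CA
as_cp1251 = raw.decode('cp1251')    # 0xCA -> U+041A

# Treating the bytes as ASCII is also a decoding step, and it fails here.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(ascii(as_latin1))
print(ascii(as_cp1251))
print(ascii_error)
```

Even the "plain ASCII" case goes through a codec, which is why it raises a
UnicodeDecodeError on the non-ASCII byte rather than silently converting.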