"Decoding unicode is not supported" in unusual situation

Wed Mar 7 18:26:37 EST 2012

On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:

> John Nagle <nagle at animats.com> writes:
> 
>>    The library bug, if any, is that you can't apply
>>
>> 	unicode(s, errors='replace')
>>
>> to a Unicode string. TypeError("Decoding unicode is not supported") is
>> raised.  However
>>
>>   	unicode(s)
>>
>> will accept Unicode input.
> 
> I think that's a Python bug. If the latter succeeds as a no-op, the
> former should also succeed as a no-op. Neither should ever get any
> errors when ‘s’ is a ‘unicode’ object already.

No. The semantics of the unicode function (technically: a type 
constructor) are well-defined, and there are two distinct behaviours:

unicode(obj)

is analogous to str(obj), and it attempts to convert obj to a unicode 
string by calling obj.__unicode__, if it exists, or __str__ if it 
doesn't. No encoding or decoding is attempted in the event that obj is a 
unicode instance.

unicode(obj, encoding, errors) 

is explicitly documented as decoding obj if EITHER encoding or errors is 
given, AND as requiring obj to be either an 8-bit string (bytes) or a 
buffer object.
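
To make the two modes concrete, here is a small sketch. It is written for
Python 3, where str takes the place of Python 2's unicode and keeps the
same two-mode behaviour. The helper name try_call is made up purely for
illustration.

```python
# Python 3 sketch: str() here plays the role of Python 2's unicode().

def try_call(func, *args, **kwargs):
    """Hypothetical helper: return the result, or the exception's type name."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        return type(exc).__name__

text = 'spam'    # already a text (unicode) string
data = b'spam'   # an 8-bit string (bytes)

# Mode 1: plain conversion, no decoding. A text string passes through.
one_arg = try_call(str, text)

# Mode 2: encoding and/or errors given, so the argument must be bytes.
decoded = try_call(str, data, errors='replace')
rejected = try_call(str, text, errors='replace')

print(one_arg, decoded, rejected)
```

Under Python 2 the same three calls would be spelled with unicode instead
of str, with u'spam' as the text string.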

It is true that u''.decode() will succeed in Python 2, but the fact that 
unicode objects have a decode method at all is IMO a bug. That has been 
corrected in Python 3, where (unicode) str objects no longer have a 
decode method, and bytes objects no longer have an encode method.
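
A quick way to check the Python 3 side of that claim, as a minimal sketch
(the variable names are just for illustration):

```python
# Python 3: text strings can only be encoded, bytes can only be decoded.
text_has_decode = hasattr('', 'decode')     # str has no decode() method
bytes_has_encode = hasattr(b'', 'encode')   # bytes has no encode() method

# The one sensible round trip: text -> bytes -> text.
round_trip = 'caf\u00e9'.encode('utf-8').decode('utf-8')

print(text_has_decode, bytes_has_encode, round_trip == 'caf\u00e9')
```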

>> The Python documentation
>> ("http://docs.python.org/library/functions.html#unicode") does not
>> mention this.

Yes it does. It is the SECOND sentence, immediately after the summary 
line:

unicode([object[, encoding[, errors]]])
    Return the Unicode string version of object using one of the
    following modes:

    If encoding and/or errors are given, unicode() will decode the object
    which can either be an 8-bit string or a character buffer using the
    codec for encoding. ...

Admittedly, it doesn't *explicitly* state that TypeError will be raised, 
but what other kind of exception would you expect when you supply an 
argument of the wrong type?

>> It is therefore necessary to check the type before
>> calling "unicode", or catch the undocumented TypeError exception
>> afterward.
> 
> Yes, this check should not be necessary; calling the ‘unicode’
> constructor with an object that's already an instance of ‘unicode’
> should just return the object as-is, IMO. It shouldn't matter that
> you've specified how decoding errors are to be handled, because in that
> case no decoding happens anyway.

I don't believe that it is the job of unicode() to Do What I Mean, but 
only to Do What I Say. If I *explicitly* tell unicode() to decode the 
argument (by specifying either the codec or the error handler or both) 
then it should not second-guess me and ignore the extra parameters.

End-user applications may, with care, try to be smart and DWIM, but 
library functions should be dumb and should do what they are told.
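
If an application really does want the DWIM behaviour, it can wrap the
type check in its own helper. A minimal sketch, again for Python 3: the
name to_text and its defaults are assumptions for illustration, not
anything from the standard library.

```python
def to_text(obj, encoding='utf-8', errors='replace'):
    """Hypothetical application-level helper: decode bytes, pass text through."""
    if isinstance(obj, str):
        # Already text, so there is nothing to decode.
        return obj
    if isinstance(obj, (bytes, bytearray)):
        # Explicit decode, using the caller's codec and error handler.
        return str(obj, encoding, errors)
    raise TypeError('expected text or bytes, got %s' % type(obj).__name__)

print(to_text('spam'), to_text(b'spam'))
```

In Python 2 the same idea would test isinstance(obj, unicode) first and
call unicode(obj, encoding, errors) for byte strings. The decision to
skip decoding lives in the application, not in the constructor.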

-- 
Steven