"Decoding unicode is not supported" in unusual situation

Terry Reedy tjreedy at udel.edu
Thu Mar 8 01:03:41 CET 2012


On 3/7/2012 6:26 PM, Steven D'Aprano wrote:
> On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:
>
>> John Nagle <nagle at animats.com> writes:
>>
>>>     The library bug, if any, is that you can't apply
>>>
>>> 	unicode(s, errors='replace')
>>>
>>> to a Unicode string. TypeError("Decoding unicode is not supported") is
>>> raised.  However
>>>
>>>    	unicode(s)
>>>
>>> will accept Unicode input.
>>
>> I think that's a Python bug. If the latter succeeds as a no-op, the
>> former should also succeed as a no-op. Neither should ever get any
>> errors when ‘s’ is a ‘unicode’ object already.

> No. The semantics of the unicode function (technically: a type
> constructor) are well-defined, and there are two distinct behaviours:
>
> unicode(obj)
>
> is analogous to str(obj), and it attempts to convert obj to a unicode
> string by calling obj.__unicode__, if it exists, or __str__ if it
> doesn't. No encoding or decoding is attempted in the event that obj is a
> unicode instance.
>
> unicode(obj, encoding, errors)
>
> is explicitly stated in the docs as decoding obj if EITHER of encoding or
> errors is given, AND that obj must be either an 8-bit string (bytes) or a
> buffer object.
>
> It is true that u''.decode() will succeed, in Python 2, but the fact that
> unicode objects have a decode method at all is IMO a bug. It has also

I believe that is because in Py 2, codecs and .encode/.decode were also
used for same-type recoding such as base64 and uu coding. That was
simplified in Py3 so that 'decoding' is bytes to string and 'encoding' is
string to bytes, and base64, etc., are done only in their separate
modules rather than duplicated in the codecs machinery.
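
As a rough illustration of that split, here is a small sketch written for
Python 3. The sample strings and variable names are made up for the example
and do not come from the thread.

```python
import base64

text = "caf\u00e9"                    # str: a sequence of Unicode code points
data = text.encode("utf-8")           # 'encoding' is str -> bytes
assert data == b"caf\xc3\xa9"
assert data.decode("utf-8") == text   # 'decoding' is bytes -> str

# In Python 3, str has no decode method and bytes has no encode method.
str_has_decode = hasattr(str, "decode")
bytes_has_encode = hasattr(bytes, "encode")

# Same-type (bytes -> bytes) recoding such as base64 lives in its own module.
b64 = base64.b64encode(b"hello")
roundtrip = base64.b64decode(b64)
```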

> been corrected in Python 3, where (unicode) str objects no longer have a
> decode method, and bytes objects no longer have an encode method.
>
>
>>> The Python documentation
>>> ("http://docs.python.org/library/functions.html#unicode") does not
>>> mention this.
>
> Yes it does. It is the SECOND sentence, immediately after the summary
> line:
>
> unicode([object[, encoding[, errors]]])
>      Return the Unicode string version of object using one of the
>      following modes:
>
>      If encoding and/or errors are given, unicode() will decode the object
>      which can either be an 8-bit string or a character buffer using the
>      codec for encoding. ...
>
>
> Admittedly, it doesn't *explicitly* state that TypeError will be raised,
> but what other exception kind would you expect when you supply an
> argument of the wrong type?

What you have correctly pointed out is that there is no discrepancy 
between doc and behavior and hence no bug for the purpose of the 
tracker. Thanks.
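
For anyone wanting to see the two modes side by side: the sketch below uses
Python 3's str(), which as far as I know keeps the same rule (giving encoding
and/or errors means "decode this bytes-like object"). It is an analogue of the
Py 2 unicode() behavior discussed above, not a transcript of it, and the
sample values are invented.

```python
# Mode 1: no encoding/errors given -> plain conversion, no decoding attempted.
s = "already text"
same = str(s)

# Mode 2: encoding and/or errors given -> the object must be bytes-like.
decoded = str(b"caf\xc3\xa9", "utf-8")
replaced = str(b"\xff", errors="replace")   # undecodable byte becomes U+FFFD

# Asking for decoding of something that is already text is a type error.
raised = False
try:
    str(s, errors="replace")
except TypeError:
    raised = True
```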

>>> It is therefore necessary to check the type before
>>> calling "unicode", or catch the undocumented TypeError exception
>>> afterward.
>>
>> Yes, this check should not be necessary; calling the ‘unicode’
>> constructor with an object that's already an instance of ‘unicode’
>> should just return the object as-is, IMO. It shouldn't matter that
>> you've specified how decoding errors are to be handled, because in that
>> case no decoding happens anyway.
>
> I don't believe that it is the job of unicode() to Do What I Mean, but
> only to Do What I Say. If I *explicitly* tell unicode() to decode the
> argument (by specifying either the codec or the error handler or both)
> then it should not second-guess me and ignore the extra parameters.
>
> End-user applications may, with care, try to be smart and DWIM, but
> library functions should be dumb and should do what they are told.
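
If an application does want the DWIM behavior, the type check can live in a
small wrapper of its own. The following is only a sketch for Python 3; the
name to_text and its defaults are hypothetical, chosen for illustration.

```python
def to_text(obj, encoding="utf-8", errors="replace"):
    """Return obj as text, decoding only when obj is actually bytes-like.

    Hypothetical helper: the name and defaults are assumptions for this
    example, not part of any library.
    """
    if isinstance(obj, str):
        # Already text: return it unchanged, no decoding is attempted.
        return obj
    if isinstance(obj, (bytes, bytearray)):
        # Bytes-like: decode explicitly with the requested error handler.
        return obj.decode(encoding, errors)
    # Anything else: fall back to the ordinary string conversion.
    return str(obj)
```

This keeps the library constructor strict while the application decides, in
one place, how lenient it wants to be.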

-- 
Terry Jan Reedy




