"Decoding unicode is not supported" in unusual situation
steve+comp.lang.python at pearwood.info
Thu Mar 8 00:26:37 CET 2012
On Thu, 08 Mar 2012 08:48:58 +1100, Ben Finney wrote:
> John Nagle <nagle at animats.com> writes:
>> The library bug, if any, is that you can't apply
>> unicode(s, errors='replace')
>> to a Unicode string. TypeError("Decoding unicode is not supported") is
>> raised. However
>> unicode(s)
>> will accept Unicode input.
> I think that's a Python bug. If the latter succeeds as a no-op, the
> former should also succeed as a no-op. Neither should ever get any
> errors when ‘s’ is a ‘unicode’ object already.
No. The semantics of the unicode function (technically: a type
constructor) are well-defined, and there are two distinct behaviours:
unicode(obj)
is analogous to str(obj), and it attempts to convert obj to a unicode
string by calling obj.__unicode__, if it exists, or __str__ if it
doesn't. No encoding or decoding is attempted in the event that obj is a
unicode instance.
unicode(obj, encoding, errors)
is explicitly stated in the docs as decoding obj if EITHER of encoding or
errors is given, AND that obj must be either an 8-bit string (bytes) or a
character buffer.
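
To illustrate the two modes, here is a minimal sketch that runs on Python 3, where str plays the role of Python 2's unicode and keeps the same split between conversion and decoding. The Greeting class is a hypothetical example, not something from this thread.

```python
# Python 3 sketch: str is the text type that Python 2 called unicode.

class Greeting:
    """Hypothetical class, used only to show the conversion mode."""
    def __str__(self):
        return "hello"

# Mode 1: a single argument. Converts via __str__; nothing is decoded.
converted = str(Greeting())
same_text = str("already text")

# Mode 2: encoding and/or errors given. The argument must be bytes-like.
decoded = str(b"caf\xc3\xa9", "utf-8")
replaced = str(b"caf\xff", "utf-8", errors="replace")

# Asking to decode something that is already text raises TypeError.
try:
    str("already text", errors="replace")
except TypeError as exc:
    decode_error = exc
else:
    decode_error = None
```

In this sketch decode_error ends up holding a TypeError, which is the Python 3 counterpart of the "Decoding unicode is not supported" error under discussion.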
It is true that u''.decode() will succeed, in Python 2, but the fact that
unicode objects have a decode method at all is IMO a bug. It has also
been corrected in Python 3, where (unicode) str objects no longer have a
decode method, and bytes objects no longer have an encode method.
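
A quick way to see the Python 3 situation for yourself (a small sketch, assuming any Python 3 interpreter):

```python
# Python 3: text (str) can only be encoded, bytes can only be decoded.
text_has_decode = hasattr("", "decode")
bytes_has_encode = hasattr(b"", "encode")

# The only round trip is text -> bytes -> text.
round_trip = "na\u00efve".encode("utf-8").decode("utf-8")
```

Both hasattr checks come out False, so the accidental u''.decode() of Python 2 has no equivalent.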
>> The Python documentation
>> ("http://docs.python.org/library/functions.html#unicode") does not
>> mention this.
Yes it does. It is the SECOND sentence, immediately after the summary
line:
unicode([object[, encoding[, errors]]])
Return the Unicode string version of object using one of the
following modes:
If encoding and/or errors are given, unicode() will decode the object
which can either be an 8-bit string or a character buffer using the
codec for encoding. ...
Admittedly, it doesn't *explicitly* state that TypeError will be raised,
but what other exception kind would you expect when you supply an
argument of the wrong type?
>> It is therefore necessary to check the type before
>> calling "unicode", or catch the undocumented TypeError exception
> Yes, this check should not be necessary; calling the ‘unicode’
> constructor with an object that's already an instance of ‘unicode’
> should just return the object as-is, IMO. It shouldn't matter that
> you've specified how decoding errors are to be handled, because in that
> case no decoding happens anyway.
I don't believe that it is the job of unicode() to Do What I Mean, but
only to Do What I Say. If I *explicitly* tell unicode() to decode the
argument (by specifying either the codec or the error handler or both)
then it should not second-guess me and ignore the extra parameters.
End-user applications may, with care, try to be smart and DWIM, but
library functions should be dumb and should do what they are told.
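
An application that does want the lenient behaviour can put that decision in its own code rather than expecting the constructor to guess. A minimal sketch for Python 3 (where str is the unicode type); to_text is a hypothetical name and not part of any library, and the utf-8 default is an assumption made for illustration:

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Hypothetical application-level helper, not a standard library function.

    Text passes through untouched, bytes-like input is decoded as requested,
    and anything else is converted with str().
    """
    if isinstance(value, str):
        return value  # already text: there is nothing to decode
    if isinstance(value, (bytes, bytearray)):
        return str(value, encoding, errors)
    return str(value)
```

For example, to_text("abc") returns the string unchanged, while to_text(b"caf\xff", errors="replace") decodes and substitutes U+FFFD for the bad byte. The type check is explicit and sits in the caller's code, where the DWIM policy belongs.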