[Python-Dev] str() vs. unicode()

Guido van Rossum guido@python.org
Fri, 21 Sep 2001 10:59:27 -0400


> I'd like to query for the common opinion on an issue which I've
> run into when trying to resynchronize unicode() and str() in terms
> of what happens when you pass arbitrary objects to these constructors
> which happen to implement tp_str (or __str__ for instances).
> 
> Currently, str() will accept any object which supports the tp_str
> interface and revert to tp_repr in case that slot should not
> be available.
> 
> unicode() supported strings, character buffers and instances
> having a __str__ method before yesterday's checkins.
> 
> Now the goal of the checkins was to make str() and unicode()
> behave in a more compatible fashion. Both should accept
> the same kinds of objects and raise exceptions for all others.

Well, historically, str() has rarely raised exceptions, because
there's a default implementation (same as for repr(), returning <FOO
object at ADDRESS>).  This is used when neither tp_repr nor tp_str is
set.  Note that PyObject_Str() never looks at __str__ -- this is done
by the tp_str handler of instances (and now also by the tp_str handler
of new-style classes).  I see no reason to change this.
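
To illustrate the fallback chain described above, here is a minimal
sketch.  The class names are hypothetical and exist only for this
example; the point is that str() succeeds for all three objects.

```python
class Plain(object):
    """Defines neither __str__ nor __repr__."""


class ReprOnly(object):
    def __repr__(self):
        return "ReprOnly()"


class Both(object):
    def __repr__(self):
        return "Both()"

    def __str__(self):
        return "a friendly Both"


# No customization at all: the default "<... object at ADDRESS>" form.
default_text = str(Plain())

# No __str__, so str() falls back to the repr.
repr_text = str(ReprOnly())

# __str__ is defined, so it wins over __repr__.
str_text = str(Both())
```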

The question then becomes, do we want unicode() to behave similarly?

> The path I chose was to fix PyUnicode_FromEncodedObject()
> to also accept tp_str compatible objects. This API is used
> by the unicode_new() constructor (which is exposed as unicode()
> in Python) to create a Unicode object from the input object.
> 
> str() OTOH uses PyObject_Str() via string_new().
> 
> Now there also is a PyObject_Unicode() API which tries to
> mimic PyObject_Str(). However, it does not support the additional
> encoding and errors arguments which the unicode() constructor
> has.
> 
> The problem which Guido raised about my checkins was that
> the changes to PyUnicode_FromEncodedObject() are seen not
> only in unicode(), but also all other instances where this
> API is used.
> 
> OTOH, PyUnicode_FromEncodedObject() is the most generic constructor
> for Unicode objects there currently is in Python.
> 
> So the questions are
> - should I revert the change in PyUnicode_FromEncodedObject()
>   and instead extend PyObject_Unicode() to support encodings ?
> - should we make PyObject_Unicode() use
>   PyUnicode_FromEncodedObject() instead of providing its
>   own implementation ?
> 
> The overall picture of all this auto-conversion stuff going
> on in str() and unicode() is very confusing. Perhaps what
> we really need is first to agree on a common understanding
> of which auto-conversion should take place and then make
> str() and unicode() support exactly the same interface ?!
> 
> PS: Also see patch #446754 by Walter Dörwald:
> http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470

OK, let's take a step back.

The str() function (now constructor) converts *anything* to a string;
tp_str and tp_repr exist to allow objects to customize this.  These
slots, and the str() function, take no additional arguments.  To
invoke the equivalent of str() from C, you call PyObject_Str().  I see
no reason to change this; we may want to make the Unicode situation as
similar as possible.

The unicode() function (now constructor) traditionally converted only
8-bit strings to Unicode strings, with additional arguments to specify
the encoding (and error handling preference).  There is no tp_unicode
slot, but for some reason there are at least three C APIs that could
correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject()
take a single object argument, and PyUnicode_FromEncodedObject() takes
object, encoding, and error arguments.

The first question is, do we want the unicode() constructor to be
applicable in all cases where the str() constructor is?  I guess that
we do, since we want to be able to print to streams that support
Unicode.  Unicode strings render themselves as Unicode characters to
such a stream, and it's reasonable to allow other objects to also
customize their rendition in Unicode.

Now, what should be the signature of this conversion?  If we print
object X to a Unicode stream, should we invoke unicode(X), or
unicode(X, encoding, error)?  I believe it should be just unicode(X),
since the encoding used by the stream shouldn't enter into the picture
here: that's just used for converting Unicode characters written to
the stream to some external format.
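
The separation argued for above can be sketched as follows.  This is
only a model written for a current Python, where str() plays the role
that unicode() plays in this discussion; Temperature and
print_to_stream are hypothetical names.  The object is converted to
text without any encoding argument, and the stream alone decides how
that text becomes bytes.

```python
import io


class Temperature(object):
    def __init__(self, degrees):
        self.degrees = degrees

    def __str__(self):
        # U+00B0 is the degree sign.
        return "%d\u00b0C" % self.degrees


def print_to_stream(obj, stream_encoding):
    """Write obj to an in-memory text stream and return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=stream_encoding)
    text = str(obj)      # conversion step: no encoding involved
    stream.write(text)   # the stream applies its own encoding
    stream.flush()
    stream.detach()      # keep the underlying buffer open
    return raw.getvalue()


# Same object, same conversion, two different external formats.
utf8_bytes = print_to_stream(Temperature(21), "utf-8")
latin1_bytes = print_to_stream(Temperature(21), "latin-1")
```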

How should an object be allowed to customize its Unicode rendition?
We could add a tp_unicode slot to the type object, but there's no
need: we can just look for a __unicode__ method and call it if it
exists.  The signature of __unicode__ should take no further
arguments: unicode(X) should call X.__unicode__().  As a fallback, if
the object doesn't have a __unicode__ attribute, PyObject_Str() should
be called and the resulting string converted to Unicode using the
default encoding.
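
The lookup order proposed above might be modelled like this.  It is a
sketch, not the actual implementation: one_arg_unicode, Price and
Legacy are hypothetical names, and it is written for a current Python
where str() already returns text, so the "decode with the default
encoding" step has no counterpart and is only noted in a comment.

```python
def one_arg_unicode(obj):
    """Hypothetical model of the proposed one-argument unicode(X)."""
    # 1. Prefer the object's own Unicode rendition, if it offers one.
    method = getattr(obj, "__unicode__", None)
    if method is not None:
        return method()
    # 2. Otherwise fall back to the str() rendition.  In the proposal
    #    this is an 8-bit string that would then be decoded using the
    #    default encoding; here str() already yields text.
    return str(obj)


class Price(object):
    def __init__(self, amount):
        self.amount = amount

    def __str__(self):
        return "%d EUR" % self.amount

    def __unicode__(self):
        # U+20AC is the euro sign.
        return "%d \u20ac" % self.amount


class Legacy(object):
    def __str__(self):
        return "legacy"


price_text = one_arg_unicode(Price(5))    # uses __unicode__
legacy_text = one_arg_unicode(Legacy())   # falls back to __str__
number_text = one_arg_unicode(3)          # falls back to str()
```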

Regarding the "long form" of unicode(), unicode(X, encoding, error), I
see no reason to treat this with the same generality.  This form
should restrict X to something that supports the buffer API (IOW,
8-bit string objects and things that are treated the same as these in
most situations).  (Note that it already balks when X is a Unicode
string.)
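
A rough model of the restricted long form, again for a current Python
and with a hypothetical function name: the argument must expose its
bytes through the buffer interface (modelled here with memoryview),
and anything else, including text, is rejected rather than converted
via __str__.

```python
def three_arg_unicode(obj, encoding, errors="strict"):
    """Hypothetical model of unicode(X, encoding, errors)."""
    if isinstance(obj, str):
        # Text is already decoded; the long form balks at it.
        raise TypeError("decoding text is not supported")
    try:
        view = memoryview(obj)
    except TypeError:
        raise TypeError("expected an object supporting the buffer interface")
    # Take the bytes the object provides and decode them as requested.
    return view.tobytes().decode(encoding, errors)


from_bytes = three_arg_unicode(b"caf\xc3\xa9", "utf-8")
from_bytearray = three_arg_unicode(bytearray(b"abc"), "ascii")
replaced = three_arg_unicode(b"\xff", "ascii", "replace")
```

Passing an integer or a text string to three_arg_unicode() raises
TypeError in this sketch; no __str__ or __unicode__ lookup happens.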

So about those C APIs: I propose that PyObject_Unicode() correspond to
the one-arg form of unicode(), taking any kind of object, and that
PyUnicode_FromEncodedObject() correspond to the three-arg form.
PyUnicode_FromObject() shouldn't really need to exist.  I don't see a
reason for PyUnicode_From[Encoded]Object() to use the __unicode__
customization -- it should just take the bytes provided by the object
and decode them according to the given encoding.  PyObject_Unicode(),
on the other hand, should look for __unicode__ first and then
PyObject_Str().

I hope this helps.

--Guido van Rossum (home page: http://www.python.org/~guido/)