[Python-Dev] str() vs. unicode()

Mon, 24 Sep 2001 17:32:50 +0200

Guido van Rossum wrote:
> 
> > > Well, historically, str() has rarely raised exceptions, because
> > > there's a default implementation (same as for repr(), returning <FOO
> > > object at ADDRESS>).  This is used when neither tp_repr nor tp_str is
> > > set.  Note that PyObject_Str() never looks at __str__ -- this is done
> > > by the tp_str handler of instances (and now also by the tp_str handler
> > > of new-style classes).  I see no reason to change this.
> >
> > Me neither; what str() does not do (and unicode() does) is try
> > the char buffer interface before trying tp_str.
> 
> The meanings of these two are different: tp_str means "give me a
> string that's useful for printing"; the buffer API means "let me treat
> you as a sequence of 8-bit bytes (or 8-bit characters)".  They are
> different e.g. when you consider a PIL image, whose str() probably
> returns something like '<PIL image WxHxD>' while its buffer API
> probably gives access to the raw image buffer.
> 
> The str() function should map directly to tp_str().  You *might* claim
> that the 8-bit string type constructor *ought to* look at the buffer
> API, but I'd say that it's easy enough for a type to provide a tp_str
> implementation that does what the type wants.  I guess "convert
> yourself to string" is different than "display yourself as a string".

Sure is :-)

Ok, so let's remove the buffer API check from the list of str()/
unicode() conversion checks.
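
A small illustration of the "display" versus "raw bytes" distinction discussed above. This is a sketch in present-day Python 3, not the 2.2 code under discussion; the array object is only an assumed stand-in for something like the PIL image mentioned in the quote.

```python
from array import array

# An object that supports the buffer protocol and also has a printable form.
pixels = array("B", [72, 105])   # two unsigned bytes

# "Display yourself as a string": the printable rendition.
display = str(pixels)

# "Treat you as a sequence of 8-bit bytes": the raw contents via the buffer.
raw = bytes(pixels)
view = memoryview(pixels)

print(display)
print(raw)
print(len(view))
```

The two results differ, which is the reason for keeping the buffer lookup out of str().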

> > > The question then becomes, do we want unicode() to behave similarly?
> >
> > Given that porting an application from strings to Unicode should
> > be easy, I'd say: yes.
> 
> Fearing this ends up being a trick question, I'll say +0.  If we end
> up with something I don't like, I reserve the right to change my
> opinion on this.

Ok.

> > > The str() function (now constructor) converts *anything* to a string;
> > > tp_str and tp_repr exist to allow objects to customize this.  These
> > > slots, and the str() function, take no additional arguments.  To
> > > invoke the equivalent of str() from C, you call PyObject_Str().  I see
> > > no reason to change this; we may want to make the Unicode situation as
> > > similar as possible.
> >
> > Right.
> >
> > > The unicode() function (now constructor) traditionally converted only
> > > 8-bit strings to Unicode strings,
> >
> > Slightly incorrect: it converted 8-bit strings, objects compatible
> > with the char buffer interface and instances having a __str__ method to
> > Unicode.
> 
> That's a rather random collection of APIs, if you ask me...

It was modelled after the PyObject_Str() API at the time. Don't
know how the buffer interface ended up in there, but I guess
it was a left-over from early revisions in the design.

> Also, do you really mean *instances* (i.e. objects for which
> PyInstance_Check() returns true), or do you mean anything for which
> getattr(x, "__str__") is true?

Looking at the code from Python 2.1:

    if (!PyInstance_Check(v) ||
        (func = PyObject_GetAttr(v, strstr)) == NULL) {
        PyErr_Clear();
        res = PyObject_Repr(v);
    }
    else {
        res = PyEval_CallObject(func, (PyObject *)NULL);
        Py_DECREF(func);
    }

... instances which have the __str__ attribute.

> If the latter, you're in for a
> surprise in 2.2 -- almost all built-in objects now respond to that
> method, due to the type/class unification: whenever something has a
> tp_str slot, a __str__ attribute is synthesized (and vice versa).
> (Exceptions are a few obscure types and maybe 3rd party extension
> types.)

Nice :-)

> > To synchronize unicode() with str() we'd have to replace the __str__
> > lookup with a tp_str lookup (this will also allow things like unicode(2)
> > and unicode(instance_having__str__)) and maybe also add the charbuf
> > lookup to str() (this would make str() compatible with memory mapped
> > files and probably a few other char buffer aware objects as well).
> 
> I definitely don't want the latter change to str(); see above.  If you
> want unicode(x) to behave as much like str(x) as possible, I recommend
> removing the use of the buffer API.

Ok, let's remove the buffer API from unicode(). Should it still be
maintained for unicode(obj, encoding, errors) ?

> > Note that in a discussion we had some time ago we decided that __str__
> > should be allowed to return Unicode objects as well (instead of
> > defining a separate __unicode__ method/slot for this purpose). str()
> > will convert a Unicode return value to an 8-bit string using the
> > default encoding while unicode() takes the return value as-is.
> >
> > This was done to simplify moving from strings to Unicode.
> 
> I'm now not so sure if this was the right decision.

Hmm, perhaps we do need a __unicode__/tp_unicode slot after all. 
It would certainly help clarify the communication between the 
interpreter and the object.
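
To make the earlier decision concrete, here is a rough model of "one rendering hook that may return either string type". It is a sketch in Python 3, where bytes stands in for the 8-bit string and str for Unicode; the names Label, render, str_8bit, unicode_as_is and DEFAULT_ENCODING are hypothetical and exist only for this illustration.

```python
DEFAULT_ENCODING = "ascii"  # assumed default encoding for this sketch


class Label:
    """Hypothetical object whose single rendering hook may return Unicode."""

    def __init__(self, text):
        self.text = text

    def render(self):
        # Stand-in for a __str__ that is allowed to return a Unicode object.
        return self.text


def str_8bit(obj):
    """Model of str(): a Unicode result is encoded with the default encoding."""
    result = obj.render()
    if isinstance(result, str):
        return result.encode(DEFAULT_ENCODING)
    return result


def unicode_as_is(obj):
    """Model of unicode(): a Unicode result is taken unchanged."""
    result = obj.render()
    if isinstance(result, bytes):
        return result.decode(DEFAULT_ENCODING)
    return result
```

With this model, unicode_as_is(Label("caf\xe9")) succeeds while str_8bit() on the same object raises UnicodeEncodeError, which is the kind of asymmetry a separate __unicode__ hook would make explicit.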

> > > with additional arguments to specify
> > > the encoding (and error handling preference).  There is no tp_unicode
> > > slot, but for some reason there are at least three C APIs that could
> > > correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject()
> > > take a single object argument, and PyUnicode_FromEncodedObject() takes
> > > object, encoding, and error arguments.
> > >
> > > The first question is, do we want the unicode() constructor to be
> > > applicable in all cases where the str() constructor is?
> >
> > Yes.
> >
> > > I guess that
> > > we do, since we want to be able to print to streams that support
> > > Unicode.  Unicode strings render themselves as Unicode characters to
> > > such a stream, and it's reasonable to allow other objects to also
> > > customize their rendition in Unicode.
> > >
> > > Now, what should be the signature of this conversion?  If we print
> > > object X to a Unicode stream, should we invoke unicode(X), or
> > > unicode(X, encoding, error)?  I believe it should be just unicode(X),
> > > since the encoding used by the stream shouldn't enter into the picture
> > > here: that's just used for converting Unicode characters written to
> > > the stream to some external format.
> > >
> > > How should an object be allowed to customize its Unicode rendition?
> > > We could add a tp_unicode slot to the type object, but there's no
> > > need: we can just look for a __unicode__ method and call it if it
> > > exists.  The signature of __unicode__ should take no further
> > > arguments: unicode(X) should call X.__unicode__().  As a fallback, if
> > > the object doesn't have a __unicode__ attribute, PyObject_Str() should
> > > be called and the resulting string converted to Unicode using the
> > > default encoding.
> >
> > I'd rather leave things as they are: __str__/tp_str are allowed
> > to return Unicode objects and if an object wishes to be rendered
> > as Unicode it can simply return a Unicode object through the
> > __str__/tp_str interface.
> 
> Can you explain your motivation?  In the long run, it seems better to
> me to think of __str__ as "render as 8-bit string" and __unicode__ as
> "render as Unicode string".

The motivation was the idea of a unification of strings and Unicode.
You may be right, though, that this idea is not really practical.

> > > Regarding the "long form" of unicode(), unicode(X, encoding, error), I
> > > see no reason to treat this with the same generality.  This form
> > > should restrict X to something that supports the buffer API (IOW,
> > > 8-bit string objects and things that are treated the same as these in
> > > most situations).
> >
> > Hmm, but this would restrict users from implementing string like
> > objects (i.e. objects having a __str__ method to make them compatible
> > with str()).
> 
> Having __str__ doesn't make something a string-like object!  A
> string-like object (at least the way I understand this term) would
> behave like a string, e.g. have string methods.  The UserString module
> is an example, and in 2.2 subclasses of the 'str' type are prime
> examples.
> 
> To convert one of these to Unicode given an encoding, shouldn't their
> decode() method be used?

Right... perhaps we don't need __unicode__ after all: the .decode()
method already provides this functionality (on strings at least).
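
For reference, this is what decoding through the method looks like. The sketch uses Python 3, where the 8-bit string type is spelled bytes; the sample data is made up.

```python
# An 8-bit string holding Latin-1 encoded text (sample data for illustration).
data = b"gr\xfc\xdf"

# Decode with an explicit encoding; the default error handling is "strict".
text = data.decode("latin-1")

# The second argument selects the error handling, as in the long form of
# unicode(obj, encoding, errors).
lenient = data.decode("ascii", "replace")

print(text)
print(lenient)
```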

> > > (Note that it already balks when X is a Unicode
> > > string.)
> >
> > True -- since you normally cannot decode Unicode into Unicode using
> > some 8-bit character encoding. As a result encodings which convert
> > Unicode to Unicode (e.g. normalizations) cannot use this interface,
> > but since these are probably only rarely used, I think it's better
> > to prevent accidental usage of an 8-bit character codec on Unicode.
> 
> Sigh.  More special cases.  Unicode objects do have a tp_str/__str__
> slot, but they are not acceptable to unicode().
> 
> Really, this is such an incredible morass of APIs that I wonder if we
> shouldn't start over...  There are altogether too many places in the
> code where PyUnicode_Check() is used.  I wish there was a better
> way...

Ideally, we'd need a new base class for strings and then have 8-bit 
and Unicode be subclasses of this base class. There are several
problems with this approach though; one certainly being the different
memory allocation mechanisms used (strings store the value in the
object, Unicode references an external buffer), the other
being the different nature: strings don't carry meta-information
while Unicode is in many ways restricted in use.

> > > So about those C APIs: I propose that PyObject_Unicode() correspond to
> > > the one-arg form of unicode(), taking any kind of object, and that
> > > PyUnicode_FromEncodedObject() correspond to the three-arg form.
> >
> > Ok. I'll fix this once we've reached consensus on what to do
> > about str() and unicode().
> 
> Alas, this is harder than we seem to have thought, collectively.  I
> want someone to sit back and rethink how this should eventually work
> (say in Python 2.9), and then work backwards from there to a
> reasonable API to be used in 2.2.  The current piling of hack upon
> hack seems hopeless.

Agreed.

> We have some time: 2.2a4 will be released this week, but 2.2b1 isn't
> due until Oct 10, and we can even slip that a bit.  Compatibility with
> previous 2.2 alpha releases is not necessary; the hard compatibility
> baseline is 2.1.1.
> 
> > > PyUnicode_FromObject() shouldn't really need to exist.
> >
> > Note: PyUnicode_FromObject() was extended by PyUnicode_FromEncodedObject()
> > and only exists for backward compatibility reasons.
> 
> Excellent.

I would like to boil this down to one API if possible which then
implements unicode(obj) and unicode(obj, encoding, errors): if
no encoding is given, the semantics of PyObject_Str() are closely
followed; with an encoding, the previous semantics of
PyUnicode_FromEncodedObject() are used (with the buffer interface
logic removed).

In a first step, I'd use the tp_str/__str__ for unicode(obj) as
well. Later we can add a tp_unicode/__unicode__ lookup before
trying tp_str/__str__ as fallback.

If this sounds reasonable, I'll give it a go...
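
A minimal sketch of how the single entry point described above might behave, written in Python 3 with bytes standing in for 8-bit strings. The function name unicode_conversion is hypothetical, and the __unicode__ lookup is the possible later step, not something that exists today.

```python
def unicode_conversion(obj, encoding=None, errors="strict"):
    """Hypothetical model of one API behind unicode(obj[, encoding, errors])."""
    if encoding is None:
        # Possible later step: consult a __unicode__ hook before falling back.
        hook = getattr(type(obj), "__unicode__", None)
        if hook is not None:
            return hook(obj)
        # First step: follow str() semantics, so any object is accepted.
        return str(obj)
    # With an encoding, only 8-bit strings are decoded; no buffer API lookup.
    if not isinstance(obj, bytes):
        raise TypeError(
            "decoding requires an 8-bit string, got %s" % type(obj).__name__
        )
    return obj.decode(encoding, errors)


class Greeting:
    """Hypothetical object that customizes its Unicode rendition."""

    def __unicode__(self):
        return "gr\xfc\xdf dich"


print(unicode_conversion(2))
print(unicode_conversion(b"caf\xc3\xa9", "utf-8"))
print(unicode_conversion(Greeting()))
```

In this model the one-argument form works on integers and arbitrary objects, while the long form rejects Unicode input and buffer-only objects such as memoryview.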

> > > I don't see a
> > > reason for PyUnicode_From[Encoded]Object() to use the __unicode__
> > > customization -- it should just take the bytes provided by the object
> > > and decode them according to the given encoding.  PyObject_Unicode(),
> > > on the other hand, should look for __unicode__ first and then
> > > PyObject_Str().
> > >
> > > I hope this helps.
> >
> > Thanks for the summary.
> 
> Alas, we're not done. :-(
> 
> I don't have much time for this -- there still are important pieces of
> the type/class unification missing (e.g. comparisons and pickling
> don't work right, and _ must be able to make __dynamic__ the default).

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/