I'd like to query for the common opinion on an issue which I've run into when trying to resynchronize unicode() and str() in terms of what happens when you pass arbitrary objects to these constructors which happen to implement tp_str (or __str__ for instances).

Currently, str() will accept any object which supports the tp_str interface and revert to tp_repr in case that slot should not be available. unicode() supported strings, character buffers and instances having a __str__ method before yesterday's checkins.

Now the goal of the checkins was to make str() and unicode() behave in a more compatible fashion. Both should accept the same kinds of objects and raise exceptions for all others.

The path I chose was to fix PyUnicode_FromEncodedObject() to also accept tp_str compatible objects. This API is used by the unicode_new() constructor (which is exposed as unicode() in Python) to create a Unicode object from the input object. str() OTOH uses PyObject_Str() via string_new().

Now there also is a PyObject_Unicode() API which tries to mimic PyObject_Str(). However, it does not support the additional encoding and errors arguments which the unicode() constructor has.

The problem which Guido raised about my checkins was that the changes to PyUnicode_FromEncodedObject() are seen not only in unicode(), but also in all other places where this API is used. OTOH, PyUnicode_FromEncodedObject() is the most generic constructor for Unicode objects there currently is in Python.

So the questions are:
- should I revert the change in PyUnicode_FromEncodedObject() and instead extend PyObject_Unicode() to support encodings ?
- should we make PyObject_Unicode() use PyUnicode_FromEncodedObject() instead of providing its own implementation ?

The overall picture of all this auto-conversion stuff going on in str() and unicode() is very confusing. Perhaps what we really need is first to agree on a common understanding of which auto-conversion should take place and then make str() and unicode() support exactly the same interface ?!

PS: Also see patch #446754 by Walter Dörwald: http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/
I'd like to query for the common opinion on an issue which I've run into when trying to resynchronize unicode() and str() in terms of what happens when you pass arbitrary objects to these constructors which happen to implement tp_str (or __str__ for instances).
Currently, str() will accept any object which supports the tp_str interface and revert to tp_repr in case that slot should not be available.
unicode() supported strings, character buffers and instances having a __str__ method before yesterday's checkins.
Now the goal of the checkins was to make str() and unicode() behave in a more compatible fashion. Both should accept the same kinds of objects and raise exceptions for all others.
Well, historically, str() has rarely raised exceptions, because there's a default implementation (same as for repr(), returning <FOO object at ADDRESS>). This is used when neither tp_repr nor tp_str is set. Note that PyObject_Str() never looks at __str__ -- this is done by the tp_str handler of instances (and now also by the tp_str handler of new-style classes). I see no reason to change this. The question then becomes, do we want unicode() to behave similarly?
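In present-day Python (where all classes are new-style) this fallback chain and the type-level lookup can be observed directly; `NoStr` and `OnlyRepr` are illustrative names, not anything from the thread:

```python
class NoStr:
    """Neither __str__ nor __repr__: str() uses the default
    '<... object at 0x...>' representation."""

class OnlyRepr:
    def __repr__(self):
        return "OnlyRepr()"

assert str(NoStr()).startswith("<")      # default repr-style fallback
assert str(OnlyRepr()) == "OnlyRepr()"   # tp_str falls back to tp_repr

# The lookup happens on the type, not the instance:
o = OnlyRepr()
o.__str__ = lambda: "from instance"      # ignored by str()
assert str(o) == "OnlyRepr()"
```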
The path I chose was to fix PyUnicode_FromEncodedObject() to also accept tp_str compatible objects. This API is used by the unicode_new() constructor (which is exposed as unicode() in Python) to create a Unicode object from the input object.
str() OTOH uses PyObject_Str() via string_new().
Now there also is a PyObject_Unicode() API which tries to mimic PyObject_Str(). However, it does not support the additional encoding and errors arguments which the unicode() constructor has.
The problem which Guido raised about my checkins was that the changes to PyUnicode_FromEncodedObject() are seen not only in unicode(), but also in all other places where this API is used.
OTOH, PyUnicode_FromEncodedObject() is the most generic constructor for Unicode objects there currently is in Python.
So the questions are:
- should I revert the change in PyUnicode_FromEncodedObject() and instead extend PyObject_Unicode() to support encodings ?
- should we make PyObject_Unicode() use PyUnicode_FromEncodedObject() instead of providing its own implementation ?
The overall picture of all this auto-conversion stuff going on in str() and unicode() is very confusing. Perhaps what we really need is first to agree on a common understanding of which auto-conversion should take place and then make str() and unicode() support exactly the same interface ?!
PS: Also see patch #446754 by Walter Dörwald: http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470
OK, let's take a step back.

The str() function (now constructor) converts *anything* to a string; tp_str and tp_repr exist to allow objects to customize this. These slots, and the str() function, take no additional arguments. To invoke the equivalent of str() from C, you call PyObject_Str(). I see no reason to change this; we may want to make the Unicode situation as similar as possible.

The unicode() function (now constructor) traditionally converted only 8-bit strings to Unicode strings, with additional arguments to specify the encoding (and error handling preference). There is no tp_unicode slot, but for some reason there are at least three C APIs that could correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject() take a single object argument, and PyUnicode_FromEncodedObject() takes object, encoding, and error arguments.

The first question is, do we want the unicode() constructor to be applicable in all cases where the str() constructor is? I guess that we do, since we want to be able to print to streams that support Unicode. Unicode strings render themselves as Unicode characters to such a stream, and it's reasonable to allow other objects to also customize their rendition in Unicode.

Now, what should be the signature of this conversion? If we print object X to a Unicode stream, should we invoke unicode(X), or unicode(X, encoding, error)? I believe it should be just unicode(X), since the encoding used by the stream shouldn't enter into the picture here: that's just used for converting Unicode characters written to the stream to some external format.

How should an object be allowed to customize its Unicode rendition? We could add a tp_unicode slot to the type object, but there's no need: we can just look for a __unicode__ method and call it if it exists. The signature of __unicode__ should take no further arguments: unicode(X) should call X.__unicode__(). As a fallback, if the object doesn't have a __unicode__ attribute, PyObject_Str() should be called and the resulting string converted to Unicode using the default encoding.

Regarding the "long form" of unicode(), unicode(X, encoding, error), I see no reason to treat this with the same generality. This form should restrict X to something that supports the buffer API (IOW, 8-bit string objects and things that are treated the same as these in most situations). (Note that it already balks when X is a Unicode string.)

So about those C APIs: I propose that PyObject_Unicode() correspond to the one-arg form of unicode(), taking any kind of object, and that PyUnicode_FromEncodedObject() correspond to the three-arg form. PyUnicode_FromObject() shouldn't really need to exist.

I don't see a reason for PyUnicode_From[Encoded]Object() to use the __unicode__ customization -- it should just take the bytes provided by the object and decode them according to the given encoding. PyObject_Unicode(), on the other hand, should look for __unicode__ first and then PyObject_Str().

I hope this helps.

--Guido van Rossum (home page: http://www.python.org/~guido/)
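The lookup order proposed here can be sketched with a small hypothetical helper; `to_unicode` is not a real API, and plain `str()` stands in for PyObject_Str() plus default decoding:

```python
def to_unicode(obj):
    """Hypothetical one-arg unicode(X): __unicode__ first, then the
    str() machinery as fallback."""
    f = getattr(type(obj), "__unicode__", None)
    if f is not None:
        return f(obj)
    return str(obj)   # stands in for PyObject_Str() + default decoding

class Greeting:
    def __unicode__(self):
        return "gr\u00fc\u00df"   # 'grüß'
    def __str__(self):
        return "gruss"

assert to_unicode(Greeting()) == "gr\u00fc\u00df"  # __unicode__ wins
assert to_unicode(42) == "42"                      # fallback to str()
```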
Guido van Rossum wrote:
I'd like to query for the common opinion on an issue which I've run into when trying to resynchronize unicode() and str() in terms of what happens when you pass arbitrary objects to these constructors which happen to implement tp_str (or __str__ for instances).
Currently, str() will accept any object which supports the tp_str interface and revert to tp_repr in case that slot should not be available.
unicode() supported strings, character buffers and instances having a __str__ method before yesterday's checkins.
Now the goal of the checkins was to make str() and unicode() behave in a more compatible fashion. Both should accept the same kinds of objects and raise exceptions for all others.
Well, historically, str() has rarely raised exceptions, because there's a default implementation (same as for repr(), returning <FOO object at ADDRESS>). This is used when neither tp_repr nor tp_str is set. Note that PyObject_Str() never looks at __str__ -- this is done by the tp_str handler of instances (and now also by the tp_str handler of new-style classes). I see no reason to change this.
Me neither; what str() does not do (and unicode() does) is try the char buffer interface before trying tp_str.
The question then becomes, do we want unicode() to behave similarly?
Given that porting an application from strings to Unicode should be easy, I'd say: yes.
The path I chose was to fix PyUnicode_FromEncodedObject() to also accept tp_str compatible objects. This API is used by the unicode_new() constructor (which is exposed as unicode() in Python) to create a Unicode object from the input object.
str() OTOH uses PyObject_Str() via string_new().
Now there also is a PyObject_Unicode() API which tries to mimic PyObject_Str(). However, it does not support the additional encoding and errors arguments which the unicode() constructor has.
The problem which Guido raised about my checkins was that the changes to PyUnicode_FromEncodedObject() are seen not only in unicode(), but also in all other places where this API is used.
OTOH, PyUnicode_FromEncodedObject() is the most generic constructor for Unicode objects there currently is in Python.
So the questions are:
- should I revert the change in PyUnicode_FromEncodedObject() and instead extend PyObject_Unicode() to support encodings ?
- should we make PyObject_Unicode() use PyUnicode_FromEncodedObject() instead of providing its own implementation ?
The overall picture of all this auto-conversion stuff going on in str() and unicode() is very confusing. Perhaps what we really need is first to agree on a common understanding of which auto-conversion should take place and then make str() and unicode() support exactly the same interface ?!
PS: Also see patch #446754 by Walter Dörwald: http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470
OK, let's take a step back.
The str() function (now constructor) converts *anything* to a string; tp_str and tp_repr exist to allow objects to customize this. These slots, and the str() function, take no additional arguments. To invoke the equivalent of str() from C, you call PyObject_Str(). I see no reason to change this; we may want to make the Unicode situation as similar as possible.
Right.
The unicode() function (now constructor) traditionally converted only 8-bit strings to Unicode strings,
Slightly incorrect: it converted 8-bit strings, objects compatible with the char buffer interface and instances having a __str__ method to Unicode.

To synchronize unicode() with str() we'd have to replace the __str__ lookup with a tp_str lookup (this will also allow things like unicode(2) and unicode(instance_having__str__)) and maybe also add the charbuf lookup to str() (this would make str() compatible with memory mapped files and probably a few other char buffer aware objects as well).

Note that in a discussion we had some time ago we decided that __str__ should be allowed to return Unicode objects as well (instead of defining a separate __unicode__ method/slot for this purpose). str() will convert a Unicode return value to an 8-bit string using the default encoding while unicode() takes the return value as-is. This was done to simplify moving from strings to Unicode.
with additional arguments to specify the encoding (and error handling preference). There is no tp_unicode slot, but for some reason there are at least three C APIs that could correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject() take a single object argument, and PyUnicode_FromEncodedObject() takes object, encoding, and error arguments.
The first question is, do we want the unicode() constructor to be applicable in all cases where the str() constructor is?
Yes.
I guess that we do, since we want to be able to print to streams that support Unicode. Unicode strings render themselves as Unicode characters to such a stream, and it's reasonable to allow other objects to also customize their rendition in Unicode.
Now, what should be the signature of this conversion? If we print object X to a Unicode stream, should we invoke unicode(X), or unicode(X, encoding, error)? I believe it should be just unicode(X), since the encoding used by the stream shouldn't enter into the picture here: that's just used for converting Unicode characters written to the stream to some external format.
How should an object be allowed to customize its Unicode rendition? We could add a tp_unicode slot to the type object, but there's no need: we can just look for a __unicode__ method and call it if it exists. The signature of __unicode__ should take no further arguments: unicode(X) should call X.__unicode__(). As a fallback, if the object doesn't have a __unicode__ attribute, PyObject_Str() should be called and the resulting string converted to Unicode using the default encoding.
I'd rather leave things as they are: __str__/tp_str are allowed to return Unicode objects and if an object wishes to be rendered as Unicode it can simply return a Unicode object through the __str__/tp_str interface.
Regarding the "long form" of unicode(), unicode(X, encoding, error), I see no reason to treat this with the same generality. This form should restrict X to something that supports the buffer API (IOW, 8-bit string objects and things that are treated the same as these in most situations).
Hmm, but this would restrict users from implementing string like objects (i.e. objects having the __str__ method to make them compatible with str()).
(Note that it already balks when X is a Unicode string.)
True -- since you normally cannot decode Unicode into Unicode using some 8-bit character encoding. As a result encodings which convert Unicode to Unicode (e.g. normalizations) cannot use this interface, but since these are probably only rarely used, I think it's better to prevent accidental usage of an 8-bit character codec on Unicode.
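Unicode-to-Unicode transformations of the kind mentioned here (normalizations) are available through the `unicodedata` module; a minimal example:

```python
import unicodedata

decomposed = "e\u0301"                        # 'e' + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)

assert composed == "\u00e9"                   # single code point 'é'
assert len(decomposed) == 2 and len(composed) == 1
```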
So about those C APIs: I propose that PyObject_Unicode() correspond to the one-arg form of unicode(), taking any kind of object, and that PyUnicode_FromEncodedObject() correspond to the three-arg form.
Ok. I'll fix this once we've reached consensus on what to do about str() and unicode().
PyUnicode_FromObject() shouldn't really need to exist.
Note: PyUnicode_FromObject() was extended by PyUnicode_FromEncodedObject() and only exists for backward compatibility reasons.
I don't see a reason for PyUnicode_From[Encoded]Object() to use the __unicode__ customization -- it should just take the bytes provided by the object and decode them according to the given encoding. PyObject_Unicode(), on the other hand, should look for __unicode__ first and then PyObject_Str().
I hope this helps.
Thanks for the summary.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/
Well, historically, str() has rarely raised exceptions, because there's a default implementation (same as for repr(), returning <FOO object at ADDRESS>). This is used when neither tp_repr nor tp_str is set. Note that PyObject_Str() never looks at __str__ -- this is done by the tp_str handler of instances (and now also by the tp_str handler of new-style classes). I see no reason to change this.
Me neither; what str() does not do (and unicode() does) is try the char buffer interface before trying tp_str.
The meanings of these two are different: tp_str means "give me a string that's useful for printing"; the buffer API means "let me treat you as a sequence of 8-bit bytes (or 8-bit characters)". They are different e.g. when you consider a PIL image, whose str() probably returns something like '<PIL image WxHxD>' while its buffer API probably gives access to the raw image buffer.

The str() function should map directly to tp_str(). You *might* claim that the 8-bit string type constructor *ought to* look at the buffer API, but I'd say that it's easy enough for a type to provide a tp_str implementation that does what the type wants. I guess "convert yourself to string" is different than "display yourself as a string".
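The display-vs-raw-data distinction can be sketched in modern Python, where `__bytes__` plays roughly the role the char buffer interface plays here (the `Image` class is a made-up stand-in for the PIL example):

```python
class Image:
    """Made-up stand-in for the PIL example."""
    def __init__(self, w, h, data):
        self.w, self.h, self.data = w, h, data
    def __str__(self):            # "display yourself as a string"
        return "<Image %dx%d>" % (self.w, self.h)
    def __bytes__(self):          # "treat me as a sequence of bytes"
        return self.data

img = Image(2, 2, b"\x00\x01\x02\x03")
assert str(img) == "<Image 2x2>"             # printable description
assert bytes(img) == b"\x00\x01\x02\x03"     # raw image buffer
```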
The question then becomes, do we want unicode() to behave similarly?
Given that porting an application from strings to Unicode should be easy, I'd say: yes.
Fearing this ends up being a trick question, I'll say +0. If we end up with something I don't like, I reserve the right to change my opinion on this.
The str() function (now constructor) converts *anything* to a string; tp_str and tp_repr exist to allow objects to customize this. These slots, and the str() function, take no additional arguments. To invoke the equivalent of str() from C, you call PyObject_Str(). I see no reason to change this; we may want to make the Unicode situation as similar as possible.
Right.
The unicode() function (now constructor) traditionally converted only 8-bit strings to Unicode strings,
Slightly incorrect: it converted 8-bit strings, objects compatible with the char buffer interface and instances having a __str__ method to Unicode.
That's a rather random collection of APIs, if you ask me... Also, do you really mean *instances* (i.e. objects for which PyInstance_Check() returns true), or do you mean anything for which getattr(x, "__str__") is true? If the latter, you're in for a surprise in 2.2 -- almost all built-in objects now respond to that method, due to the type/class unification: whenever something has a tp_str slot, a __str__ attribute is synthesized (and vice versa). (Exceptions are a few obscure types and maybe 3rd party extension types.)
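This synthesis is easy to observe in any Python with unified types and classes:

```python
# Every built-in type with a tp_str slot exposes a __str__ attribute:
assert hasattr(2, "__str__")
assert hasattr([], "__str__")
assert hasattr(object(), "__str__")

# And the synthesized method is callable like any other:
assert (2).__str__() == "2"
```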
To synchronize unicode() with str() we'd have to replace the __str__ lookup with a tp_str lookup (this will also allow things like unicode(2) and unicode(instance_having__str__)) and maybe also add the charbuf lookup to str() (this would make str() compatible with memory mapped files and probably a few other char buffer aware objects as well).
I definitely don't want the latter change to str(); see above. If you want unicode(x) to behave as much as str(x) as possible, I recommend removing using the buffer API.
Note that in a discussion we had some time ago we decided that __str__ should be allowed to return Unicode objects as well (instead of defining a separate __unicode__ method/slot for this purpose). str() will convert a Unicode return value to an 8-bit string using the default encoding while unicode() takes the return value as-is.
This was done to simplify moving from strings to Unicode.
I'm now not so sure if this was the right decision.
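For what it's worth, this decision was eventually reversed: in Python 3, __str__ must return an actual string, and returning anything else raises TypeError:

```python
class BadStr:
    def __str__(self):
        return b"not text"   # bytes, not str

try:
    str(BadStr())
    raised = False
except TypeError:            # "__str__ returned non-string"
    raised = True

assert raised
```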
with additional arguments to specify the encoding (and error handling preference). There is no tp_unicode slot, but for some reason there are at least three C APIs that could correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject() take a single object argument, and PyUnicode_FromEncodedObject() takes object, encoding, and error arguments.
The first question is, do we want the unicode() constructor to be applicable in all cases where the str() constructor is?
Yes.
I guess that we do, since we want to be able to print to streams that support Unicode. Unicode strings render themselves as Unicode characters to such a stream, and it's reasonable to allow other objects to also customize their rendition in Unicode.
Now, what should be the signature of this conversion? If we print object X to a Unicode stream, should we invoke unicode(X), or unicode(X, encoding, error)? I believe it should be just unicode(X), since the encoding used by the stream shouldn't enter into the picture here: that's just used for converting Unicode characters written to the stream to some external format.
How should an object be allowed to customize its Unicode rendition? We could add a tp_unicode slot to the type object, but there's no need: we can just look for a __unicode__ method and call it if it exists. The signature of __unicode__ should take no further arguments: unicode(X) should call X.__unicode__(). As a fallback, if the object doesn't have a __unicode__ attribute, PyObject_Str() should be called and the resulting string converted to Unicode using the default encoding.
I'd rather leave things as they are: __str__/tp_str are allowed to return Unicode objects and if an object wishes to be rendered as Unicode it can simply return a Unicode object through the __str__/tp_str interface.
Can you explain your motivation? In the long run, it seems better to me to think of __str__ as "render as 8-bit string" and __unicode__ as "render as Unicode string".
Regarding the "long form" of unicode(), unicode(X, encoding, error), I see no reason to treat this with the same generality. This form should restrict X to something that supports the buffer API (IOW, 8-bit string objects and things that are treated the same as these in most situations).
Hmm, but this would restrict users from implementing string like objects (i.e. objects having the __str__ method to make them compatible with str()).
Having __str__ doesn't make something a string-like object! A string-like object (at least the way I understand this term) would behave like a string, e.g. have string methods. The UserString module is an example, and in 2.2 subclasses of the 'str' type are prime examples. To convert one of these to Unicode given an encoding, shouldn't their decode() method be used?
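A string-like object in this sense, sketched with a `str` subclass (`Tag` is a made-up name), inherits the full string interface rather than merely defining __str__:

```python
class Tag(str):
    """A made-up str subclass: a true string-like object with all
    string methods, as opposed to something merely defining __str__."""

t = Tag("hello")
assert t.upper() == "HELLO"     # inherits string methods
assert isinstance(t, str)

# In Python 3 terms, converting encoded data to Unicode goes through
# the bytes side's decode() method:
assert bytes(t, "ascii").decode("ascii") == "hello"
```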
(Note that it already balks when X is a Unicode string.)
True -- since you normally cannot decode Unicode into Unicode using some 8-bit character encoding. As a result encodings which convert Unicode to Unicode (e.g. normalizations) cannot use this interface, but since these are probably only rarely used, I think it's better to prevent accidental usage of an 8-bit character codec on Unicode.
Sigh. More special cases. Unicode objects do have a tp_str/__str__ slot, but they are not acceptable to unicode(). Really, this is such an incredible morass of APIs that I wonder if we shouldn't start over... There are altogether too many places in the code where PyUnicode_Check() is used. I wish there was a better way...
So about those C APIs: I propose that PyObject_Unicode() correspond to the one-arg form of unicode(), taking any kind of object, and that PyUnicode_FromEncodedObject() correspond to the three-arg form.
Ok. I'll fix this once we've reached consensus on what to do about str() and unicode().
Alas, this is harder than we seem to have thought, collectively. I want someone to sit back and rethink how this should eventually work (say in Python 2.9), and then work backwards from there to a reasonable API to be used in 2.2. The current piling of hack upon hack seems hopeless.

We have some time: 2.2a4 will be released this week, but 2.2b1 isn't due until Oct 10, and we can even slip that a bit. Compatibility with previous 2.2 alpha releases is not necessary; the hard compatibility baseline is 2.1.1.
PyUnicode_FromObject() shouldn't really need to exist.
Note: PyUnicode_FromObject() was extended by PyUnicode_FromEncodedObject() and only exists for backward compatibility reasons.
Excellent.
I don't see a reason for PyUnicode_From[Encoded]Object() to use the __unicode__ customization -- it should just take the bytes provided by the object and decode them according to the given encoding. PyObject_Unicode(), on the other hand, should look for __unicode__ first and then PyObject_Str().
I hope this helps.
Thanks for the summary.
Alas, we're not done. :-( I don't have much time for this -- there still are important pieces of the type/class unification missing (e.g. comparisons and pickling don't work right, and I must be able to make __dynamic__ the default).

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Well, historically, str() has rarely raised exceptions, because there's a default implementation (same as for repr(), returning <FOO object at ADDRESS>). This is used when neither tp_repr nor tp_str is set. Note that PyObject_Str() never looks at __str__ -- this is done by the tp_str handler of instances (and now also by the tp_str handler of new-style classes). I see no reason to change this.
Me neither; what str() does not do (and unicode() does) is try the char buffer interface before trying tp_str.
The meanings of these two are different: tp_str means "give me a string that's useful for printing"; the buffer API means "let me treat you as a sequence of 8-bit bytes (or 8-bit characters)". They are different e.g. when you consider a PIL image, whose str() probably returns something like '<PIL image WxHxD>' while its buffer API probably gives access to the raw image buffer.
The str() function should map directly to tp_str(). You *might* claim that the 8-bit string type constructor *ought to* look at the buffer API, but I'd say that it's easy enough for a type to provide a tp_str implementation that does what the type wants. I guess "convert yourself to string" is different than "display yourself as a string".
Sure is :-) Ok, so let's remove the buffer API check from the list of str()/unicode() conversion checks.
The question then becomes, do we want unicode() to behave similarly?
Given that porting an application from strings to Unicode should be easy, I'd say: yes.
Fearing this ends up being a trick question, I'll say +0. If we end up with something I don't like, I reserve the right to change my opinion on this.
Ok.
The str() function (now constructor) converts *anything* to a string; tp_str and tp_repr exist to allow objects to customize this. These slots, and the str() function, take no additional arguments. To invoke the equivalent of str() from C, you call PyObject_Str(). I see no reason to change this; we may want to make the Unicode situation as similar as possible.
Right.
The unicode() function (now constructor) traditionally converted only 8-bit strings to Unicode strings,
Slightly incorrect: it converted 8-bit strings, objects compatible to the char buffer interface and instances having a __str__ method to Unicode.
That's a rather random collection of APIs, if you ask me...
It was modelled after the PyObject_Str() API at the time. Don't know how the buffer interface ended up in there, but I guess it was a left-over from early revisions in the design.
Also, do you really mean *instances* (i.e. objects for which PyInstance_Check() returns true), or do you mean anything for which getattr(x, "__str__") is true?
Looking at the code from Python 2.1:

    if (!PyInstance_Check(v) ||
        (func = PyObject_GetAttr(v, strstr)) == NULL) {
        PyErr_Clear();
        res = PyObject_Repr(v);
    }
    else {
        res = PyEval_CallObject(func, (PyObject *)NULL);
        Py_DECREF(func);
    }

... instances which have the __str__ attribute.
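Ignoring the PyInstance_Check() guard, the logic above transcribes roughly to this hypothetical Python helper (`py21_str` is not a real API):

```python
def py21_str(v):
    """Rough transcription of the 2.1 C code above."""
    func = getattr(v, "__str__", None)
    if func is None:            # corresponds to the PyErr_Clear() branch
        return repr(v)
    return func()               # PyEval_CallObject(func, NULL)

assert py21_str(2) == "2"
assert py21_str([1]) == "[1]"
```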
If the latter, you're in for a surprise in 2.2 -- almost all built-in objects now respond to that method, due to the type/class unification: whenever something has a tp_str slot, a __str__ attribute is synthesized (and vice versa). (Exceptions are a few obscure types and maybe 3rd party extension types.)
Nice :-)
To synchronize unicode() with str() we'd have to replace the __str__ lookup with a tp_str lookup (this will also allow things like unicode(2) and unicode(instance_having__str__)) and maybe also add the charbuf lookup to str() (this would make str() compatible with memory mapped files and probably a few other char buffer aware objects as well).
I definitely don't want the latter change to str(); see above. If you want unicode(x) to behave as much as str(x) as possible, I recommend removing using the buffer API.
Ok, let's remove the buffer API from unicode(). Should it still be maintained for unicode(obj, encoding, errors) ?
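For reference, the split Python 3 eventually adopted matches this suggestion: the one-arg form never touches the buffer API, while the (object, encoding, errors) form accepts any bytes-like object:

```python
# One-arg form: goes through __str__/__repr__, never the buffer API.
assert str(b"abc") == "b'abc'"

# Encoded form: accepts any bytes-like object and decodes it.
assert str(b"abc", "ascii") == "abc"
assert str(bytearray(b"abc"), "ascii") == "abc"
assert str(memoryview(b"abc"), "ascii") == "abc"
```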
Note that in a discussion we had some time ago we decided that __str__ should be allowed to return Unicode objects as well (instead of defining a separate __unicode__ method/slot for this purpose). str() will convert a Unicode return value to an 8-bit string using the default encoding while unicode() takes the return value as-is.
This was done to simplify moving from strings to Unicode.
I'm now not so sure if this was the right decision.
Hmm, perhaps we do need a __unicode__/tp_unicode slot after all. It would certainly help clarify the communication between the interpreter and the object.
with additional arguments to specify the encoding (and error handling preference). There is no tp_unicode slot, but for some reason there are at least three C APIs that could correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject() take a single object argument, and PyUnicode_FromEncodedObject() takes object, encoding, and error arguments.
The first question is, do we want the unicode() constructor to be applicable in all cases where the str() constructor is?
Yes.
I guess that we do, since we want to be able to print to streams that support Unicode. Unicode strings render themselves as Unicode characters to such a stream, and it's reasonable to allow other objects to also customize their rendition in Unicode.
Now, what should be the signature of this conversion? If we print object X to a Unicode stream, should we invoke unicode(X), or unicode(X, encoding, error)? I believe it should be just unicode(X), since the encoding used by the stream shouldn't enter into the picture here: that's just used for converting Unicode characters written to the stream to some external format.
How should an object be allowed to customize its Unicode rendition? We could add a tp_unicode slot to the type object, but there's no need: we can just look for a __unicode__ method and call it if it exists. The signature of __unicode__ should take no further arguments: unicode(X) should call X.__unicode__(). As a fallback, if the object doesn't have a __unicode__ attribute, PyObject_Str() should be called and the resulting string converted to Unicode using the default encoding.
I'd rather leave things as they are: __str__/tp_str are allowed to return Unicode objects and if an object wishes to be rendered as Unicode it can simply return a Unicode object through the __str__/tp_str interface.
Can you explain your motivation? In the long run, it seems better to me to think of __str__ as "render as 8-bit string" and __unicode__ as "render as Unicode string".
The motivation was the idea of a unification of strings and Unicode. You may be right, though, that this idea is not really practical.
Regarding the "long form" of unicode(), unicode(X, encoding, error), I see no reason to treat this with the same generality. This form should restrict X to something that supports the buffer API (IOW, 8-bit string objects and things that are treated the same as these in most situations).
Hmm, but this would restrict users from implementing string like objects (i.e. objects having the __str__ method to make them compatible with str()).
Having __str__ doesn't make something a string-like object! A string-like object (at least the way I understand this term) would behave like a string, e.g. have string methods. The UserString module is an example, and in 2.2 subclasses of the 'str' type are prime examples.
To convert one of these to Unicode given an encoding, shouldn't their decode() method be used?
Right... perhaps we don't need __unicode__ after all: the .decode() method already provides this functionality (on strings at least).
(Note that it already balks when X is a Unicode string.)
True -- since you normally cannot decode Unicode into Unicode using some 8-bit character encoding. As a result encodings which convert Unicode to Unicode (e.g. normalizations) cannot use this interface, but since these are probably only rarely used, I think it's better to prevent accidental usage of an 8-bit character codec on Unicode.
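The asymmetry discussed here can be illustrated with the modern bytes/str split (Python 3 shown purely for illustration): byte strings have a decode() method, while Unicode strings deliberately do not.

```python
data = b"caf\xc3\xa9"            # UTF-8 encoded bytes
text = data.decode("utf-8")      # bytes -> Unicode
assert text == "caf\xe9"         # u'café'

# Decoding a Unicode string with an 8-bit codec makes little sense,
# which is why Python 3 eventually removed str.decode() altogether.
assert hasattr(bytes, "decode")
assert not hasattr(str, "decode")
```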
Sigh. More special cases. Unicode objects do have a tp_str/__str__ slot, but they are not acceptable to unicode().
Really, this is such an incredible morass of APIs that I wonder if we shouldn't start over... There are altogether too many places in the code where PyUnicode_Check() is used. I wish there was a better way...
Ideally, we'd need a new base class for strings and then have 8-bit and Unicode strings be subclasses of this base class. There are several problems with this approach, though; one is certainly the different memory allocation mechanisms used (8-bit strings store the value inside the object, while Unicode objects reference an external buffer); another is their different nature: 8-bit strings don't carry meta-information, while Unicode is in many ways restricted in use.
So about those C APIs: I propose that PyObject_Unicode() correspond to the one-arg form of unicode(), taking any kind of object, and that PyUnicode_FromEncodedObject() correspond to the three-arg form.
Ok. I'll fix this once we've reached consensus on what to do about str() and unicode().
Alas, this is harder than we seem to have thought, collectively. I want someone to sit back and rethink how this should eventually work (say in Python 2.9), and then work backwards from there to a reasonable API to be used in 2.2. The current piling of hack upon hack seems hopeless.
Agreed.
We have some time: 2.2a4 will be released this week, but 2.2b1 isn't due until Oct 10, and we can even slip that a bit. Compatibility with previous 2.2 alpha releases is not necessary; the hard compatibility baseline is 2.1.1.
PyUnicode_FromObject() shouldn't really need to exist.
Note: PyUnicode_FromObject() was generalized into PyUnicode_FromEncodedObject() and now only exists for backward compatibility reasons.
Excellent.
I would like to boil this down to one API if possible, which then implements both unicode(obj) and unicode(obj, encoding, errors): if no encoding is given, the semantics of PyObject_Str() are closely followed; if an encoding is given, the semantics of PyUnicode_FromEncodedObject() as it was before the checkins are used (with the buffer interface logic removed). In a first step, I'd use tp_str/__str__ for unicode(obj) as well. Later we can add a tp_unicode/__unicode__ lookup before falling back to tp_str/__str__. If this sounds reasonable, I'll give it a go...
I don't see a reason for PyUnicode_From[Encoded]Object() to use the __unicode__ customization -- it should just take the bytes provided by the object and decode them according to the given encoding. PyObject_Unicode(), on the other hand, should look for __unicode__ first and then PyObject_Str().
I hope this helps.
Thanks for the summary.
Alas, we're not done. :-(
I don't have much time for this -- there still are important pieces of the type/class unification missing (e.g. comparisons and pickling don't work right, and I must be able to make __dynamic__ the default).
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Consulting & Company: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/
Ok, let's remove the buffer API from unicode(). Should it still be maintained for unicode(obj, encoding, errors) ?
I think so yes.
Hmm, perhaps we do need a __unicode__/tp_unicode slot after all. It would certainly help clarify the communication between the interpreter and the object.
Would you settle for a __unicode__ method but no tp_unicode slot? It's easy enough to define a C method named __unicode__ if the need arises. This should always be tried first, not just for classic instances. Adding a slot is a bit painful now that there are so many new slots already (adding it to the end means you have to add tons of zeros, adding it to the middle means I have to edit every file).
To convert one of these to Unicode given an encoding, shouldn't their decode() method be used?
Right... perhaps we don't need __unicode__ after all: the .decode() method already provides this functionality (on strings at least).
So maybe we should deprecate unicode(obj, encoding[, error]) and recommend obj.decode(encoding[, error]) instead. But this means that objects with a buffer API but no decode() method cannot efficiently be decoded. That's what unicode(obj, encoding[, error]) was good for. To decide, we need to know how useful it is in practice to be able to decode buffers -- I doubt it is very useful, since most types supporting the buffer API are not text but raw data like memory-mapped files, arrays, PIL images.
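For what it's worth, the buffer-decoding case can be demonstrated with Python 3's three-argument str(), the descendant of unicode(obj, encoding[, errors]): it accepts any object exposing the buffer API, even one without a decode() method.

```python
import array

buf = array.array("b", b"hello")   # exposes the buffer API, has no decode()
assert not hasattr(buf, "decode")

# The three-argument constructor reads the raw bytes directly
# through the buffer interface and decodes them.
text = str(buf, "ascii")
assert text == "hello"
```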
Really, this is such an incredible morass of APIs that I wonder if we shouldn't start over... There are altogether too many places in the code where PyUnicode_Check() is used. I wish there was a better way...
Ideally, we'd need a new base class for strings and then have 8-bit and Unicode strings be subclasses of this base class. There are several problems with this approach, though; one is certainly the different memory allocation mechanisms used (8-bit strings store the value inside the object, while Unicode objects reference an external buffer); another is their different nature: 8-bit strings don't carry meta-information, while Unicode is in many ways restricted in use.
I've thought of defining an abstract base class "string" from which both str and unicode derive. Since str and unicode don't share representation, they shouldn't share implementation, but they could still share interface. Certainly conceptually this is how we think of strings. Useless thought: the string class would have unbound methods that are almost the same as the functions defined in the string module, e.g. string.split(s) and string.strip(s) could be made to call s.split() and s.strip(), just like the module. The class could have data attributes for string.whitespace etc. But string.join() would have a different signature: the class method is join(s, list) while the function is join(list, s). So we can't quite make the module an alias for the class. :-(
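A minimal sketch of such an abstract "string" base class, using today's abc machinery (the class name is invented for the sketch): shared interface, entirely separate representations.

```python
from abc import ABC

class BaseString(ABC):
    """Abstract base: 8-bit and Unicode strings share interface only,
    never representation or implementation."""

# Register the concrete string types as virtual subclasses; they keep
# their own storage and methods but answer to isinstance() checks.
BaseString.register(str)
BaseString.register(bytes)

assert isinstance("abc", BaseString)
assert isinstance(b"abc", BaseString)
assert not isinstance(42, BaseString)
```

Virtual registration sidesteps the memory-layout problem mentioned above, since neither concrete type actually inherits any fields from the base.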
I would like to boil this down to one API if possible, which then implements both unicode(obj) and unicode(obj, encoding, errors): if no encoding is given, the semantics of PyObject_Str() are closely followed; if an encoding is given, the semantics of PyUnicode_FromEncodedObject() as it was before the checkins are used (with the buffer interface logic removed).
I would actually recommend using two different C level APIs: PyObject_Unicode() to implement unicode(obj), which should follow str(obj), and PyUnicode_FromEncodedObject() to implement unicode(obj, decoding[, error]), which should use the buffer API on obj.
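This two-API split can be modelled as a single Python-level dispatcher (a toy with a hypothetical name, not the C implementation):

```python
def unicode_like(obj, encoding=None, errors="strict"):
    if encoding is None:
        # PyObject_Unicode() semantics: follow str(obj), honouring
        # a __unicode__ hook if the type defines one.
        meth = getattr(type(obj), "__unicode__", None)
        return meth(obj) if meth is not None else str(obj)
    # PyUnicode_FromEncodedObject() semantics: read the raw bytes
    # through the buffer API and decode them.
    return bytes(memoryview(obj)).decode(encoding, errors)
```

For example, `unicode_like(3.5)` follows the str() path, while `unicode_like(b"abc", "ascii")` takes the buffer-and-decode path.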
In a first step, I'd use the tp_str/__str__ for unicode(obj) as well. Later we can add a tp_unicode/__unicode__ lookup before trying tp_str/__str__ as fallback.
I would add __unicode__ support without tp_unicode right away. I would use tp_str without even looking at __str__.
If this sounds reasonable, I'll give it a go...
Yes. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Ok, let's remove the buffer API from unicode(). Should it still be maintained for unicode(obj, encoding, errors) ?
I think so yes.
Hmm, perhaps we do need a __unicode__/tp_unicode slot after all. It would certainly help clarify the communication between the interpreter and the object.
Would you settle for a __unicode__ method but no tp_unicode slot? It's easy enough to define a C method named __unicode__ if the need arises. This should always be tried first, not just for classic instances. Adding a slot is a bit painful now that there are so many new slots already (adding it to the end means you have to add tons of zeros, adding it to the middle means I have to edit every file).
Hmm, what about a type object initialisation function that takes "named arguments" via varargs:

    PyType_Initialize(&PyUnicode_Type,
        TYPE_TYPE, &PyType_Type,
        TYPE_NAME, "unicode",
        SLOT_DESTRUCTOR, _PyUnicode_Free,
        SLOT_CMP, unicode_compare,
        SLOT_REPR, unicode_repr,
        SLOT_SEQ, unicode_as_sequence,
        SLOT_HASH, unicode_hash,
        DONE
    );

The SLOT_xxx arguments would be #defines like this:

    #define DONE            0
    #define TYPE_TYPE       1
    #define TYPE_NAME       2
    #define SLOT_DESTRUCTOR 3
    #define SLOT_CMP        4

Adding a new slot would require much less work: define a new slot *somewhere* in the struct, define a new SLOT_xxx, and add SLOT_xxx, foo_xxx to the call to the initializer for every type that implements this slot. Performance shouldn't be a problem, because this function would only be called once for every type. And we could get rid of the problem with static initialization of ob_type with some compilers.
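The mechanism is easy to model outside of C; here's a small Python analogue (constant values and names invented for the sketch) showing how key/value pairs terminated by DONE decouple call sites from the slot layout:

```python
# Hypothetical slot keys, mirroring the proposed #defines.
DONE, TYPE_NAME, SLOT_REPR, SLOT_HASH = 0, 1, 2, 3
_FIELDS = {TYPE_NAME: "name", SLOT_REPR: "repr", SLOT_HASH: "hash"}

def type_initialize(*args):
    """Consume (key, value) pairs until DONE; unspecified slots keep
    their defaults, so adding a new slot never breaks existing calls."""
    slots = {"name": None, "repr": None, "hash": None}
    i = 0
    while args[i] != DONE:
        slots[_FIELDS[args[i]]] = args[i + 1]
        i += 2
    return slots

t = type_initialize(TYPE_NAME, "unicode", SLOT_REPR, repr, DONE)
```

Note how SLOT_HASH was never mentioned in the call, yet the result is still fully initialized, which is exactly the property that makes growing the slot table cheap.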
[...]
I would add __unicode__ support without tp_unicode right away.
I like this idea. There is no need to piggyback the Unicode representation of objects onto tp_str/__str__. Both PyObject_Str() and PyObject_Unicode() will get much simpler. But we will need int.__unicode__, float.__unicode__ etc. (or a fallback to __str__).

BTW, what about __repr__? Should it be allowed to return Unicode objects? (Currently it is, and uses PyUnicode_AsUnicodeEscapeString.)

Bye, Walter Dörwald
Adding a slot is a bit painful now that there are so many new slots already (adding it to the end means you have to add tons of zeros, adding it to the middle means I have to edit every file).
Hmm, what about a type object initialisation function that takes "named arguments" via varargs:

    PyType_Initialize(&PyUnicode_Type,
        TYPE_TYPE, &PyType_Type,
        TYPE_NAME, "unicode",
        SLOT_DESTRUCTOR, _PyUnicode_Free,
        SLOT_CMP, unicode_compare,
        SLOT_REPR, unicode_repr,
        SLOT_SEQ, unicode_as_sequence,
        SLOT_HASH, unicode_hash,
        DONE
    );

The SLOT_xxx arguments would be #defines like this:

    #define DONE            0
    #define TYPE_TYPE       1
    #define TYPE_NAME       2
    #define SLOT_DESTRUCTOR 3
    #define SLOT_CMP        4

Adding a new slot would require much less work: define a new slot *somewhere* in the struct, define a new SLOT_xxx, and add SLOT_xxx, foo_xxx to the call to the initializer for every type that implements this slot. Performance shouldn't be a problem, because this function would only be called once for every type. And we could get rid of the problem with static initialization of ob_type with some compilers.
Cool idea. It would definitely be worth pursuing when starting from scratch. Right now, though, it would only slow us down to convert all the existing statically initialized types to this mechanism. Also, for some of the built-in types we'd have to decide on a point in the initialization sequence at which to initialize them.
[...]
I would add __unicode__ support without tp_unicode right away.
I like this idea. There is no need to piggyback unicode representation of objects onto tp_str/__str__. Both PyObject_Str and PyObject_Unicode will get much simpler.
But we will need int.__unicode__, float.__unicode__ etc. (or a fallback to __str__)
We should fall back to tp_str -- for most of these types there's never a need to generate non-ASCII characters, so using the ASCII representation and converting that to Unicode works just fine.
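That claim is easy to check: the built-in numeric reprs are pure ASCII, so widening the tp_str result loses nothing.

```python
# If any of these produced a non-ASCII character, encode("ascii")
# would raise UnicodeEncodeError.
for value in (42, -1, 3.14, 10**100, float("inf")):
    str(value).encode("ascii")
```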
BTW, what about __repr__? Should this be allowed to return unicode objects? (currently it is, and uses PyUnicode_AsUnicodeEscapeString)
But this is rarely what the caller expects, and it violates the guideline that repr() should return something that can be fed back to the parser. I'd rather change the rules to require that __repr__ and tp_repr return an 8-bit string at all times. --Guido van Rossum (home page: http://www.python.org/~guido/)
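The guideline cited here is the usual eval(repr(x)) round-trip, which presumes that repr() emits plain, parser-ready source text:

```python
# Round-tripping only works because repr() yields text the
# parser can read back unchanged.
for x in ("caf\xe9", [1, 2], (3.0, None)):
    assert eval(repr(x)) == x
```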
Guido van Rossum wrote:
[...]
BTW, what about __repr__? Should this be allowed to return unicode objects? (currently it is, and uses PyUnicode_AsUnicodeEscapeString)
But this is rarely what the caller expects, and it violates the guideline that repr() should return something that can be fed back to the parser.
I'd say this is a bug in the parser! >;)
I'd rather change the rules to require that __repr__ and tp_repr return an 8-bit string at all times.
Sounds reasonable, and again makes the implementation simpler (as long as we're not in an all-Unicode world). Bye, Walter Dörwald
participants (3)
- Guido van Rossum
- M.-A. Lemburg
- Walter Dörwald