__str__ vs. __unicode__ - Python-Dev

__str__ vs. __unicode__

Walter Dörwald

18 Jan 2005

4:05 p.m.

__str__ and __unicode__ seem to behave differently. A __str__ overwrite in a str subclass is used when calling str(), a __unicode__ overwrite in a unicode subclass is *not* used when calling unicode():

-------------------------------
class str2(str):
    def __str__(self):
        return "foo"

x = str2("bar")
print str(x)

class unicode2(unicode):
    def __unicode__(self):
        return u"foo"

x = unicode2(u"bar")
print unicode(x)
-------------------------------

This outputs:

foo
bar

IMHO this should be fixed so that __unicode__() is used in the second case too.

Bye,
   Walter Dörwald
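For readers on Python 3, where str and unicode were merged into a single str type, the str half of this example can be sketched as follows. This is an illustrative translation and not part of the original post; the class name Str2 is made up here.

```python
# Python 3 sketch: str is the only text type, so only the __str__ half applies.
class Str2(str):
    def __str__(self):
        return "foo"

x = Str2("bar")
converted = str(x)        # str() consults the overridden __str__
print(converted)          # foo
print(type(converted))    # <class 'str'>
print(x == "bar")         # True: the underlying character data is still "bar"
```

The instance still compares equal to "bar" because comparison looks at the stored data, not at the result of the hook.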

M.-A. Lemburg

18 Jan

6:38 p.m.

Walter Dörwald wrote:

__str__ and __unicode__ seem to behave differently. A __str__ overwrite in a str subclass is used when calling str(), a __unicode__ overwrite in a unicode subclass is *not* used when calling unicode():

-------------------------------
class str2(str):
    def __str__(self):
        return "foo"

x = str2("bar")
print str(x)

class unicode2(unicode):
    def __unicode__(self):
        return u"foo"

x = unicode2(u"bar")
print unicode(x)
-------------------------------

This outputs:

foo
bar

IMHO this should be fixed so that __unicode__() is used in the second case too.

If you drop the base class for unicode, this already works.

This code in object.c:PyObject_Unicode() is responsible for the sub-class version not doing what you'd expect:

    if (PyUnicode_Check(v)) {
        /* For a Unicode subtype that's not a Unicode object,
           return a true Unicode object with the same data. */
        return PyUnicode_FromUnicode(PyUnicode_AS_UNICODE(v),
                                     PyUnicode_GET_SIZE(v));
    }

So the question is whether conversion of a Unicode sub-type to a true Unicode object should honor __unicode__ or not.

The same question can be asked for many other types, e.g. floats (and __float__), integers (and __int__), etc.

>>> class float2(float):
...     def __float__(self):
...         return 3.141
...
>>> float(float2(1.23))
1.23
>>> class int2(int):
...     def __int__(self):
...         return 42
...
>>> int(int2(123))
123

I think we need general consensus on what the strategy should be: honor these special hooks in conversions to base types or not?

Maybe the string case is the real problem ... :-)

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Jan 10 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
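For comparison, the float and int experiment can be repeated on a current Python 3 interpreter, where, as far as I am aware, the constructors do consult the hook even when the argument is an instance of a subclass. The class names below mirror the ones in the message; treat this as a sketch of present-day behaviour, not a statement about the 2005 interpreter discussed in the thread.

```python
class Float2(float):
    def __float__(self):
        return 3.141

class Int2(int):
    def __int__(self):
        return 42

f = float(Float2(1.23))   # the overridden __float__ is used
i = int(Int2(123))        # the overridden __int__ is used
print(f)  # 3.141
print(i)  # 42
```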

Walter Dörwald

19 Jan

10:40 a.m.

M.-A. Lemburg wrote:

Walter Dörwald wrote:

__str__ and __unicode__ seem to behave differently. A __str__ overwrite in a str subclass is used when calling str(), a __unicode__ overwrite in a unicode subclass is *not* used when calling unicode():

[...]

If you drop the base class for unicode, this already works.

That's cheating! ;) My use case is an XML DOM API: __unicode__() should extract the character data from the DOM. For Text nodes this is the text, for comments and processing instructions this is u"" etc. To reduce memory footprint and to inherit all the unicode methods, it would be good if Text, Comment and ProcessingInstruction could be subclasses of unicode.

This code in object.c:PyObject_Unicode() is responsible for the sub-class version not doing what you'd expect:

    if (PyUnicode_Check(v)) {
        /* For a Unicode subtype that's not a Unicode object,
           return a true Unicode object with the same data. */
        return PyUnicode_FromUnicode(PyUnicode_AS_UNICODE(v),
                                     PyUnicode_GET_SIZE(v));
    }

So the question is whether conversion of a Unicode sub-type to a true Unicode object should honor __unicode__ or not.

The same question can be asked for many other types, e.g. floats (and __float__), integers (and __int__), etc.

>>> class float2(float):
...     def __float__(self):
...         return 3.141
...
>>> float(float2(1.23))
1.23
>>> class int2(int):
...     def __int__(self):
...         return 42
...
>>> int(int2(123))
123

I think we need general consensus on what the strategy should be: honor these special hooks in conversions to base types or not ?

I'd say, these hooks should be honored, because it gives us more possibilities: If you want the original value, simply don't implement the hook.

Maybe the string case is the real problem ... :-)

At least it seems that the string case is the exception. So if we fix __str__ this would be a bugfix for 2.4.1. If we fix the rest, this would be a new feature for 2.5.

Bye,
   Walter Dörwald

Bob Ippolito

11:10 a.m.

On Jan 19, 2005, at 4:40, Walter Dörwald wrote:

M.-A. Lemburg wrote:

Walter Dörwald wrote:

__str__ and __unicode__ seem to behave differently. A __str__ overwrite in a str subclass is used when calling str(), a __unicode__ overwrite in a unicode subclass is *not* used when calling unicode():

[...] If you drop the base class for unicode, this already works.

That's cheating! ;)

My use case is an XML DOM API: __unicode__() should extract the character data from the DOM. For Text nodes this is the text, for comments and processing instructions this is u"" etc. To reduce memory footprint and to inherit all the unicode methods, it would be good if Text, Comment and ProcessingInstruction could be subclasses of unicode.

It sounds like a really bad idea to have a class that supports both of these properties:

- unicode as a base class
- non-trivial result from unicode(foo)

Do you REALLY think this should be True?!

    isinstance(foo, unicode) and foo != unicode(foo)

Why don't you just call this "extract character data" method something other than __unicode__? That way, you get the reduced memory footprint and convenience methods of unicode, with none of the craziness.

-bob
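A minimal sketch of the alternative suggested here, written for Python 3 where str plays the role of unicode. The class names and the chardata() method are hypothetical and chosen only for illustration; nothing in the thread specifies them.

```python
class Node(str):
    """Hypothetical DOM node that stores its source text as a str subclass."""

    def chardata(self):
        # Default: a node contributes no character data.
        return ""

class Text(Node):
    def chardata(self):
        # A text node's character data is simply its own content.
        return str(self)

class Comment(Node):
    pass  # inherits the empty chardata()

t = Text("hello")
c = Comment(" a remark ")
print(t.chardata())          # hello
print(repr(c.chardata()))    # ''
print(str(c) == c)           # True: conversion to the base type stays trivial
```

With a separately named method, every node still satisfies foo == str(foo), which is the property the message is worried about losing.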

Walter Dörwald

12:19 p.m.

Bob Ippolito wrote:

On Jan 19, 2005, at 4:40, Walter Dörwald wrote:

[...] That's cheating! ;)

My use case is an XML DOM API: __unicode__() should extract the character data from the DOM. For Text nodes this is the text, for comments and processing instructions this is u"" etc. To reduce memory footprint and to inherit all the unicode methods, it would be good if Text, Comment and ProcessingInstruction could be subclasses of unicode.

It sounds like a really bad idea to have a class that supports both of these properties:

- unicode as a base class
- non-trivial result from unicode(foo)

Do you REALLY think this should be True?!

    isinstance(foo, unicode) and foo != unicode(foo)

Why don't you just call this "extract character data" method something other than __unicode__?

IMHO __unicode__ is the most natural and logical choice. isinstance(foo, unicode) is just an implementation detail. But you're right: the consequences of this can be a bit scary.

That way, you get the reduced memory footprint and convenience methods of unicode, with none of the craziness.

Without this craziness we wouldn't have discovered the problem. ;) Whether this craziness gets implemented depends on the solution to this problem.

Bye,
   Walter Dörwald

Alex Martelli

12:22 p.m.

On 2005 Jan 19, at 11:10, Bob Ippolito wrote:

Do you REALLY think this should be True?!

    isinstance(foo, unicode) and foo != unicode(foo)

Hmmmm -- why not? In the generic case, talking about some class B, it certainly violates no programming principle known to me that "isinstance(foo, B) and foo != B(foo)"; it seems a rather common case -- ``casting to the base class'' (in C++ terminology, I guess) ``slices off'' some parts of foo, and thus equality does not hold.

If this is specifically a bad idea for the specific case where B is unicode, OK, that's surely possible, but if so it seems it should be possible to explain this in terms of particular properties of type unicode.

Alex
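The generic case described here can be illustrated with a small Python 3 sketch. The classes Base and Derived and their attributes are invented for this example; the only point is that constructing the base class from a subclass instance drops the extra state, so the two objects need not compare equal.

```python
class Base:
    def __init__(self, source):
        # Accept either a plain value or another Base (copy construction).
        self.name = source.name if isinstance(source, Base) else source

    def __eq__(self, other):
        # Equal only if both the exact type and all attributes match.
        return type(self) is type(other) and vars(self) == vars(other)

class Derived(Base):
    def __init__(self, source, extra):
        super().__init__(source)
        self.extra = extra

foo = Derived("spam", extra=1)
sliced = Base(foo)             # "casting to the base class" drops `extra`

print(isinstance(foo, Base))   # True
print(foo != sliced)           # True
print(sliced.name)             # spam
```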

M.-A. Lemburg

12:42 p.m.

Walter Dörwald wrote:

M.-A. Lemburg wrote:

So the question is whether conversion of a Unicode sub-type to a true Unicode object should honor __unicode__ or not.

The same question can be asked for many other types, e.g. floats (and __float__), integers (and __int__), etc.

>>> class float2(float):
...     def __float__(self):
...         return 3.141
...
>>> float(float2(1.23))
1.23
>>> class int2(int):
...     def __int__(self):
...         return 42
...
>>> int(int2(123))
123

I think we need general consensus on what the strategy should be: honor these special hooks in conversions to base types or not ?

I'd say, these hooks should be honored, because it gives us more possibilities: If you want the original value, simply don't implement the hook.

Maybe the string case is the real problem ... :-)

At least it seems that the string case is the exception.

Indeed.

So if we fix __str__ this would be a bugfix for 2.4.1. If we fix the rest, this would be a new feature for 2.5.

I have a feeling that we're better off with the bug fix than the new feature.

__str__ and __unicode__ as well as the other hooks were specifically added for the type constructors to use. However, these were added at a time where sub-classing of types was not possible, so it's time now to reconsider whether this functionality should be extended to sub-classes as well.

--
Marc-Andre Lemburg
eGenix.com

Nick Coghlan

1:26 p.m.

M.-A. Lemburg wrote:

So if we fix __str__ this would be a bugfix for 2.4.1. If we fix the rest, this would be a new feature for 2.5.

I have a feeling that we're better off with the bug fix than the new feature.

__str__ and __unicode__ as well as the other hooks were specifically added for the type constructors to use. However, these were added at a time where sub-classing of types was not possible, so it's time now to reconsider whether this functionality should be extended to sub-classes as well.

It seems oddly inconsistent though:

"""Define __str__ to determine what your class returns for str(x).

NOTE: This won't work if your class directly or indirectly inherits from str. If that is the case, you cannot alter the results of str(x)."""

At present, most of the type constructors need the caveat, whereas __str__ actually agrees with the simple explanation in the first line.

Going back to PyUnicode, PyObject_Unicode's handling of subclasses of builtins is decidedly odd:

Py> class C(str):
...     def __str__(self): return "I am a string!"
...     def __unicode__(self): return "I am not unicode!"
...
Py> c = C()
Py> str(c)
'I am a string!'
Py> unicode(c)
u''

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan@email.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://boredomandlaziness.skystorm.net

M.-A. Lemburg

1:50 p.m.

Nick Coghlan wrote:

M.-A. Lemburg wrote:

So if we fix __str__ this would be a bugfix for 2.4.1. If we fix the rest, this would be a new feature for 2.5.

I have a feeling that we're better off with the bug fix than the new feature.

__str__ and __unicode__ as well as the other hooks were specifically added for the type constructors to use. However, these were added at a time where sub-classing of types was not possible, so it's time now to reconsider whether this functionality should be extended to sub-classes as well.

It seems oddly inconsistent though:

"""Define __str__ to determine what your class returns for str(x).

NOTE: This won't work if your class directly or indirectly inherits from str. If that is the case, you cannot alter the results of str(x)."""

At present, most of the type constructors need the caveat, whereas __str__ actually agrees with the simple explanation in the first line.

Going back to PyUnicode, PyObject_Unicode's handling of subclasses of builtins is decidedly odd:

Those APIs were all written long before there were sub-classes of types.

Py> class C(str):
...     def __str__(self): return "I am a string!"
...     def __unicode__(self): return "I am not unicode!"
...
Py> c = C()
Py> str(c)
'I am a string!'
Py> unicode(c)
u''

Ah, looks as if the function needs a general overhaul :-)

This section should do a PyString_CheckExact():

    if (PyString_Check(v)) {
        Py_INCREF(v);
        res = v;
    }

But before we start hacking the function, we need a general picture of what we think is right.

Note, BTW, that there is also a tp_str slot that serves as hook. The overall solution to this apparent mess should be consistent for all hooks (__str__, tp_str, __unicode__ and a future tp_unicode).

--
Marc-Andre Lemburg
eGenix.com

Nick Coghlan

2:27 p.m.

M.-A. Lemburg wrote:

Those APIs were all written long before there were sub-classes of types.

Understood. PyObject_Unicode certainly looked like an 'evolved' piece of code :)

But before we start hacking the function, we need a general picture of what we think is right.

Aye.

Note, BTW, that there is also a tp_str slot that serves as hook. The overall solution to this apparent mess should be consistent for all hooks (__str__, tp_str, __unicode__ and a future tp_unicode).

I imagine many people are like me, with __str__ being the only one of these hooks they use frequently (helping out with the Decimal implementation is the only time I can recall using the slots for the numeric types, and I rarely need to deal with Unicode).

Anyway, their heavy use suggests to me that __str__ and str() are likely to provide a good model for the desired behaviour - they're the ones that are likely to have been nudged in the most useful direction by bug reports and the like.

Regards,
Nick.

Walter Dörwald

6:31 p.m.

Nick Coghlan wrote:

[...] I imagine many people are like me, with __str__ being the only one of these hooks they use frequently (Helping out with the Decimal implementation is the only time I can recall using the slots for the numeric types, and I rarely need to deal with Unicode).

Anyway, their heavy use suggests to me that __str__ and str() are likely to provide a good model for the desired behaviour - they're the ones that are likely to have been nudged in the most useful direction by bug reports and the like.

+1

__foo__ provides conversion to foo, no matter whether foo is among the direct or indirect base classes.

Simply moving the PyUnicode_Check() call in PyObject_Unicode() after the __unicode__ call (after the PyErr_Clear() call) will implement this (but does not fix Nick's bug). Running the test suite with this change reveals no other problems.

Bye,
   Walter Dörwald
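The effect of reordering the two checks can be modelled in pure Python 3. This is only a toy model under stated assumptions: str stands in for the old unicode type, the functions convert_check_first and convert_hook_first are hypothetical, and the real logic lives in C inside PyObject_Unicode().

```python
def _plain(text):
    # Copy the character data into an exact str, bypassing any overrides.
    return str.__str__(text)

def convert_check_first(obj):
    # Order before the proposed change: the subclass check wins, the hook is ignored.
    if isinstance(obj, str):
        return _plain(obj)
    hook = getattr(type(obj), "__unicode__", None)
    if hook is not None:
        return hook(obj)
    raise TypeError("cannot convert")

def convert_hook_first(obj):
    # Proposed order: consult the hook, and only then fall back to copying.
    hook = getattr(type(obj), "__unicode__", None)
    if hook is not None:
        return hook(obj)
    if isinstance(obj, str):
        return _plain(obj)
    raise TypeError("cannot convert")

class Unicode2(str):
    def __unicode__(self):
        return "foo"

x = Unicode2("bar")
print(convert_check_first(x))   # bar
print(convert_hook_first(x))    # foo
```

A subclass that does not define the hook gets the same result from both orderings, which matches the earlier remark that you can simply leave the hook out if you want the original value.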

Walter Dörwald

21 Jan

1:10 p.m.

M.-A. Lemburg wrote:

[...] __str__ and __unicode__ as well as the other hooks were specifically added for the type constructors to use. However, these were added at a time where sub-classing of types was not possible, so it's time now to reconsider whether this functionality should be extended to sub-classes as well.

So can we reach consensus on this, or do we need a BDFL pronouncement?

Bye,
   Walter Dörwald

M.-A. Lemburg

23 Jan

3:27 p.m.

Walter Dörwald wrote:

M.-A. Lemburg wrote:

[...]

__str__ and __unicode__ as well as the other hooks were specifically added for the type constructors to use. However, these were added at a time where sub-classing of types was not possible, so it's time now to reconsider whether this functionality should be extended to sub-classes as well.

So can we reach consensus on this, or do we need a BDFL pronouncement?

I don't have a clear picture of what the consensus currently looks like :-)

If we're going for a solution that implements the hook awareness for all __<typename>__ hooks, I'd be +1 on that. If we only touch the __unicode__ case, we'd only be creating yet another special case. I'd vote -0 on that.

Another solution would be to have all type constructors ignore the __<typename>__ hooks (which were originally added to provide classes with a way to mimic type behavior).

In general, I think we should try to get rid of special cases and go for a clean solution (either way).

--
Marc-Andre Lemburg
eGenix.com

Walter Dörwald

25 Jan

11:13 p.m.

M.-A. Lemburg wrote:

Walter Dörwald wrote:

M.-A. Lemburg wrote:

[...]

__str__ and __unicode__ as well as the other hooks were specifically added for the type constructors to use. However, these were added at a time where sub-classing of types was not possible, so it's time now to reconsider whether this functionality should be extended to sub-classes as well.

So can we reach consensus on this, or do we need a BDFL pronouncement?

I don't have a clear picture of what the consensus currently looks like :-)

If we're going for a solution that implements the hook awareness for all __<typename>__ hooks, I'd be +1 on that. If we only touch the __unicode__ case, we'd only be creating yet another special case. I'd vote -0 on that. [...]

Here's the patch that implements this for int/long/float/unicode: http://www.python.org/sf/1109424

Note that complex already did the right thing.

For int/long/float this is implemented in the following way: Converting an instance of a subclass to the base class is done in the appropriate slot of the type (i.e. intobject.c::int_int() etc.) instead of in PyNumber_Int()/PyNumber_Long()/PyNumber_Float(). It's still possible for a conversion method to return an instance of a subclass of int/long/float.

Bye,
   Walter Dörwald

Brett C.

24 Feb

8:23 a.m.

Walter Dörwald wrote:

M.-A. Lemburg wrote:

Walter Dörwald wrote:

M.-A. Lemburg wrote:

[...]

__str__ and __unicode__ as well as the other hooks were specifically added for the type constructors to use. However, these were added at a time where sub-classing of types was not possible, so it's time now to reconsider whether this functionality should be extended to sub-classes as well.

So can we reach consensus on this, or do we need a BDFL pronouncement?

I don't have a clear picture of what the consensus currently looks like :-)

If we're going for a solution that implements the hook awareness for all __<typename>__ hooks, I'd be +1 on that. If we only touch the __unicode__ case, we'd only be creating yet another special case. I'd vote -0 on that.

[...]

Here's the patch that implements this for int/long/float/unicode: http://www.python.org/sf/1109424

Any movement on this?

+1 for making things work like str; if a subclass overrides __str__ it should use that method. If correctness of what is returned is a worry then a check could be tossed in before the value is returned.

-Brett

Walter Dörwald

4:32 p.m.

Brett C. wrote:

Walter Dörwald wrote:

M.-A. Lemburg wrote:

[...] I don't have a clear picture of what the consensus currently looks like :-)

If we're going for a solution that implements the hook awareness for all __<typename>__ hooks, I'd be +1 on that. If we only touch the __unicode__ case, we'd only be creating yet another special case. I'd vote -0 on that. [...]

Here's the patch that implements this for int/long/float/unicode: http://www.python.org/sf/1109424

Any movement on this? +1 for making things work like str; if a subclass overrides __str__ it should use that method. If correctness of what is returned is a worry then a check could be tossed in before the value is returned.

It already works that way:

Python 2.5a0 (#1, Feb 24 2005, 16:25:04)
[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-113)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class u(unicode):
...     def __unicode__(self): return 42
...
>>> unicode(u("foo"))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, int found
>>> class i(int):
...     def __int__(self): return "foo"
...
>>> int(i(42))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: __int__ returned non-int (type str)
>>> class l(long):
...     def __long__(self): return "foo"
...
>>> long(l(42))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: __long__ returned non-long (type str)
>>> class f(float):
...     def __float__(self): return "foo"
...
>>> float(f(42))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: __float__ returned non-float (type str)

Bye,
   Walter Dörwald
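The same kind of result checking can be observed on a current Python 3 interpreter. The sketch below assumes Python 3 (no unicode or long types) and uses made-up class names; the helper raises_type_error is hypothetical and exists only to keep the example short.

```python
class BadInt(int):
    def __int__(self):
        return "foo"      # wrong type on purpose

class BadFloat(float):
    def __float__(self):
        return "foo"      # wrong type on purpose

class BadStr(str):
    def __str__(self):
        return 42         # wrong type on purpose

def raises_type_error(func, arg):
    # Return True if calling func(arg) raises TypeError, else False.
    try:
        func(arg)
    except TypeError:
        return True
    return False

results = {
    "int": raises_type_error(int, BadInt(42)),
    "float": raises_type_error(float, BadFloat(42)),
    "str": raises_type_error(str, BadStr("foo")),
}
print(results)  # {'int': True, 'float': True, 'str': True}
```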

Brett C.

9:10 p.m.

Walter Dörwald wrote:

Brett C. wrote:

Walter Dörwald wrote:

M.-A. Lemburg wrote:

[...] I don't have a clear picture of what the consensus currently looks like :-)

If we're going for a solution that implements the hook awareness for all __<typename>__ hooks, I'd be +1 on that. If we only touch the __unicode__ case, we'd only be creating yet another special case. I'd vote -0 on that. [...]

Here's the patch that implements this for int/long/float/unicode: http://www.python.org/sf/1109424

Any movement on this? +1 for making things work like str; if a subclass overrides __str__ it should use that method. If correctness of what is returned is a worry then a check could be tossed in before the value is returned.

It already works that way:

Python 2.5a0 (#1, Feb 24 2005, 16:25:04)
[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-113)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class u(unicode):
...     def __unicode__(self): return 42

Well then I am +1 on doing this. Since this is a semantic change probably need Guido to OK this?

-Brett

Brett C.

25 Feb

3:58 a.m.

Brett C. wrote:

Walter Dörwald wrote:

Brett C. wrote:

Walter Dörwald wrote:

M.-A. Lemburg wrote:

[...] I don't have a clear picture of what the consensus currently looks like :-)

If we're going for a solution that implements the hook awareness for all __<typename>__ hooks, I'd be +1 on that. If we only touch the __unicode__ case, we'd only be creating yet another special case. I'd vote -0 on that.

[SNIP]

Since this is a semantic change probably need Guido to OK this?

... which we now have. Assigned the patch to myself. I will get to it eventually.

-Brett

Aahz

19 Jan

4:04 p.m.

On Wed, Jan 19, 2005, Walter Dörwald wrote:

M.-A. Lemburg wrote:

Maybe the string case is the real problem ... :-)

At least it seems that the string case is the exception. So if we fix __str__ this would be a bugfix for 2.4.1.

Nope. Unless you're claiming the __str__ behavior is new in 2.4? (Haven't been following the thread closely.)

--
Aahz (aahz@pythoncraft.com)           <*>         http://www.pythoncraft.com/

"19. A language that doesn't affect the way you think about programming, is not worth knowing."  --Alan Perlis
