Mailman 3 unicode inconsistency? - Python-Dev

unicode inconsistency?


Neil Schemenauer

9 Sep 2004

8:07 p.m.

Perhaps this is more appropriate for python-list but it looks like a bug to me. Example code:

    class A:
        def __str__(self):
            return u'\u1234'

    '%s' % u'\u1234'  # this works
    '%s' % A()        # this doesn't work

It will work if 'A' subclasses from 'unicode' but should not be necessary, IMHO. Any reason why this shouldn't be fixed?

Neil


Aahz

9 Sep 2004

8:09 p.m.

On Thu, Sep 09, 2004, Neil Schemenauer wrote:

...

Perhaps this is more appropriate for python-list but it looks like a bug to me. Example code:

    class A:
        def __str__(self):
            return u'\u1234'

    '%s' % u'\u1234'  # this works
    '%s' % A()        # this doesn't work

It will work if 'A' subclasses from 'unicode' but should not be necessary, IMHO. Any reason why this shouldn't be fixed?

Check the recent python-dev archives for a long and nauseating thread about interactions between __str__ and unicode.

-- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

"A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." --Ralph Waldo Emerson

"Martin v. Löwis"

8:42 p.m.

Aahz wrote:

...

...
It will work if 'A' subclasses from 'unicode' but should not be necessary, IMHO. Any reason why this shouldn't be fixed?

Check the recent python-dev archives for a long and nauseating thread about interactions between __str__ and unicode.

Although that really doesn't answer this particular question. It was about str() and its interaction with __str__ and __unicode__, and whether Python should support __unicode__. For the specific issue, I would maintain that str() should always return string objects. I'm not so sure about %s since, as Neil observes, '%s' % unicode_string gives a unicode result. I can't see any harm by supporting this operation also if __str__ returns a Unicode object. Regards, Martin

Tim Peters

9 p.m.

[Martin v. Löwis]

...

... For the specific issue, I would maintain that str() should always return string objects.

__builtin__.str() always does -- or raises an exception. Same for PyObject_Str() and PyObject_Repr().

...

I'm not so sure about %s since, as Neil observes, '%s' % unicode_string gives a unicode result.

That's because PyString_Format()'s '%s' processing special-cases the snot out of unicode *inputs*. All other inputs to '%s' (and '%r') go thru PyObject_Str() or PyObject_Repr(), and, as above, those never return a unicode. In Neil's case, they raise the expected exception, and there's nothing sane PyString_Format can do about that.
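The dispatch described here can be modelled in a few lines. This is only a sketch, written for Python 3 so that it runs today: `bytes` stands in for the 8-bit `str` type of the time, `str` stands in for `unicode`, and the helper names `py2_str` and `py2_percent_s` are invented for illustration (they are not CPython functions). It also assumes the default encoding is ASCII.

```python
# Illustrative model only: bytes plays the role of Python 2's 8-bit str,
# and str plays the role of Python 2's unicode. Helper names are made up.

class A:
    def __str__(self):
        return "\u1234"  # a "unicode" return value in this model


def py2_str(obj):
    """Model of str() / PyObject_Str: return an 8-bit string or raise."""
    if isinstance(obj, bytes):
        return obj
    result = obj if isinstance(obj, str) else type(obj).__str__(obj)
    if isinstance(result, str):
        # A unicode result is encoded with the (assumed ASCII) default encoding.
        return result.encode("ascii")
    return result


def py2_percent_s(arg):
    """Model of '%s' % arg with an 8-bit format string."""
    if isinstance(arg, str):
        # Unicode *inputs* are special-cased: the result becomes unicode.
        return arg
    # Everything else goes through the str() model, which never yields unicode.
    return py2_str(arg)


print(ascii(py2_percent_s("\u1234")))  # the special case: stays unicode
try:
    py2_percent_s(A())
except UnicodeEncodeError as exc:
    print("A() fails inside the str() model:", exc.reason)
```

In this model the failure for `A()` happens inside the `str()` step, before the formatting code ever sees a unicode value, which is the point being made about Neil's case.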

...

I can't see any harm by supporting this operation also if __str__ returns a Unicode object.

It doesn't sound like a good idea to me, at least in part because it would be darned messy to implement short of saying "OK, we don't give a rip anymore about what type of objects PyObject_{Str,Repr} return", and that would have broader consequences than just letting Neil get away with whatever he's trying to do with str.__mod__.

Neil Schemenauer

10:03 p.m.

On Thu, Sep 09, 2004 at 03:00:07PM -0400, Tim Peters wrote:

...

[Martin v. Löwis]

...
I can't see any harm by supporting this operation also if __str__ returns a Unicode object.

It doesn't sound like a good idea to me, at least in part because it would be darned messy to implement short of saying "OK, we don't give a rip anymore about what type of objects PyObject_{Str,Repr} return"

Just to be clear, I don't propose allowing PyObject_Str and PyObject_Repr to return unicode objects. That would be a disaster, IMO. Neil

Neil Schemenauer

9 Mar 2005

3:08 a.m.

On Sat, Apr 04, 1998 at 07:04:02AM +0000, Tim Peters wrote:

...

[Martin v. Löwis]

...
I can't see any harm by supporting this operation also if __str__ returns a Unicode object.

It doesn't sound like a good idea to me, at least in part because it would be darned messy to implement short of saying "OK, we don't give a rip anymore about what type of objects PyObject_{Str,Repr} return",

It's about 10 lines of code. See http://python.org/sf/1159501 . Neil

M.-A. Lemburg

11:10 a.m.

Neil Schemenauer wrote:

...

On Sat, Apr 04, 1998 at 07:04:02AM +0000, Tim Peters wrote:

...
[Martin v. Löwis]

...
I can't see any harm by supporting this operation also if __str__ returns a Unicode object.

It doesn't sound like a good idea to me, at least in part because it would be darned messy to implement short of saying "OK, we don't give a rip anymore about what type of objects PyObject_{Str,Repr} return",

It's about 10 lines of code. See http://python.org/sf/1159501 .

The patch implements the PyObject_Text() idea (an API that returns a basestring instance, i.e. string or unicode) and then uses this in '%s' (the string version) to properly propagate to u'%s' (the unicode version).

Maybe we should also expose the C API as suggested in the patch, e.g. as text(obj).

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Mar 09 2005)
Python/Zope Consulting and Support ... http://www.egenix.com/
mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Neil Schemenauer

7:16 p.m.

On Wed, Mar 09, 2005 at 11:10:59AM +0100, M.-A. Lemburg wrote:

...

The patch implements the PyObject_Text() idea (an API that returns a basestring instance, i.e. string or unicode) and then uses this in '%s' (the string version) to properly propagate to u'%s' (the unicode version).

Maybe we should also expose the C API as suggested in the patch, e.g. as text(obj).

Perhaps the right thing to do is introduce a new format code that means insert text(obj) instead of str(obj), e.g %t. If we do that though then we should make "'%s' % u'xyz'" return a string instead of a unicode object. I suspect that would break a lot of code.

OTOH, having %s mean text(obj) instead of str(obj) may work just fine. People who want it to mean str() generally don't have any unicode strings floating around so text() has the same effect. People who are using unicode probably would find text() to be more useful behavior. I think that's why someone hacked PyString_Format to sometimes return unicode strings.

Regarding the use of __str__, to return a unicode object: we could introduce a new slot (e.g. __text__) instead. However, I can't see any advantage to that. If someone really wants a str object then they call str() or PyObject_Str().

Neil
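A rough model of what text(obj) could mean, as a sketch only. It is written for Python 3 so that it runs today, with `bytes` standing in for the 8-bit `str` type and `str` standing in for `unicode`. The function body and the two example classes are assumptions made for illustration; the actual patch at http://python.org/sf/1159501 is C code and is not reproduced here.

```python
# Illustrative model only: bytes = Python 2 str, str = Python 2 unicode.

def text(obj):
    """Return whichever kind of string __str__ produced, without encoding it."""
    if isinstance(obj, (bytes, str)):
        return obj
    result = type(obj).__str__(obj)
    if not isinstance(result, (bytes, str)):
        raise TypeError("__str__ returned a non-string")
    return result


class AsciiOnly:
    def __str__(self):
        return b"plain"  # an 8-bit result in this model


class A:
    def __str__(self):
        return "\u1234"  # a unicode result in this model


print(ascii(text(AsciiOnly())))  # 8-bit in, 8-bit out
print(ascii(text(A())))          # unicode result is passed through untouched
```

Under this model, code that never touches unicode sees the same results as with str(), while a unicode return value propagates instead of being forced through the default encoding.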

M.-A. Lemburg

8:04 p.m.

Neil Schemenauer wrote:

...

On Wed, Mar 09, 2005 at 11:10:59AM +0100, M.-A. Lemburg wrote:

...
The patch implements the PyObject_Text() idea (an API that returns a basestring instance, i.e. string or unicode) and then uses this in '%s' (the string version) to properly propagate to u'%s' (the unicode version).

Maybe we should also expose the C API as suggested in the patch, e.g. as text(obj).

Perhaps the right thing to do is introduce a new format code that means insert text(obj) instead of str(obj), e.g %t. If we do that though then we should make "'%s' % u'xyz'" return a string instead of a unicode object. I suspect that would break a lot of code.

It would result in lots of UnicodeErrors due to failing conversion of the Unicode string to a string. Plus it would break with the general rule of always coercing to Unicode (see below) and lose us the ability to write polymorphic code.

...

OTOH, having %s mean text(obj) instead of str(obj) may work just fine. People who want it to mean str() generally don't have any unicode strings floating around so text() has the same effect. People who are using unicode probably would find text() to be more useful behavior. I think that's why someone hacked PyString_Format to sometimes return unicode strings.

That wasn't a hack: it's part of the Unicode integration logic which always coerces to Unicode if strings and Unicode meet. In the above case a string format string meets a Unicode object as argument which then results in a Unicode object to be returned.
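The coercion rule can be illustrated with a small model. This is a sketch for Python 3, where `bytes` stands in for the 8-bit string type and `str` for Unicode; the names `coerce_pair` and `concat` are hypothetical, and the 8-bit side is assumed to be decoded with an ASCII default encoding.

```python
# Illustrative model only: bytes = Python 2 str, str = Python 2 unicode.

def coerce_pair(left, right):
    """If an 8-bit string meets a Unicode string, promote the 8-bit side."""
    if isinstance(left, str) and isinstance(right, bytes):
        right = right.decode("ascii")  # assumed ASCII default encoding
    elif isinstance(left, bytes) and isinstance(right, str):
        left = left.decode("ascii")
    return left, right


def concat(left, right):
    """Concatenate two strings after applying the coercion rule."""
    left, right = coerce_pair(left, right)
    return left + right


print(ascii(concat(b"abc", b"def")))    # both 8-bit: result stays 8-bit
print(ascii(concat(b"abc", "\u1234")))  # mixed: result is Unicode
```

Formatting follows the same pattern in this picture: an 8-bit format string that meets a Unicode argument produces a Unicode result.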

...

Regarding the use of __str__, to return a unicode object: we could introduce a new slot (e.g. __text__) instead. However, I can't see any advantage to that. If someone really wants a str object then they call str() or PyObject_Str().

Right.

-- Marc-Andre Lemburg


Neil Schemenauer

9 Sep 2004

8:50 p.m.

On Thu, Sep 09, 2004 at 02:09:56PM -0400, Aahz wrote:

...

Check the recent python-dev archives for a long and nauseating thread about interactions between __str__ and unicode.

Using __unicode__ doesn't help. The core problem is that you cannot create a class that behaves like 'unicode' in this operation without subclassing from 'unicode'. That violates the "duck typing" design principle of Python. We violate it other places, usually in the name of efficiency, but I see no good reason in this case. I suspect the fix will be pretty straightforward (call tp_str and if the result is 'unicode' then produce a 'unicode' string). Again, is there some reason why we don't want this behavior?

Neil

Tim Peters

9:11 p.m.

[Neil Schemenauer]

...

... I suspect the fix will be pretty straightforward (call tp_str and if the result is 'unicode' then produce a 'unicode' string). Again, is there some reason why we don't want this behavior?

Yes: '%s' is documented as "String (converts any python object using str())". It's str(A()) that raises the exception you're seeing, not interpolation. To worm around that, you'll effectively have to duplicate PyObject_Str's implementation (which is more than just calling tp_str -- that may not exist -- you'll end up at least duplicating PyObject_Repr's implementation too) inside PyString_Format(), and end up with a mess that's harder to explain too. The *real* problem (IMO) is that we don't have a format code that means "stick the unicode representation here", i.e. there's no format code that triggers PyObject_Unicode() directly. unicode.__mod__ treats '%s' that way, but that isn't documented.

Neil Schemenauer

9:57 p.m.

On Thu, Sep 09, 2004 at 03:11:51PM -0400, Tim Peters wrote:

...

'%s' is documented as "String (converts any python object using str())". It's str(A()) that raises the exception you're seeing, not interpolation.

Shouldn't '%s' % u'\u1234' also raise an exception then?

...

To worm around that, you'll effectively have to duplicate PyObject_Str's implementation

Yes. I want something like "PyObject_UnicodeOrStr" that would return either a unicode object or a str object. That would make it easier to write code that produces 'str' results if unicode characters don't appear in any of the inputs.

Having __str__ methods that can return either 'unicode' or 'str' objects is also very handy (I don't see how you can say that it doesn't make any sense).

Perhaps I am on the wrong track. However, if I understand the /F bot correctly, he favours a design that does not force everything to unicode strings.

Neil

Fredrik Lundh

10:12 p.m.

Neil Schemenauer wrote:

...

Perhaps I am on the wrong track. However, if I understand the /F bot correctly, he favours a design that does not force everything to unicode strings.

that's correct. I'm beginning to think that we need an extra method (__text__), that can return any kind of string that's compatible with Python's text model. (in today's CPython, that's an 8-bit string with ASCII only, or a Unicode string. future Pythons may support more string types, at least at the C implementation level). I'm not sure we can change __str__ or __unicode__ without breaking code in really obscure ways (but I'd be happy to be proven wrong). </F>

Aahz

10:21 p.m.

On Thu, Sep 09, 2004, Fredrik Lundh wrote:

...

I'm beginning to think that we need an extra method (__text__), that can return any kind of string that's compatible with Python's text model.

+1

While we're at it, that would be a good opportunity to add the __index__ method (for int-like objects that actually support indexing). That would get rid of the issues with using floats as inappropriate inputs. Can't require __index__ until 3.0, but we can start making it available.

-- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
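For reference, Python did later gain an __index__ hook along these lines. The sketch below runs on Python 3; the class name `Position` and the sample list are invented for illustration and do not come from this thread.

```python
import operator


class Position:
    """An int-like object that opts in to being used as an index."""

    def __init__(self, value):
        self.value = value

    def __index__(self):
        return self.value


items = ["a", "b", "c"]
print(items[Position(1)])  # list indexing consults __index__

try:
    operator.index(1.5)    # floats do not define __index__
except TypeError as exc:
    print("rejected:", exc)
```

The hook lets an object say "I really am usable as an integer index", while floats, which define no such method, are rejected.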

Tim Peters

10:59 p.m.

[Tim]

...

...
'%s' is documented as "String (converts any python object using str())". It's str(A()) that raises the exception you're seeing, not interpolation.

[Neil]

...

Shouldn't '%s' % u'\u1234' also raise an exception then?

Yes, but the existence of one undocumented extension isn't sufficient reason to multiply them. The "Unicode exception" here is at least easy to explain. To make your case work, we somehow have to explain that although virtually all ways of invoking __str__ produce an 8-bit encoding of a unicode return value, for some magical reason str.__mod__ does not. The existing "Unicode exception" consists solely of saying "but unicode inputs don't invoke str(), and force the interpolation to get passed to unicode.__mod__ instead".

...

Yes. I want something like "PyObject_UnicodeOrStr" that would return either a unicode object or a str object. That would make it easier to write code that produces 'str' results if unicode characters don't appear in any of the inputs.

I think biting the Unicode bullet whole is saner, but suit yourself.

...

Having __str__ methods that can return either 'unicode' or 'str' objects is also very handy (I don't see how you can say that it doesn't make any sense).

Didn't we go thru that last week <wink>? Yes:

    [Neil]
    [... the same class as today's class ...]

    [Martin]
    > This class is incorrect: it does not support str().

    [Neil]
    > Can you be more specific about what is incorrect with the above
    > class?

    [Martin]
    In the default installation, it gives a UnicodeEncodeError.

You didn't respond to that (at least not that I saw), so I assumed you accepted Martin's nag. Having a __str__ that returns a unicode object that the default encoding can't handle is clearly (IMO) begging for trouble.

...

Perhaps I am on the wrong track. However, if I understand the /F bot correctly, he favours a design that does not force everything to unicode strings.

Saying it doesn't make sense to have a __str__ method return a Unicode value that can't be encoded *as* a str isn't asking anyone to force anything to Unicode. __str__ is still trying hard to retain a *distinction* between str and unicode. PyObject_Unicode() no longer plays along with that distinction, but I (mildly) wish it still did.

Tim Peters

8:44 p.m.

[Neil Schemenauer]

...

Perhaps this is more appropriate for python-list but it looks like a bug to me. Example code:

    class A:
        def __str__(self):
            return u'\u1234'

    '%s' % u'\u1234'  # this works
    '%s' % A()        # this doesn't work

It will work if 'A' subclasses from 'unicode' but should not be necessary, IMHO.

You know better than to say "doesn't work". I assume you mean the latter raises UnicodeEncodeError.

...

Any reason why this shouldn't be fixed?

Didn't we just go thru this, last week or so? PyObject_Str() never returns a unicode (it returns a str). That is, str(A()) raises UnicodeEncodeError, and that's out of interpolation's hands. As Martin said last time, a __str__ method that returns a unicode doesn't make much sense. I'm not sure you really mean "it will work if 'A' subclasses from 'unicode'" either:

    >>> class A(unicode):
    ...     def __str__(self):
    ...         return u'\u1234'
    ...
    >>> '%s' % A()
    u''
    >>> len(_)
    0

That is, A.__str__ is ignored if A subclasses from Unicode. So "doesn't blow up" seems more on-target than "works" -- I don't think you expected an empty Unicode string here.
