Mailman 3 String formatting / unicode 2.5 bug? - Python-Dev

String formatting / unicode 2.5 bug?

John J Lee

19 Aug 2006

7:27 p.m.

Is this a bug?

    # run with 2.4 and then with 2.5 (I'm running release25-maint:51410)

    class a(object):
        def __getattribute__(self, name):
            print "accessing %r.%s" % (self, name)
            return object.__getattribute__(self, name)

        def __str__(self):
            print "__str__"
            return u'hi'

    print repr(str(a()))
    print
    print repr("%s" % a())

John

Nick Coghlan

20 Aug 2006

12:03 a.m.

John J Lee wrote:

> Is this a bug?

I don't believe so - the string formatting documentation states that the result will be unicode if either the format string is unicode or any of the objects passed to a %s format code is unicode.

That latter part has just been extended to include any object that returns Unicode from __str__, instead of being restricted to actual Unicode instances.

Note that the following behaves the same way regardless of whether you use 2.4 or 2.5:

    "%s" % 'hi'
    "%s" % u'hi'

And once the result has been promoted to unicode, __unicode__ is used directly:

    print repr("%s%s" % (a(), a()))
    __str__
    accessing <__main__.a object at 0x00AF66F0>.__unicode__
    __str__
    accessing <__main__.a object at 0x00AF6390>.__unicode__
    __str__
    u'hihi'

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

John J Lee

7:45 a.m.

On Sun, 20 Aug 2006, Nick Coghlan wrote:

> John J Lee wrote:
>> Is this a bug?
>
> I don't believe so - the string formatting documentation states that the result will be unicode if either the format string is unicode or any of the objects passed to a %s format code is unicode.
>
> That latter part has just been extended to include any object that returns Unicode from __str__, instead of being restricted to actual Unicode instances.
>
> Note that the following behaves the same way regardless of whether you use 2.4 or 2.5:
>
>     "%s" % 'hi'
>     "%s" % u'hi'

Given that, the following wording should be changed:

http://docs.python.org/lib/typesseq-strings.html

    Conversion  Meaning                                            Notes
    ...
    s           String (converts any python object using str()).   (4)
    ...

    (4) If the object or format provided is a unicode string, the resulting string will also be unicode.

The note (4) says that the result will be unicode, but it doesn't say how, in this case, that comes about. This case is confusing because the docs claim string formatting with %s "converts ... using str()", and yet str(a()) returns a bytestring. Does it *really* use str, or just __str__? Surely the latter? (given the observed behaviour, and not reading the C source)

FWIW, this change broke epydoc (fails with an AssertionError -- so perhaps without the assert it would still "work", dunno).

> And once the result has been promoted to unicode, __unicode__ is used directly:
>
>     print repr("%s%s" % (a(), a()))
>     __str__
>     accessing <__main__.a object at 0x00AF66F0>.__unicode__
>     __str__
>     accessing <__main__.a object at 0x00AF6390>.__unicode__
>     __str__
>     u'hihi'

I don't understand this part. Why is __unicode__ called? Your example doesn't appear to show this happening "once [i.e., because?] the result has been promoted to unicode" -- if that were true, it would "stand to reason" <wink> that the interpreter would then conclude it should call __unicode__ for all remaining %s, and not bother with __str__.

If OTOH __unicode__ is called because __str__ returned a unicode object, it makes (very slightly) more sense that it goes through the same __str__-then-__unicode__ rigmarole for each object on the RHS of the %.

But none of that seems to make a huge amount of sense. I've now found the September 2004 discussion of this, and I'm none the wiser.

John

Neil Schemenauer

11:39 a.m.

John J Lee <jjl@pobox.com> wrote:

> The note (4) says that the result will be unicode, but it doesn't say how, in this case, that comes about. This case is confusing because the docs claim string formatting with %s "converts ... using str()", and yet str(a()) returns a bytestring. Does it *really* use str, or just __str__? Surely the latter? (given the observed behaviour, and not reading the C source)

It uses __str__ and confirms that the returned object is a 'str' or 'unicode'. The docs are not precise but they were not for 2.4 either. Note the following case:

    '%s' % u'Hello!'

The operand is not forced to be a str.

Neil
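The gap Neil describes, between the str() builtin and a direct call to __str__, can still be observed on Python 3, although the details differ: in Python 2, str() coerced a unicode result to a bytestring, whereas Python 3's str() rejects a non-str result outright. The following is only a Python 3 analogue, not the 2.4/2.5 behaviour under discussion, and `Loose` is a hypothetical class invented for the illustration.

```python
# Python 3 analogue (the thread itself concerns Python 2.4/2.5).
# `Loose` is a hypothetical class whose __str__ breaks the usual contract.

class Loose:
    def __str__(self):
        # Deliberately return bytes rather than str.
        return b"hi"


obj = Loose()

# Calling the method directly hands back whatever it returned.
direct = obj.__str__()

# The builtin inspects the result type and refuses a non-str.
try:
    str(obj)
    builtin_error = None
except TypeError as exc:
    builtin_error = exc

print(repr(direct))
print(type(builtin_error).__name__)
```

The point is simply that "uses str()" and "uses __str__" are different claims: the builtin adds a check (or, in Python 2, a coercion) on top of the method call.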

John J Lee

11:55 a.m.

On Sun, 20 Aug 2006, Neil Schemenauer wrote:

> John J Lee <jjl@pobox.com> wrote:
>> The note (4) says that the result will be unicode, but it doesn't say how, in this case, that comes about. This case is confusing because the docs claim string formatting with %s "converts ... using str()", and yet str(a()) returns a bytestring. Does it *really* use str, or just __str__? Surely the latter? (given the observed behaviour, and not reading the C source)
>
> It uses __str__ and confirms that the returned object is a 'str' or 'unicode'. The docs are not precise but they were not for 2.4 either. Note the following case:

[...]

OK, but I assume you're not saying that the fact that the docs were broken in 2.4 implies they shouldn't be fixed now?

I would suggest revised wording, but I'm clearly confused about what actually goes on under the hood...

John

Nick Coghlan

21 Aug 2006

4:37 a.m.

John J Lee wrote:

>> And once the result has been promoted to unicode, __unicode__ is used directly:
>>
>>     print repr("%s%s" % (a(), a()))
>>     __str__
>>     accessing <__main__.a object at 0x00AF66F0>.__unicode__
>>     __str__
>>     accessing <__main__.a object at 0x00AF6390>.__unicode__
>>     __str__
>>     u'hihi'
>
> I don't understand this part. Why is __unicode__ called? Your example doesn't appear to show this happening "once [i.e., because?] the result has been promoted to unicode" -- if that were true, it would "stand to reason" <wink> that the interpreter would then conclude it should call __unicode__ for all remaining %s, and not bother with __str__.

It does try to call unicode directly, but because the example object doesn't supply __unicode__ it ends up falling back to __str__ instead.

The behaviour is clearer when the example object provides both methods:

    # Example (2.5b3)
    class a(object):
        def __str__(self):
            print "running __str__"
            return u'hi'
        def __unicode__(self):
            print "running __unicode__"
            return u'hi'

    print repr("%s%s" % (a(), a()))
    running __str__
    running __unicode__
    running __unicode__
    u'hihi'

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

John J Lee

22 Aug 2006

12:42 p.m.

On Mon, 21 Aug 2006, Nick Coghlan wrote:

> John J Lee wrote:
>>> And once the result has been promoted to unicode, __unicode__ is used directly:
>>>
>>>     print repr("%s%s" % (a(), a()))
>>>     __str__
>>>     accessing <__main__.a object at 0x00AF66F0>.__unicode__
>>>     __str__
>>>     accessing <__main__.a object at 0x00AF6390>.__unicode__
>>>     __str__
>>>     u'hihi'
>>
>> I don't understand this part. Why is __unicode__ called? Your example doesn't appear to show this happening "once [i.e., because?] the result has been promoted to unicode" -- if that were true, it would "stand to reason" <wink> that the interpreter would then conclude it should call __unicode__ for all remaining %s, and not bother with __str__.
>
> It does try to call unicode directly, but because the example object doesn't supply __unicode__ it ends up falling back to __str__ instead. The behaviour is clearer when the example object provides both methods:

[...]

If the interpreter is falling back from __unicode__ to __str__ (rather than the other way around, kind-of), that makes much more sense. I understood that __unicode__ was not provided, of course -- what wasn't clear to me was why the interpreter was calling/accessing those methods/attributes in the sequence it does.

Still not sure I understand what the third __str__ above comes from, but until I've thought it through again, that's my problem.

John

Nick Coghlan

23 Aug 2006

6:44 a.m.

John J Lee wrote:

> On Mon, 21 Aug 2006, Nick Coghlan wrote:
>> John J Lee wrote:
>>>> And once the result has been promoted to unicode, __unicode__ is used directly:
>>>>
>>>>     print repr("%s%s" % (a(), a()))
>>>>     __str__
>>>>     accessing <__main__.a object at 0x00AF66F0>.__unicode__
>>>>     __str__
>>>>     accessing <__main__.a object at 0x00AF6390>.__unicode__
>>>>     __str__
>>>>     u'hihi'
>>>
>>> I don't understand this part. Why is __unicode__ called? Your example doesn't appear to show this happening "once [i.e., because?] the result has been promoted to unicode" -- if that were true, it would "stand to reason" <wink> that the interpreter would then conclude it should call __unicode__ for all remaining %s, and not bother with __str__.
>>
>> It does try to call unicode directly, but because the example object doesn't supply __unicode__ it ends up falling back to __str__ instead. The behaviour is clearer when the example object provides both methods:
>
> [...]
>
> If the interpreter is falling back from __unicode__ to __str__ (rather than the other way around, kind-of), that makes much more sense. I understood that __unicode__ was not provided, of course -- what wasn't clear to me was why the interpreter was calling/accessing those methods/attributes in the sequence it does. Still not sure I understand what the third __str__ above comes from, but until I've thought it through again, that's my problem.

The sequence is effectively:

    x, y = a(), a()
    str(x)      # calls x.__str__
    unicode(x)  # tries x.__unicode__, fails, falls back to x.__str__
    unicode(y)  # tries y.__unicode__, fails, falls back to y.__str__

The trick in 2.5 is that the '%s' format code, instead of actually calling str(x), calls x.__str__() directly, and promotes the result to Unicode if x.__str__() returns a Unicode result.

I'll try to do something to clear up that section of the documentation before 2.5 final.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
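The call sequence Nick describes can be replayed with a small toy model. This is a sketch written for Python 3, not CPython's implementation: `Bytes2`, `Unicode2`, `to_unicode`, `percent_s`, `OnlyStr` and `Both` are all hypothetical names, and the choice to keep earlier byte results and restart conversion from the argument that triggered promotion is an assumption that the thread's two-argument examples do not distinguish.

```python
# Toy model only: every name below is hypothetical, not a CPython API.

class Bytes2(str):
    """Stand-in for Python 2's byte-oriented str."""


class Unicode2(str):
    """Stand-in for Python 2's unicode."""


def to_unicode(obj):
    """Model unicode(obj): prefer __unicode__, else fall back to __str__."""
    method = getattr(obj, "__unicode__", None)
    if method is None:
        method = obj.__str__
    return Unicode2(method())


def percent_s(args):
    """Model "%s%s..." % args under the 2.5 rules described in the thread."""
    parts = []
    for index, obj in enumerate(args):
        result = obj.__str__()  # called directly, not via str()
        if isinstance(result, Unicode2):
            # Promotion: redo this and all later arguments with unicode().
            # (Assumption: earlier byte results are kept as they are.)
            rest = [to_unicode(o) for o in args[index:]]
            return Unicode2("".join(parts + rest))
        parts.append(result)
    return Bytes2("".join(parts))


calls = []  # records which special method ran, in order


class OnlyStr:
    def __str__(self):
        calls.append("__str__")
        return Unicode2("hi")


class Both:
    def __str__(self):
        calls.append("__str__")
        return Unicode2("hi")

    def __unicode__(self):
        calls.append("__unicode__")
        return Unicode2("hi")


result_only_str = percent_s((OnlyStr(), OnlyStr()))
calls_only_str = list(calls)

calls.clear()
result_both = percent_s((Both(), Both()))
calls_both = list(calls)

print(calls_only_str)  # ['__str__', '__str__', '__str__']
print(calls_both)      # ['__str__', '__unicode__', '__unicode__']
```

With only __str__ defined, the model records three __str__ calls, which is where the "third __str__" in the original trace comes from: one direct call that triggers promotion, then one fallback call per argument. With both methods defined, it records one __str__ call followed by two __unicode__ calls, matching the 2.5b3 output quoted earlier in the thread.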
