[Python-Dev] String formatting / unicode 2.5 bug?

Sun Aug 20 14:45:08 CEST 2006

On Sun, 20 Aug 2006, Nick Coghlan wrote:

> John J Lee wrote:
>>  Is this a bug?
>
> I don't believe so - the string formatting documentation states that the 
> result will be unicode if either the format string is unicode or any of the 
> objects passed to a %s format code is unicode.
>
> That latter part has just been extended to include any object that returns 
> Unicode from __str__, instead of being restricted to actual Unicode 
> instances.
>
> Note that the following behaves the same way regardless of whether you use 
> 2.4 or 2.5:
> "%s" % 'hi'
> "%s" % u'hi'

Given that, the following wording should be changed:

http://docs.python.org/lib/typesseq-strings.html

Conversion  Meaning                                           Notes
...
s           String (converts any python object using str()).  (4)
...
(4) If the object or format provided is a unicode string, the resulting 
string will also be unicode.

Note (4) says that the result will be unicode, but it doesn't say how that 
comes about in this case.  The case is confusing because the docs claim 
that string formatting with %s "converts ... using str()", and yet 
str(a()) returns a bytestring.  Does it *really* use str(), or just 
__str__?  Surely the latter, judging by the observed behaviour (I haven't 
read the C source).
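
To make the str() vs. __str__ question concrete, here is a minimal sketch. 
The class a is not shown in this message, so the definition below is a 
hypothetical stand-in (only the name a and the u'hi' result come from the 
quoted session), and describe() is an illustrative helper, not anything 
from the docs.  It reports the type produced by str(obj), by a direct 
obj.__str__() call, and by "%s" % obj.  Going by this thread, on 2.5 I 
would expect the first to be a bytestring and the other two unicode; on 
Python 3, where there is only one text type, all three report str.

```python
class a(object):
    """Hypothetical stand-in for the class `a` in the quoted session."""

    def __str__(self):
        # Deliberately return a unicode object from __str__.
        return u'hi'


def describe(obj):
    """Return the type names produced by three ways of stringifying obj."""
    return {
        'str()': type(str(obj)).__name__,          # the builtin
        '__str__()': type(obj.__str__()).__name__,  # the method, called directly
        '%s': type("%s" % obj).__name__,            # string formatting
    }


if __name__ == '__main__':
    for how, name in sorted(describe(a()).items()):
        print("%-10s -> %s" % (how, name))
```

If the '%s' row matches the '__str__()' row rather than the 'str()' row, 
then the "converts ... using str()" wording in the table is misleading for 
this case.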

FWIW, this change broke epydoc (fails with an AssertionError -- so perhaps 
without the assert it would still "work", dunno).

> And once the result has been promoted to unicode, __unicode__ is used 
> directly:
>
> >>> print repr("%s%s" % (a(), a()))
> __str__
> accessing <__main__.a object at 0x00AF66F0>.__unicode__
> __str__
> accessing <__main__.a object at 0x00AF6390>.__unicode__
> __str__
> u'hihi'

I don't understand this part.  Why is __unicode__ called?  Your example 
doesn't appear to show this happening "once [i.e., because?] the result 
has been promoted to unicode" -- if that were true, it would "stand to 
reason" <wink> that the interpreter would then conclude it should call
__unicode__ for all remaining %s, and not bother with __str__.  If OTOH 
__unicode__ is called because __str__ returned a unicode object, it makes
(very slightly) more sense that it goes through the same 
__str__-then-__unicode__ rigmarole for each object on the RHS of the %.

But none of that seems to make a huge amount of sense.  I've now found the 
September 2004 discussion of this, and I'm none the wiser.

John