[issue9196] Improve docs for string interpolation "%s" re Unicode strings
New submission from Craig McQueen <python@craig.mcqueen.id.au>:

I have just been trying to figure out how string interpolation works for "%s", when Unicode strings are involved. It seems it's a bit complicated, but the Python documentation doesn't really describe it. It just says %s "converts any Python object using str()".

Here is what I have found (I think), and it could be worth improving the documentation of this somehow.

Example 1: "%s" % test_object

From what I can tell, in this case:
1. test_object.__str__() is called.
2. If test_object.__str__() returns a string object, then that is substituted.
3. If test_object.__str__() returns a Unicode object (for some reason), then test_object.__unicode__() is called, then _that_ is substituted instead. The output string is turned into Unicode. This behaviour is surprising.

[Note that the call to test_object.__str__() is not the same as str(test_object), because the former can return a Unicode object without causing an error, while the latter, if it gets a Unicode object, will then try to encode('ascii') to a string, possibly generating a UnicodeEncodeError exception.]

Example 2: u"%s" % test_object

In this case:
1. test_object.__unicode__() is called, if it exists, and the result is substituted. The output string is Unicode.
2. If test_object.__unicode__() doesn't exist, then test_object.__str__() is called instead, converted to Unicode, and substituted. The output string is Unicode.

Example 3: "%s %s" % (u'unicode', test_object)

In this case:
1. The first substitution causes the output string to be Unicode.
2. It seems that (1) causes the second substitution to follow the same rules as Example 2. This is a little surprising.
----------
assignee: docs@python
components: Documentation
messages: 109516
nosy: cmcqueen1975, docs@python
priority: normal
severity: normal
status: open
title: Improve docs for string interpolation "%s" re Unicode strings
versions: Python 2.7
_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue9196>
_______________________________________
Changes by Ezio Melotti <ezio.melotti@gmail.com>:
----------
nosy: +ezio.melotti
stage: -> needs patch
Craig McQueen <python@craig.mcqueen.id.au> added the comment:

Another thing I discovered, for Example 1:
4. If test_object.__str__() returns a Unicode object (for some reason), and test_object.__unicode__() does not exist, then the Unicode value from the __str__() call is used as-is (no conversion to string, no encoding errors). This is also a little surprising [in this situation unicode(test_object) also returns the Unicode object returned by __str__() as-is, so I guess there's some consistency there].
Éric Araujo <merwok@netwok.org> added the comment:

I’m not sure how much effort should be put into a patch here, considering that the horrible bytes/text confusion and implicit conversion should stop in Python 3, and %-formatting is mildly deprecated. Ezio, what do you think?

Craig, could you attach your test_object class and test code? I wonder if the mixed behavior is still present in 3.x.
----------
nosy: +eric.araujo
Craig McQueen <python@craig.mcqueen.id.au> added the comment:

I should be able to attach my test code. But it is at my work, and I'm on holidays for 2 more weeks. Sorry 'bout that! I do assume that Python 3 greatly simplifies this.
Craig McQueen <python@craig.mcqueen.id.au> added the comment:

I'm attaching a file that I used (in Python 2.x). It's a little rough--I manually commented and uncommented various lines to see what would change under various circumstances. But at least you should be able to see what I was doing.
----------
Added file: http://bugs.python.org/file20334/class_str_unicode_methods.py
Ezio Melotti <ezio.melotti@gmail.com> added the comment: Python 3 checks the return types of __bytes__ and __str__, raising an error if it's not bytes and str respectively:
    str(C())
    TypeError: __str__ returned non-string (type bytes)
    bytes(C())
    TypeError: __bytes__ returned non-bytes (type str)
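The class C behind the output above is not shown in the thread. The following is a minimal sketch of what it might look like, assuming a class whose __str__ and __bytes__ deliberately return the wrong types; the helper error_message is hypothetical and only captures the exception text. It also tries "%s" % C(), which goes through str() in Python 3 and so fails the same way. The exact wording of the messages may vary between Python 3 releases.

```python
# Hypothetical reconstruction of class C (the original definition is not
# in the thread). Both methods return the wrong type on purpose.
class C:
    def __str__(self):
        return b"bytes"   # Python 3 requires str here

    def __bytes__(self):
        return "text"     # Python 3 requires bytes here


def error_message(func, obj):
    """Return the TypeError message raised by func(obj), or None if none is raised."""
    try:
        func(obj)
    except TypeError as exc:
        return str(exc)
    return None


str_error = error_message(str, C())
bytes_error = error_message(bytes, C())
# "%s" formatting calls str() on the argument, so it hits the same check.
format_error = error_message(lambda obj: "%s" % obj, C())

print(str_error)
print(bytes_error)
print(format_error)
```

All three calls raise TypeError under Python 3, so the silent str/unicode switching described for Python 2 has no direct equivalent there.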
The Python 2 doc for unicode() says[0]:
"""
For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.
"""

The doc for .__unicode__() says[1]:
"""
Called to implement unicode() built-in; should return a Unicode object. When this method is not defined, string conversion is attempted, and the result of string conversion is converted to Unicode using the system default encoding.
"""
This is consistent with the unicode() doc (but it doesn't mention that 'strict' is used). It also says that the method *should* return unicode, but it can also return a str that gets coerced by unicode().

The doc for .__str__() says[2]:
"""
Called by the str() built-in function and by the print statement to compute the “informal” string representation of an object. [...] The return value must be a string object.
"""
This is wrong because the return value can be unicode too (this was changed at some point; it used to be true in older versions).

That said, some of the behaviors described by Craig (e.g. __str__ that returns unicode) are not documented, and documenting them might save some confusion. However, these "weird" behaviors are most likely errors, and the fact that no exception is raised is just because Python 2 is not strict about str/unicode.

I think a better way to solve the problem is to document clearly how these methods should be used (i.e. whether __unicode__ should be preferred over __str__, whether it's necessary to implement both, what they should return, etc.).
[0]: http://docs.python.org/library/functions.html#unicode
[1]: http://docs.python.org/reference/datamodel.html#object.__unicode__
[2]: http://docs.python.org/reference/datamodel.html#object.__str__
Changes by Arfrever Frehtes Taifersar Arahesis <Arfrever.FTA@GMail.Com>:
----------
nosy: +Arfrever
Éric Araujo <merwok@netwok.org> added the comment:

More info on this thread: http://mail.python.org/pipermail/python-dev/2006-December/070237.html
Changes by Ezio Melotti <ezio.melotti@gmail.com>:
----------
nosy: +eric.smith
Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:

Python 2.7 is no longer supported.
----------
nosy: +serhiy.storchaka
resolution: -> out of date
stage: needs patch -> resolved
status: open -> closed
participants (5)
- Arfrever Frehtes Taifersar Arahesis
- Craig McQueen
- Ezio Melotti
- Serhiy Storchaka
- Éric Araujo