Format strings, Unicode, and Py2.7: need clarification
Hi,

While cleaning up some code during Python 2 -> Python 3 porting, I switched some code to use str.format(), and I found this behavior:

Python 2.7
=========
a = "%s" % "hi"
b = "%s" % u"hi"
c = u"%s" % "hi"
d = "{}".format("hi")
e = "{}".format(u"hi")
f = u"{}".format("hi")

type(a) == str
type(b) == unicode
type(c) == unicode
type(d) == str
type(e) == str
type(f) == unicode

My intuition would lead me to believe that type(b) and type(e) would be the same (unicode), but they are not. The confusion for me is: why is type(e) str, and not unicode?

Can someone clarify this for me?

I understand that in Python 3, all these cases are str, so it is not as big a problem there, but I am trying to keep things working on Python 2.7.

Thanks.
--
Craig
On Wed, May 17, 2017 at 02:41:29PM -0700, Craig Rodrigues wrote:
e = "{}".format(u"hi") [...] type(e) == str
The confusion for me is why is type(e) of type str, and not unicode?
I think that's one of the reasons why the Python 2.7 string model is (1) convenient to those using purely ASCII, but (2) ultimately broken. You can see why it's broken if you do this:

py> "{}".format(u"hiµ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 2: ordinal not in range(128)

So it tries to encode the Unicode string to ASCII, and if that succeeds, format returns a byte str. I'm not sure if that was a deliberate design choice for format, or just a side-effect of it calling str() on its arguments by default.

I'm not sure if I've answered your question or not. Are you looking for justification of this misfeature, or an explanation of the historical reasons why it exists, or something else?

(If you're looking for the same behaviour in Python 3 and 2.7, probably the best thing you can do is just religiously use unicode strings u'' in both. You might try:

from __future__ import unicode_literals

in 2.7, but I'm not sure that's enough.)

--
Steve
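A minimal sketch of the "use u'' in both" suggestion above, assuming the goal is a text (not byte) result on both Python 2.7 and Python 3.3+. The names `text_type`, `template`, `value` and `result` are illustrative only and do not come from the thread.

```python
# The u'' prefix is accepted by Python 2.7 and by Python 3.3+,
# so this file runs unchanged on both.
text_type = type(u"")        # unicode on 2.7, str on 3

template = u"{}"             # the template itself is text, never a byte str
value = u"hi\u00b5"          # non-ASCII text, written as an escape so no
                             # source-encoding declaration is needed

# Because the template is text, format() returns text on both versions and
# no implicit ASCII encode of the argument is attempted.
result = template.format(value)

print(type(result) is text_type)
```

The key point is that the type of the template, not the type of the argument, decides the result type of `.format()` on 2.7, so making every template a u'' literal sidesteps the implicit ASCII encode shown in the traceback.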
On May 17, 2017, at 2:41 PM, Craig Rodrigues
wrote: Hi,
My intuition would lead me to believe that type(b) and type(e) would be the same (unicode), but they are not. The confusion for me is why is type(e) of type str, and not unicode?
Can someone clarify this for me?
I think it's because I wanted to return str if possible, and didn't want to find out that one of the calls to __format__ returned unicode, and then go back and convert all of the previous results to unicode from str. And, I guess we didn't consider it important enough at the time.

Eric.
I understand that in Python 3, all these cases are str, so it is not as big a problem there, but I am trying to keep things working on Python 2.7.
Thanks.
--
Craig
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/eric%2Ba-python-dev%40tru...
Because `.format()` is a method on an instantiated `str` object in e and so
must return the same type so additional str methods could be stacked on
after it, like `.format(u'hi').decode()`. The % string interpolation, by
contrast, is a binary operation, so, like addition, the more general type
can be used for the return value, analogous to `1 + 2.0` returning a float.
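The numeric analogy can be checked directly. This small sketch only illustrates the `1 + 2.0` comparison made above; it says nothing about how `%` is implemented for strings, and the variable name `int_plus_float` is made up for the example.

```python
# A binary operator may return the more general of its two operand types.
int_plus_float = 1 + 2.0

# The int operand is widened, so the result is a float, not an int.
print(type(int_plus_float).__name__)
```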
--Hobson
(503) 974-6274
gh https://github.com/hobson/ twtr https://twitter.com/hobsonlane li
https://www.linkedin.com/in/hobsonlane g+
http://plus.google.com/+HobsonLane/ so
http://stackoverflow.com/users/623735/hobs
On Thu, May 18, 2017, at 01:14, Hobson Lane wrote:
Because `.format()` is a method on an instantiated `str` object in e and so must return the same type
That it *must* return the same type is overstating the matter. Split returns a list (and, rather like %, the list is of unicode or str objects depending on the argument). Join will return a unicode object if any of the elements of the sequence are unicode.

I was honestly surprised though to see that % returns unicode when formatting a unicode value, since my mental model of %s was more like {!s} - call str() on whatever object is at the given position in the right-hand argument.

This kind of ad hoc implementation decision (format always returns str, other methods can return unicode, ljust/rjust refuse to accept a unicode character argument) is what Python 3 moved away from.
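For contrast with the 2.7 behaviour described above, here is a small sketch of how Python 3 handles the same kind of mixing. It assumes Python 3 only (on 2.7 both items would be plain str and no error would occur); `pieces`, `mixed_error` and `joined` are names invented for the example.

```python
# Python 3 only: one bytes item and one text item in the same sequence.
pieces = [b"hi", "there"]

# Python 3 refuses to guess an encoding, so joining mixed types fails
# instead of silently promoting the result as 2.7's join did.
try:
    " ".join(pieces)
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None

# The conversion has to be spelled out, with an explicit encoding.
joined = " ".join(
    p.decode("ascii") if isinstance(p, bytes) else p
    for p in pieces
)

print(type(mixed_error).__name__)
print(joined)
```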
Yeah, *must* probably should have been *should (unless indicated by the
method name)*. I forgot about the many methods that don't. But the names of
these methods, like `split()`, imply the types that they return.
Your example reminded me of the convention (outside Python core) to name
getters and setters so that they imply their returned type, and similarly
for conversion methods like `to_*()`. But the name `format`, in my mind,
implies that it only changes the *contents* of the str, rather than
morphing its type.
The % example surprised me too, but it clicked for me when I realized % is
a binary operator and not a method.
--Hobson
participants (5)
- Craig Rodrigues
- Eric V. Smith
- Hobson Lane
- Random832
- Steven D'Aprano