Backporting PEP 3101 to 2.6

(I'm posting to python-dev because this isn't strictly 3.0 related. Hopefully most people read it in addition to python-3000.)

I'm working on backporting the changes I made for PEP 3101 (Advanced String Formatting) to the trunk, in order to meet the pre-PyCon release date for 2.6a1. I have a few questions about how I should handle str/unicode. 3.0 was pretty easy, because everything was unicode.

1: How should the builtin format() work? It takes 2 parameters, an object o and a string s, and returns o.__format__(s). If s is None, it returns o.__format__(empty_string). In 3.0, the empty string is of course unicode. For 2.6, should I use u'' or ''?

2: In 3.0, object.__format__() is essentially this:

    class object:
        def __format__(self, format_spec):
            return format(str(self), format_spec)

In 2.6, I assume it should be the equivalent of:

    class object:
        def __format__(self, format_spec):
            if isinstance(format_spec, str):
                return format(str(self), format_spec)
            elif isinstance(format_spec, unicode):
                return format(unicode(self), format_spec)
            else:
                error

Does that seem right?

3: Every overridden __format__() method is going to have to check for string or unicode, just like object.__format__() does, and return either a string or unicode object, appropriately. I don't see any way around this, but I'd like to hear any thoughts. I guess there aren't all that many __format__ methods that will be implemented, so this might not be a big burden. I'll of course implement the built-in ones.

Thanks in advance for any insights.

Eric.
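For readers who want to try question 3 out, here is a minimal sketch of an overridden __format__() that returns the same string type as its specifier. The class Point and the compatibility shim are illustrative assumptions, not part of the patch; on 3.x the shim makes both branches equivalent because every string is already unicode.

```python
# Sketch only: Point is a hypothetical class, not code from the backport.
try:
    unicode  # exists on Python 2.x
except NameError:
    unicode = str  # Python 3.x: every string is already unicode


class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        return "(%s, %s)" % (self.x, self.y)

    def __format__(self, format_spec):
        # Return the same string type as the format specifier.
        if isinstance(format_spec, unicode):
            return format(unicode(self), format_spec)
        elif isinstance(format_spec, str):
            return format(str(self), format_spec)
        raise TypeError("format_spec must be str or unicode")


p = Point(1, 2)
padded = format(p, ">10")  # right-align in a field of width 10
plain = format(p)          # no specifier: treated as an empty string
```

Here padded is the six-character text "(1, 2)" preceded by four spaces, and plain is just "(1, 2)".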

On 2008-01-10 14:31, Eric Smith wrote:
Since this is a new feature, why bother with strings at all (even in 2.6) ? Use Unicode throughout and be done with it.
-- Marc-Andre Lemburg

M.-A. Lemburg wrote:
I was hoping someone would say that! It would certainly make things much easier. But for my own selfish reasons, I'd like to have str.format() work in 2.6. Other than the issues I raised here, I've already done the vast majority of the work for the code to support either string or unicode. For example, I put most of the implementation in Objects/stringlib, so I can include it either as string or unicode. But I can live with unicode only if that's the consensus. Eric.

On Jan 10, 2008, at 9:07 AM, M.-A. Lemburg wrote:
+1

-Barry

Eric Smith wrote:
I just re-read PEP 3101, and it doesn't mention this behavior with None. The way the code actually works is that the specifier is optional, and if it isn't present then it defaults to an empty string. This behavior isn't mentioned in the PEP, either. This feature came from a request from Talin[0]. We should either add this to the PEP (and docs), or remove it. If we document it, it should mention the 2.x behavior (as other places in the PEP do). If we removed it, it would remove the one place in the backport that's not just hard, but ambiguous. I'd just as soon see the feature go away, myself.
The PEP actually mentions that this is how 2.x will have to work. So I'll go ahead and implement it that way, on the assumption that getting string support into 2.6 is desirable.

Eric.

[0] http://mail.python.org/pipermail/python-3000/2007-August/010089.html

On Jan 10, 2008 9:57 AM, Eric Smith <eric+python-dev@trueblade.com> wrote:
IIUC, the 's' argument is the format specifier. Format specifiers are written in a very conservative character set, so I'm not sure it matters. Or are you assuming that the *type* of 's' also determines the type of the output? I may be in the minority here, but I think I like having a default for 's' (as implemented -- the PEP ought to be updated) and I also think it should default to an 8-bit string, assuming you support 8-bit strings at all -- after all in 2.x 8-bit strings are the default string type (as reflected by their name, 'str').
I think it is. (But then I still live in a predominantly ASCII world. :-) For data types whose output uses only ASCII, would it be acceptable if they always returned an 8-bit string and left it up to the caller to convert it to Unicode? This would apply to all numeric types. (The date/time types have a strftime() style API which means the user must be able to specify Unicode.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Yes, 's' is the format specifier. I should have used its actual name. I am saying that the type of 's' determines the type of the output. Maybe that's a needless assumption for the builtin format(), since it doesn't inspect the value of 's' (other than to verify its type). But for ''.format() and u''.format(), I was thinking it will be true (but see below). It just seems weird to me that the result of format(3, u'd') would be '3', not u'3'.
As long as it's defined, I'm okay with it. I think making the 2.6 default be an empty str is reasonable.
I live in that same world, which is why I started implementing this to begin with! I've always been more interested in the ascii version for 2.6 than for the 3.0 unicode version. Doing it first in 3.0 was my way of getting it into 2.6.
I guess in str.format() I could require the result of format(obj, format_spec) to be a str, and in unicode.format() I could convert it to be unicode, which would either succeed or fail. I think all I need to do is have the numeric formatters work with both unicode and str format specifiers, and always return str results. That should be doable. As you say, the format specifiers for the numerics are restricted to 8-bit strings, anyway.

Now that I think about it, str.__format__() will also need to accept unicode and produce a str, for this to work:

    u"{0}{1}{2}".format('a', u'b', 3)

I'll give these ideas a shot and see how far I get. Thanks for the feedback!

Eric.
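As an illustration of the goal stated here, the snippet below spells out the results one would expect from the mixed call. This is the expected behaviour of the design, not output captured from the patch; the variable names are only for illustration.

```python
# Expected behaviour of the design described above, not a transcript.
mixed = u"{0}{1}{2}".format('a', u'b', 3)  # unicode template, mixed arguments
narrow = "{0}-{1}".format('a', 3)          # str template, str and int arguments
```

The unicode template should yield u'ab3', with the str argument and the str result of the int formatter both promoted, while the str template yields 'a-3'.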

Guido van Rossum wrote:
To elaborate on this a bit (and handwaving a lot of important details out of the way) do you mean something like the following for the builtin format?:

    def format(obj, fmt_spec=None):
        if fmt_spec is None:
            fmt_spec = ''
        result = obj.__format__(fmt_spec)
        if isinstance(fmt_spec, unicode):
            if isinstance(result, str):
                result = unicode(result)
        return result

Cheers,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
http://www.boredomandlaziness.org

Eric Smith wrote:
I'm finally getting around to finishing this up. The approach I've taken for int, long, and float, is that they take either unicode or str format specifiers, and always return str results. The builtin format() deals with converting str to unicode, if the format specifier was originally unicode. This all works great. It allows me to easily implement both ''.format and u''.format taking int, long, and float parameters.

I'm now working on datetime. The __format__ method is really just a wrapper around strftime. I was assuming (or rather hoping) that strftime does the right thing with unicode and str (unicode in = unicode out, str in = str out). But it turns out strftime doesn't accept unicode:

    $ ./python
    Python 2.6a0 (trunk:60845M, Feb 15 2008, 21:09:57)
    [GCC 4.1.2 20070626 (Red Hat 4.1.2-13)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
As part of this task, I'm really not up to the job of changing strftime to support both str and unicode inputs. So I think I'll put all of the __format__ code in place to support it if and when strftime supports unicode. In the meantime, it won't be possible for u''.format to work with datetime objects.
The bad error message is a result of __format__ passing on unicode to strftime. There are, of course, various ugly ways to work around this involving nested format calls. Maybe I'll extend strftime to unicode for the PyCon sprint. Eric.

Eric Smith wrote:
I don't know if this fits your definition of "ugly workaround", but what if datetime.__format__ did something like:

    def __format__(self, spec):
        encoding = None
        if isinstance(spec, unicode):
            encoding = 'utf-8'
            spec = spec.encode(encoding)
        result = strftime(spec, self)
        if encoding is not None:
            result = result.decode(encoding)
        return result

Cheers,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
http://www.boredomandlaziness.org

* Nick Coghlan wrote:
Note that hardcoding utf-8 is a bad guess here as strftime(3) emits locale strings, so decoding will easily fail. I guess a clean and complete solution (besides re-implementing the whole thing) would be to resolve each single format character with strftime, decode according to the locale and re-assemble the result string piece by piece. Doh!

nd
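The piece-by-piece idea can be sketched roughly as follows. Everything here is an assumption made for illustration: strftime_piecewise, the raw_strftime callback (a stand-in for a byte-oriented strftime) and fake_raw_strftime are hypothetical names, only single-character directives are handled, and the caller is assumed to know the locale's encoding (for example from locale.getpreferredencoding()).

```python
import re

# One two-character directive ("%Y", "%%", ...), a run of literal text,
# or a stray "%" at the very end of the format.
_PIECES = re.compile(r"%.|[^%]+|%$", re.DOTALL)


def strftime_piecewise(fmt, t, encoding, raw_strftime):
    """Format each directive on its own and decode it with `encoding`.

    `raw_strftime(directive, t)` is a hypothetical callback that returns
    bytes in the locale's encoding, like a byte-oriented strftime would.
    """
    parts = []
    for piece in _PIECES.findall(fmt):
        if piece.startswith("%") and len(piece) == 2:
            parts.append(raw_strftime(piece, t).decode(encoding))
        else:
            parts.append(piece)  # literal text never passes through strftime
    return u"".join(parts)


def fake_raw_strftime(directive, t):
    # Pretend a Latin-1 locale produced these bytes (test double only).
    table = {"%B": u"M\xe4rz".encode("latin-1"), "%Y": b"2008", "%%": b"%"}
    return table[directive]


text = strftime_piecewise(u"%B %Y (100%%)", None, "latin-1", fake_raw_strftime)
```

With the fake callback, decoding as latin-1 gives u"M\xe4rz 2008 (100%)", whereas passing "utf-8" as the encoding raises UnicodeDecodeError on the month name, which is the failure mode described above.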

André Malo wrote:
That's along the lines of what I was thinking. strftime already does some of this to support %[zZ]. But now that I look at time.strftime in py3k, it's converting the entire unicode string to a char string with PyUnicode_AsString, then converting back with PyUnicode_Decode.

* Eric Smith wrote:
Looks wrong to me, too... :-)

nd

André Malo wrote:
I don't understand Unicode encoding/decoding well enough to describe this bug, but I admit it looks suspicious. Could someone who does understand it open a bug against 3.0 (hopefully with an example that fails)? The bug should also mention that 2.6 avoids this problem entirely by not supporting unicode with strftime or datetime.__format__, but 2.6 could probably leverage whatever solution is developed for 3.0. Thanks.

Nick Coghlan wrote:
Isn't unicode idempotent? Couldn't

    if isinstance(result, str):
        result = unicode(result)

avoid repeating in Python a test already made in C by re-spelling it as

    result = unicode(result)

or have you hand-waved away important details that mean the test really is required?

regards
Steve

-- Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Steve Holden wrote:
This code is written in C. It already has a check to verify that the return from __format__ is either str or unicode, and another check that fmt_spec is str or unicode. So doing the conversion only if result is str and fmt_spec is unicode would be a cheap decision. Good catch, though. I wouldn't have thought of it, and there are parts that are written in Python, so maybe I can leverage this elsewhere. Thanks! Eric.
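Putting the pieces from this thread together, the wrapper logic might look like the following when written out in Python. This is only a sketch of the behaviour described in the messages above, since the real code is in C; format_sketch and Celsius are hypothetical names, and the shim lets the example run on 3.x, where the promotion step is a no-op.

```python
try:
    unicode  # exists on Python 2.x
except NameError:
    unicode = str  # Python 3.x: every string is already unicode


class Celsius(object):
    """Hypothetical numeric-like type whose formatted output is ASCII."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, format_spec):
        # Accept either kind of specifier, but always return a native str.
        return format(self.degrees, str(format_spec)) + "C"


def format_sketch(obj, format_spec=""):
    """Python rendering of the checks and conversion discussed above."""
    if not isinstance(format_spec, (str, unicode)):
        raise TypeError("format_spec must be str or unicode")
    result = obj.__format__(format_spec)
    if not isinstance(result, (str, unicode)):
        raise TypeError("__format__ must return str or unicode")
    # Promote only when a str result meets a unicode specifier.
    if isinstance(format_spec, unicode) and isinstance(result, str):
        result = unicode(result)
    return result


narrow = format_sketch(Celsius(21.5), ".1f")
wide = format_sketch(Celsius(21.5), u".1f")
```

Both calls produce the text 21.5C; on 2.x the second would come back as unicode because its specifier was unicode.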

participants (7): André Malo, Barry Warsaw, Eric Smith, Guido van Rossum, M.-A. Lemburg, Nick Coghlan, Steve Holden