
I've been exploring how to customize our thousands separators and decimal separators and wanted to offer-up an idea. It arose when I was looking at Java's DecimalFormat class and its customization tool DecimalFormatSymbols http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html . Also, I looked at how regular expression patterns provide options to change the meaning of its special characters using (?iLmsux).

I. Simplest version -- Translation pairs

    format(1234, "8,.1f") --> ' 1,234.0'
    format(1234, "(,_)8,.1f") --> ' 1_234.0'
    format(1234, "(,_)(.,)8,.1f") --> ' 1_234,0'

This approach is very easy to implement and it doesn't make life difficult for the parser which can continue to look for just a comma and period with their standardized meaning. It also fits nicely in our current framework and doesn't require any changes to the format() builtin. Of all the options, I find this one to be the easiest to read.

Also, this version makes it easy to employ a couple of techniques to factor-out formatting decisions. Here's a gettext() style approach.

    def _(s):
        return '(,.)(.,)' + s
    . . .
    format(x, _('8.1f'))

Here's another approach using implicit string concatenation:

    DEB = '(,_)' # style for debugging
    EXT = '(, )' # style for external display
    . . .
    format(x, DEB '8.1f')
    format(y, EXT '8d')

There are probably many ways to factor-out the decision. We don't need to decide which is best, we just need to make it possible.

One other thought, this approach makes it possible to customize all of the characters that are currently hardwired (including zero and space padding characters and the 'E' or 'e' exponent symbols).

II. Javaesque version -- FormatSymbols object

This is essentially the same idea as previous one but involves modifying the format() builtin to accept a symbols object and pass it to __format__ methods. This moves the work outside of the format string itself:

    DEB = FormatSymbols(comma='_')
    EXT = FormatSymbols(comma=' ')
    . . .
    format(x, '8.1f', DEB)
    format(y, '8d', EXT)

The advantage is that this technique is easily extendable beyond simple symbol translations and could possibly allow specification of grouping sizes in hundreds and whatnot. It also looks more like a real program as opposed to a formatting mini-language. The disadvantage is that it is likely slower and it requires mucking with the currently dirt simple format() / __format__() protocol. It may also be harder to integrate with existing __format__ methods which are currently very string oriented.

Raymond
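The translation-pairs idea can be tried out today with a small wrapper around the existing format() builtin. This is only a sketch: format_pairs is a hypothetical helper name, and the "(xy)" prefix syntax is the proposed one, not something format() accepts. It strips the leading pairs, formats with the standard spec, and then swaps characters in the result.

```python
import re

# Hypothetical syntax from the proposal: zero or more "(xy)" pairs in front
# of an ordinary format spec, each meaning "emit y wherever x would appear".
_PAIR = re.compile(r"\((.)(.)\)")


def format_pairs(value, spec):
    """Format value, honoring leading translation pairs such as '(,_)(.,)'."""
    table = {}
    pos = 0
    while True:
        match = _PAIR.match(spec, pos)
        if match is None:
            break
        table[ord(match.group(1))] = match.group(2)
        pos = match.end()
    # The remaining spec keeps the standard meaning of ',' and '.'.
    text = format(value, spec[pos:])
    # str.translate applies every swap in one pass, so '(,.)(.,)' is safe.
    return text.translate(table)


print(format_pairs(1234, "(,_)(.,)8,.1f"))
```

Because the swap happens on the finished string, a pair also affects padding or sign characters if they happen to match, which is both the flexibility and the risk of this approach.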

Raymond Hettinger wrote:
I've been exploring how to customize our thousands separators and decimal separators and wanted to offer-up an idea. It arose when I was looking at Java's DecimalFormat class and its customization tool DecimalFormatSymbols http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html . Also, I looked at how regular expression patterns provide options to change the meaning of its special characters using (?iLmsux).
I. Simplest version -- Translation pairs
    format(1234, "8,.1f") --> ' 1,234.0'
    format(1234, "(,_)8,.1f") --> ' 1_234.0'
    format(1234, "(,_)(.,)8,.1f") --> ' 1_234,0'
This approach is very easy to implement and it doesn't make life difficult for the parser which can continue to look for just a comma and period with their standardized meaning. It also fits nicely in our current framework and doesn't require any changes to the format() builtin. Of all the options, I find this one to be the easiest to read.
I strongly prefer suffix to prefix modification. The format gives the overall structure of the output, the rest are details, which a reader may not care so much about.
Also, this version makes it easy to employ a couple of techniques to factor-out
These techniques apply to any "augment the basic format with an affix" method.
formatting decisions. Here's a gettext() style approach.
    def _(s):
        return '(,.)(.,)' + s
    . . .
    format(x, _('8.1f'))
Here's another approach using implicit string concatenation:
    DEB = '(,_)' # style for debugging
    EXT = '(, )' # style for external display
    . . .
    format(x, DEB '8.1f')
    format(y, EXT '8d')
There are probably many ways to factor-out the decision. We don't need to decide which is best, we just need to make it possible.
One other thought, this approach makes it possible to customize all of the characters that are currently hardwired (including zero and space padding characters and the 'E' or 'e' exponent symbols).
Any "augment the format with affixes" method should do the same. I prefer at most a separator (;) between affixes rather than fences around them. I also prefer mnemonic key letters to mark the start of each affix, such as in Guido's quick suggestion: Thousands, Decimal_point, Exponent, Grouping, Pad_char, Money, and so on. But I do not think '=' is needed. Since the replacement will almost always be a single non-capital-letter char, I am not sure a separator is even needed, but it would make parsing much easier. G would be followed by one or more digits indicating grouping from Decimal_point leftward, with the last repeated. If grouping by 9s is not large enough, allow a-f to get grouping up to 15 ;-). Example above would be format(1234, '8.1f;T.;P,')
II. Javaesque version -- FormatSymbols object
This is essentially the same idea as previous one but involves modifying the format() builtin to accept a symbols object and pass it to __format__ methods. This moves the work outside of the format string itself:
    DEB = FormatSymbols(comma='_')
    EXT = FormatSymbols(comma=' ')
    . . .
    format(x, '8.1f', DEB)
    format(y, '8d', EXT)
The advantage is that this technique is easily extendable beyond simple symbol translations and could possibly allow specification of grouping sizes in hundreds and whatnot. It also looks more like a real program as opposed to a formatting mini-language. The disadvantage is that it is likely slower and it requires mucking with the currently dirt simple format() / __format__() protocol. It may also be harder to integrate with existing __format__ methods which are currently very string oriented.
I suggested in the thread on exposing the format parse result that the resulting structure (dict or named tuple) could become an alternative, wordy interface to the format functions. I think the mini-language itself should stay mini.

Terry Jan Reedy
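A suffix scheme with key letters could be prototyped along the following lines. This is a rough sketch under several assumptions: format_affixed is a hypothetical name, only the T (thousands) and D (decimal point) keys are handled, the base spec must still contain ',' to request grouping, and ';' cannot itself be used as a fill or replacement character.

```python
def format_affixed(value, spec):
    """Format value with a standard spec followed by ';'-separated affixes.

    Each affix is a key letter plus its replacement text, for example
    'T.' (thousands separator becomes '.') or 'D,' (decimal point becomes ',').
    """
    base, *affixes = spec.split(";")
    thousands, decimal = ",", "."
    for affix in affixes:
        key, replacement = affix[:1], affix[1:]
        if key == "T":
            thousands = replacement
        elif key == "D":
            decimal = replacement
        else:
            raise ValueError(f"unknown affix key {key!r}")
    # Format with the standard symbols first, then swap them in one pass.
    text = format(value, base)
    return text.translate({ord(","): thousands, ord("."): decimal})


print(format_affixed(1234, "8,.1f;T.;D,"))
```

Keeping the base spec first means a reader sees the overall shape of the output before the symbol details, which is the argument made above for suffixes.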

[Terry Reedy]
I strongly prefer suffix to prefix modification.
Given the way that the formatting parsers are written, I think suffix would work just as well as prefix. Also, your idea may help with the mental parsing as well (because the rest of the format string uses the untranslated symbols so that translation pairs should be at the end).
Also, this version makes it easy to employ a couple of techniques to factor-out
These techniques apply to any "augment the basic format with an affix" method.
Right.
I also prefer, mnemonic key letters to mark the start of each affix, ... format(1234, '8.1f;T.;P,')
I think it's better to be explicit that periods are translated to commas and commas to periods. Introducing a new letter just adds more memory load and makes the notation more verbose. In the previous newsgroup discussions, people reacted badly to letter mnemonics, finding them to be so ugly that they would refuse to use them (remember the early proposal of format(x, "8T,.f")).

Also, the translation pairs approach lets you swap other hardwired characters like the E or a 0 pad.

Raymond

Raymond Hettinger wrote:
[Terry Reedy]
I strongly prefer suffix to prefix modification.
Given the way that the formatting parsers are written, I think suffix would work just as well as prefix. Also, your idea may help with the mental parsing as well (because the rest of the format string uses the untranslated symbols so that translation pairs should be at the end).
Also, this version makes it easy to employ a couple of techniques to factor-out
These techniques apply to any "augment the basic format with an affix" method.
Right.
I also prefer, mnemonic key letters to mark the start of each affix, ... format(1234, '8.1f;T.;P,')
This should have been format(1234, '8.1f;T.;D,')
I think it's better to be explicit that periods are translated to commas and commas to periods. Introducing a new letter just adds more memory load and makes the notation more verbose. In the previous newsgroup discussions, people reacted badly to letter mnemonics, finding them to be so ugly that they would refuse to use them (remember the early proposal of format(x, "8T,.f")).
Also, the translation pairs approach lets you swap other hardwired characters like the E or a 0 pad.
So does the key letter approach. The pairs approach does not allow easy alteration of the grouping spec, because there is no hard-wired char to swap, unless you would allow something cryptic like {3,(4,2,3)) (for India, I believe).

Even with the translation pair, one could use a separator rather than fences.

    format(1234, '8.1f;T.;D,') # could be
    format(1234, '8,.1f;,.;.,')

The two approaches could even be mixed by using a char only when clearer, such as 'G' for grouping instead of '3' for the existing grouping value.

I think whatever scheme is adopted should be complete.

Terry Jan Reedy
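To make the grouping question concrete, here is a small sketch of variable-size grouping. It borrows the list convention that locale.localeconv() uses for its 'grouping' entry (sizes counted from the right, 0 meaning "repeat the previous size", locale.CHAR_MAX meaning "stop grouping"). The function name apply_grouping is an assumption for illustration and is not part of any proposal in this thread.

```python
import locale


def apply_grouping(digits, grouping, sep):
    """Group a string of integer digits right-to-left, localeconv-style."""
    groups = []
    size = None
    sizes = iter(grouping)
    while digits:
        step = next(sizes, 0)      # an exhausted list behaves like a trailing 0
        if step == locale.CHAR_MAX:
            break                  # no further grouping
        if step != 0:
            size = step            # 0 means: keep using the previous size
        if size is None:
            break                  # grouping list was empty or started with 0
        groups.append(digits[-size:])
        digits = digits[:-size]
    if digits:
        groups.append(digits)      # whatever is left stays ungrouped
    return sep.join(reversed(groups))


print(apply_grouping("1234567", [3, 0], ","))      # groups of three
print(apply_grouping("1234567", [3, 2, 0], ","))   # three, then twos
```

A 3-then-2 list covers the Indian numbering style, which a plain character swap cannot express.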

Mark Dickinson's test code suggested a good, extensible approach to the problem. Here's the idea in a nutshell:

    format(value, format_spec='', conventions=None)
        'calls value.__format__(format_spec, conventions)'

Where conventions is an optional dictionary with formatting control values. Any value object can accept custom controls, but the names for standard ones would be taken from the standards provided by localeconv():

    {
     'decimal_point': '.',
     'grouping': [3, 0],
     'negative_sign': '-',
     'positive_sign': '',
     'thousands_sep': ','}

This would let you store several locales using localeconv() and use them at will, thus solving the global variable and threading problems with locale:

    import locale
    loc = locale.getlocale() # get current locale
    locale.setlocale(locale.LC_ALL, 'de_DE')
    DE = locale.localeconv()
    locale.setlocale(locale.LC_ALL, 'en_US')
    US = locale.localeconv()
    locale.setlocale(locale.LC_ALL, loc) # restore saved locale
    . . .
    format(x, '8,.f', DE)
    format(y, '8,d', US)

It also lets you write your own conventions on the fly:

    DEB = dict(thousands_sep='_') # style for debugging
    EXT = dict(thousands_sep=',') # style for external display
    . . .
    format(x, '8.1f', DEB)
    format(y, '8d', EXT)

Raymond

Where conventions is an optional dictionary with formatting control values. Any value object can accept custom controls, but the names for standard ones would be taken from the standards provided by localeconv():
Forgot to mention that this approach makes life easier on people writing __format__ methods because it lets them re-use the work they've already done to implement the "n" type specifier.

Also, this approach is very similar to the one taken in Java with its DecimalFormatSymbols object. The main differences are that they use a custom class instead of a dictionary, that we would use standard names that work well with localeconv(), and that our approach is extensible for use with custom formatters (i.e. the datetime module could have its own set of key/value pairs for formatting controls).

Raymond
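The conventions-dictionary idea can be approximated without changing the builtin. The sketch below is an emulation only: format_conv is a hypothetical stand-in for the proposed three-argument format(), it honors just the 'decimal_point' and 'thousands_sep' keys, and the DE dictionary is written by hand rather than captured with locale.setlocale(), since the named locales may not be installed on a given machine.

```python
def format_conv(value, format_spec="", conventions=None):
    """Emulate the proposed format(value, format_spec, conventions) call."""
    conv = {"decimal_point": ".", "thousands_sep": ","}
    conv.update(conventions or {})   # extra localeconv() keys are ignored here
    text = format(value, format_spec)
    table = {
        ord("."): conv["decimal_point"],
        ord(","): conv["thousands_sep"],
    }
    return text.translate(table)


# Hand-written stand-ins for dictionaries obtained from locale.localeconv().
DE = {"decimal_point": ",", "thousands_sep": "."}
DEB = dict(thousands_sep="_")        # style for debugging

print(format_conv(1234.5, "10,.2f", DE))
print(format_conv(1234567, ",d", DEB))
```

Because the conventions travel as an argument, two threads can format with different dictionaries without touching the process-wide locale.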

Am curious whether you guys like this proposal?

Raymond

----- Original Message -----
[Raymond Hettinger]
Mark Dickinson's test code suggested a good, extensible approach to the problem. Here's the idea in a nutshell:
format(value, format_spec='', conventions=None) 'calls value.__format__(format_spec, conventions)'
Where conventions is an optional dictionary with formatting control values. Any value object can accept custom controls, but the names for standard ones would be taken from the standards provided by localeconv():
{ 'decimal_point': '.', 'grouping': [3, 0], 'negative_sign': '-', 'positive_sign': '', 'thousands_sep': ','}
This would let you store several locales using localeconv() and use them at will, thus solving the global variable and threading problems with locale:
    import locale
    loc = locale.getlocale() # get current locale
    locale.setlocale(locale.LC_ALL, 'de_DE')
    DE = locale.localeconv()
    locale.setlocale(locale.LC_ALL, 'en_US')
    US = locale.localeconv()
    locale.setlocale(locale.LC_ALL, loc) # restore saved locale
. . .
    format(x, '8,.f', DE)
    format(y, '8,d', US)
It also lets you write your own conventions on the fly:
    DEB = dict(thousands_sep='_') # style for debugging
    EXT = dict(thousands_sep=',') # style for external display
    . . .
    format(x, '8.1f', DEB)
    format(y, '8d', EXT)
Raymond

Antoine Pitrou wrote:
Raymond Hettinger <python@...> writes:
Am curious whether you guys like this proposal?
I find it good for the builtin format() function, but how does it work for str.format()?
I agree: I like it, but it's not enough. I use str.format() way more often than I hope to ever use builtin format(). If we make any change, I'd rather see it focused on the format mini-language. Eric.

Eric Smith wrote:
Antoine Pitrou wrote:
Raymond Hettinger <python@...> writes:
Am curious whether you guys like this proposal?
I find it good for the builtin format() function, but how does it work for str.format()?
I agree: I like it, but it's not enough. I use str.format() way more often than I hope to ever use builtin format(). If we make any change, I'd rather see it focused on the format mini-language.
Perhaps we could add a new ! type to the formatting language that allows the developer to mark a particular argument as the conventions dictionary? Then you could do something like:

    # DE and US dicts as per Raymond's format() example
    fmt = "The value is {:,.5f}{!conv}"
    fmt.format(num, DE)
    fmt.format(num, US)
    fmt.format(num, dict(thousands_sep="'"))

As with !a and !s, you could use any normal field specifier to select the conventions dictionary. Obviously, the formatting arguments would be ignored for that particular field.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
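The {!conv} field is proposed syntax and, as far as I can tell, is rejected by the current str.format() parser, so it cannot be demonstrated directly. A rough approximation of the same effect (one conventions dictionary applied to every numeric field in a template) can be built on the existing string.Formatter class. ConvFormatter is a hypothetical name, and passing the dictionary to the constructor is an assumption made for this sketch rather than part of the proposal.

```python
import numbers
import string


class ConvFormatter(string.Formatter):
    """A Formatter that applies one conventions dict to all numeric fields."""

    def __init__(self, conventions=None):
        super().__init__()
        conv = {"decimal_point": ".", "thousands_sep": ","}
        conv.update(conventions or {})
        self._table = {
            ord("."): conv["decimal_point"],
            ord(","): conv["thousands_sep"],
        }

    def format_field(self, value, format_spec):
        text = super().format_field(value, format_spec)
        # Only numbers are translated, so commas inside strings survive.
        if isinstance(value, numbers.Number):
            text = text.translate(self._table)
        return text


DE = {"decimal_point": ",", "thousands_sep": "."}
fmt = "The value is {0:,.5f}"
print(ConvFormatter(DE).format(fmt, 1234.5))
print(ConvFormatter().format(fmt, 1234.5))
```

The same template string is reused with different conventions, which is the property the {!conv} field is meant to provide.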

I haven't entirely been following this conversation, so I may be missing something, but what about something like: "Balance = ${balance:{minilang}}".format(balance=1.00, minilang=mini_formatter(thousands_sep=",", ...)) That way, even if the mini-language gets really confusing we'll have an easy to call function that manages it. I always thought it was weird that things co

Carl Johnson wrote:
I haven't entirely been following this conversation, so I may be missing something, but what about something like:
"Balance = ${balance:{minilang}}".format(balance=1.00, minilang=mini_formatter(thousands_sep=",", ...))
That way, even if the mini-language gets really confusing we'll have an easy to call function that manages it. I always thought it was weird that things co
Sorry, Google mail has been being weird lately, signing me out and suddenly sending mail, etc. …So, I always thought it was weird that {}s could nest in the new format language, but if we have that capability, we may as well use it. -- Carl
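Nested replacement fields inside a format spec already work with str.format(), so this suggestion can be sketched with today's Python. The mini_formatter below is a hypothetical helper (only its name comes from the message above); it simply assembles a standard format spec from keyword options, and it accepts only the separators the mini-language itself understands.

```python
def mini_formatter(thousands_sep="", precision=None, type_code="f"):
    """Build a standard format spec string from keyword options."""
    if thousands_sep not in ("", ",", "_"):
        # The standard mini-language only knows ',' and (since 3.6) '_'.
        raise ValueError(f"unsupported separator {thousands_sep!r}")
    spec = thousands_sep
    if precision is not None:
        spec += f".{precision}"
    return spec + type_code


text = "Balance = ${balance:{minilang}}".format(
    balance=1234.5,
    minilang=mini_formatter(thousands_sep=",", precision=2),
)
print(text)
```

The inner {minilang} field is substituted first, and the resulting spec is then applied to balance.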

Nick Coghlan wrote:
Eric Smith wrote:
I agree: I like it, but it's not enough. I use str.format() way more often than I hope to ever use builtin format(). If we make any change, I'd rather see it focused on the format mini-language.
Perhaps we could add a new ! type to the formatting language that allows the developer to mark a particular argument as the conventions dictionary? Then you could do something like:
    # DE and US dicts as per Raymond's format() example
    fmt = "The value is {:,.5f}{!conv}"
A new conversion specifier should follow the current pattern and be a single letter, such as 'c' for 'custom' or 'd' for dict. If, as I would expect, str.format scans left to right and interprets and replaces each field spec as it goes, then the above would not work. So put the conversion field before the fields it applies to. This, of course, makes string formatting stateful. With a 'shift lock' field added, an 'unshift' field should also be added. This, though, has the problem that a blank 'field-name' will in 3.1 either be auto-numbered or flagged as an error (if there are other explicitly numbered fields). I am a little uneasy about 'replacement fields' that are not really replacement fields.
    fmt.format(num, DE)
    fmt.format(num, US)
    fmt.format(num, dict(thousands_sep="'"))
As with !a and !s, you could use any normal field specifier to select the conventions dictionary. Obviously, the formatting arguments would be ignored for that particular field.
Terry Jan Reedy

    # DE and US dicts as per Raymond's format() example
    fmt = "The value is {:,.5f}{!conv}"
A new conversion specifier should follow the current pattern and be a single letter, such as 'c' for 'custom' or 'd' for dict.
If, as I would expect, str.format scans left to right and interprets and replaces each field spec as it goes, then the above would not work. So put the conversion field before the fields it applies to.
My interpretation is that the conv-dictionary applies to the whole string (not field-by-field) and that it can go at the end (because it doesn't affect parsing, rather it applies to the translation phase). Raymond

Raymond Hettinger wrote:
    # DE and US dicts as per Raymond's format() example
    fmt = "The value is {:,.5f}{!conv}"
A new conversion specifier should follow the current pattern and be a single letter, such as 'c' for 'custom' or 'd' for dict.
If, as I would expect, str.format scans left to right and interprets and replaces each field spec as it goes, then the above would not work. So put the conversion field before the fields it applies to.
My interpretation is that the conv-dictionary applies to the whole string (not field-by-field)
That was not specified. If so, then a statement like

    """A number such as {0:15.2f} can be formatted many ways: USA: {0:15,.2f}, EU: {0:15<whatever>f}, India: {0:15<whatever>f}, China: {0:15<whatever>f}"""

would not be possible.

Why not allow extra flexibility? Unless the conversion is set by setting a global variable à la locale, the c-dict will be *used* field-by-field in each call to ob.__format__(fmt, conv), so there is no reason to force each call in a particular series to use the same conversion.
and that it can go at the end (because it doesn't affect parsing, rather it applies to the translation phase).
We agree that parsing out the conversion spec must happen before the translation it affects. If, as I supposed above (because of how I would think to write the code), parsing and translation are intermixed, then parsing the spec *after* translation will not work. Even if they are done in two batches, it would still be easy to rebind the c-dict var during the second-phase scan of the replacement fields. Terry Jan Reedy

My interpretation is that the conv-dictionary applies to the whole string (not field-by-field)
That was not specified. If so, then a statement like """A number such as {0:15.2f} can be formatted many ways: USA: {0:15,.2f}, EU: {0:15<whatever>f}, India: {0:15<whatever>f}, China: {0:15<whatever>f}""" would not be possible.
Why not allow extra flexibility? Unless the conversion is set by setting a global variable ala locale, the c-dict will be *used* field-by-field in each call to ob.__format__(fmt, conv), so there is no reason to force each call in a particular series to use the same conversion.
-1 Unattractive and unnecessary hyper-generalization. Raymond

Terry Reedy wrote:
Nick Coghlan wrote: A new conversion specifier should follow the current pattern and be a single letter, such as 'c' for 'custom' or 'd' for dict.
Because those characters already have other meanings in string formatting dictionary (as do many possible single digit codes). The suggested name "!conv" was chosen based on the existing localeconv() function name.
If, as I would expect, str.format scans left to right and interprets and replaces each field spec as it goes, then the above would not work. So put the conversion field before the fields it applies to.
I believe you're currently right - I'm not sure how hard it would be to change it to a two step process (parse the whole string first into an internal parse tree then go through and format each identified field). As for why I formatted the example the way I did: the {!conv} isn't all that interesting, since it just says "I accept a conventions dictionary". Having it at the front of the format string would give it too much prominence.
This, of course, makes string formatting stateful. With a 'shift lock' field added, an 'unshift' field should also be added. This, though, has the problem that a blank 'field-name' will in 3.1 either be auto-numbered or flagged as an error (if there are other explicitly numbered fields).
Aside from not producing any output, the !conv field would still have to obey all the rules for field naming/numbering. So if your format string used explicit numbering instead of auto-numbering then the !conv would need to be explicitly numbered as well. I agree that having "format fields which are not format fields" isn't ideal, but the alternative is likely to be something like yet-another-string-formatting-method which accepts a positional only conventions dictionary as its first argument. Cheers, Nick.

Eric Smith wrote:
Antoine Pitrou wrote:
Raymond Hettinger <python@...> writes:
Am curious whether you guys like this proposal?
I find it good for the builtin format() function, but how does it work for str.format()?
I agree: I like it, but it's not enough. I use str.format() way more often than I hope to ever use builtin format(). If we make any change, I'd rather see it focused on the format mini-language.
I agree. My impression was that format() was added mostly for consistency with the policy of having a 'public' interface to special methods, and that .__format__ was added to support str.format. Hence, any new capability of .__format__ must be accessible from format strings with replacement fields. tjr

Terry Reedy wrote:
I agree. My impression was that format() was added mostly for consistency with the policy of having a 'public' interface to special methods, and that .__format__ was added to support str.format. Hence, any new capability of .__format__ must be accessible from format strings with replacement fields.
format() was also added because the PEP 3101 syntax is pretty heavyweight when it comes to formatting a single value:

    "%.2f" % (x)
    and
    "{0:.2f}".format(x)

Being able to write format(x, ".2f") instead meant dropping 4 characters (now 3 with str.format autonumbering) over the latter option.

Agreed that any solution in this area needs to help with str.format() and not just format() though.

Cheers,
Nick.
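For reference, the three spellings compared above produce the same text for a single value. Note that the builtin takes the value first and the format spec second. The variable names here are only for illustration.

```python
x = 3.14159

percent_style = "%.2f" % (x,)          # old %-formatting
method_style = "{0:.2f}".format(x)     # PEP 3101 str.format()
builtin_style = format(x, ".2f")       # builtin: value first, then spec

print(percent_style, method_style, builtin_style)
```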

On Wed, 18 Mar 2009 11:25:42 am Raymond Hettinger wrote:
Mark Dickinson's test code suggested a good, extensible approach to the problem. Here's the idea in a nutshell:
format(value, format_spec='', conventions=None) 'calls value.__format__(format_spec, conventions)'
For what was supposed to be a nice, simple way of formatting numbers, it sure became confusing. So thank you for the nutshell. I like this idea, especially if it means we can simplify the format_spec. Can we have the format_spec in a nutshell too?
Where conventions is an optional dictionary with formatting control values. Any value object can accept custom controls, but the names for standard ones would be taken from the standards provided by localeconv():
{ 'decimal_point': '.', 'grouping': [3, 0], 'negative_sign': '-', 'positive_sign': '', 'thousands_sep': ','}
Presumably we value compatibility with localeconv()? If not, then perhaps a better name for 'thousands_sep' is 'group_sep', on account that if you group by something other than 3 it won't represent thousands. Would this allow you to format a float like this? 1,234,567.89012 34567 89012 (group by threes for the integer part, and by fives for the fractional part). Or is that out-of-scope for this proposal? +1 for a conventions dict. Good plan! -- Steven D'Aprano
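The threes-and-fives layout asked about here is easy to sketch outside the mini-language, which gives a feel for what a conventions dict would need to express. Everything below is an assumption for illustration: the names group and format_grouped, the keyword parameters, and the choice to work on the decimal text of the value (a string or Decimal) so that no digits are lost to float rounding. Exponent notation is not handled.

```python
def group(digits, size, sep, from_right):
    """Split a digit string into groups of `size`, counted from one end."""
    if from_right:
        first = len(digits) % size or size   # leftmost group may be short
        parts = [digits[:first]]
        parts += [digits[i:i + size] for i in range(first, len(digits), size)]
    else:
        parts = [digits[i:i + size] for i in range(0, len(digits), size)]
    return sep.join(parts)


def format_grouped(value, int_size=3, frac_size=5,
                   int_sep=",", frac_sep=" ", point="."):
    """Group the integer part from the right and the fraction from the left."""
    text = str(value)
    sign = "-" if text.startswith("-") else ""
    whole, _, frac = text.lstrip("+-").partition(".")
    result = sign + group(whole, int_size, int_sep, from_right=True)
    if frac:
        result += point + group(frac, frac_size, frac_sep, from_right=False)
    return result


print(format_grouped("1234567.890123456789012"))
```

Separate group sizes and separators for the two sides of the decimal point have no counterpart among the localeconv() keys listed above, so supporting this would presumably need additional, non-standard keys.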
participants (7)
- Antoine Pitrou
- Carl Johnson
- Eric Smith
- Nick Coghlan
- Raymond Hettinger
- Steven D'Aprano
- Terry Reedy