Format mini-language for lakh and crore

In South Asia, large numbers use a different style of digit delimiters than in Europe, North America, Australia, etc. With some minor spelling differences, the term lakh is used for a hundred thousand, and it is generally written as '1,00,000'. In turn, a crore is 100 lakh, and is written as '1,00,00,000'. Extending this pattern, larger numbers continue to use groups of two digits (other than the smallest grouping of three digits). So, e.g. 1e12 is written as 10,00,00,00,00,000.

It's nice that we now have the optional underscore in numeric literals. So we could write a number as either `12_34_56_78_00_000` or `1_234_567_800_000` depending on what region of the world and which convention was more familiar.

However, in *formatting* those numbers, the format mini-language only allows the European convention. So e.g.

    In [1]: x = 12_34_56_78_00_000
    In [2]: "{:,d}".format(x)
    Out[2]: '1,234,567,800,000'
    In [3]: f"{x:,d}"
    Out[3]: '1,234,567,800,000'

In order to get Indian number delimiters, you'd have to write a custom formatting function, notwithstanding that something like 1.5 billion people use the three-then-two delimiting convention.

I propose that Python should have an additional grouping option, or some other way to specify this grouping convention. Oddly, the '_' grouping symbol is available, even though no one actually uses that grouper outside of programming languages like Python, e.g.:

    In [4]: f"{x:_d}"
    Out[4]: '1_234_567_800_000'

I guess this is nice for something like round-tripping numbers used in code, but it's not a symbol anyone uses "natively" (I understand why comma or period cannot be used in numeric literals since they mean something else in Python already).

I'm not sure what symbol or combination I would recommend, but finding something suitable shouldn't be so hard. Perhaps now that backtick no longer has any other meaning in Python, it could be used since it looks similar to a comma. E.g. in Python 3.8 we might have:

    f"{x:`d}"
    '12,34,56,78,00,000'

(actually, this probably isn't any parser issue even in Python 2 since it's already inside quotes; but the issue is moot). Or maybe a two character version like:

    f"{x:2,d}"
    '12,34,56,78,00,000'

Or:

    f"{x:,,d}"
    '12,34,56,78,00,000'

Even if `2,` was used, that wouldn't preclude giving an additional length descriptor after it. Now we can have:

    f"{x:,.2f}"
    '1,234,567,800,000.00'

Perhaps in the future this would work:

    f"{x:2,.2f}"
    '12,34,56,78,00,000.00'

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
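
The "custom formatting function" mentioned in the message above might look roughly like the following. This is only a minimal sketch for integers; the name `format_indian` is made up for illustration and is not part of Python or of any proposal in this thread.

```python
def format_indian(n: int) -> str:
    """Group an integer's digits lakh/crore style: one group of 3, then groups of 2."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    # The three least significant digits always form the last group.
    head, tail = digits[:-3], digits[-3:]
    pairs = []
    while head:
        pairs.append(head[-2:])   # peel off two digits at a time, right to left
        head = head[:-2]
    pairs.reverse()
    return sign + ",".join(pairs + [tail])


x = 12_34_56_78_00_000
print(format_indian(x))        # 12,34,56,78,00,000
print(format_indian(10**12))   # 10,00,00,00,00,000
```

It works, but it lives outside the format mini-language, so it cannot be combined with a precision or a type code inside an f-string field.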

Hi David,

Perhaps the "n" locale-dependent number formatting specifier should accept a "," to have locale-appropriate formatting of thousands separators?

    f"{x:,n}"

would Do The Right Thing(TM) depending on the locale. Today it is an error.

Stephan

2018-01-28 7:25 GMT+01:00 David Mertz <mertz@gnosis.cx>:

On 28 January 2018 at 19:30, Stephan Houben <stephanh42@gmail.com> wrote:
Checking https://www.python.org/dev/peps/pep-0378/, we did suggest using the locale module for cases where the engineering-style groups-of-three structure wasn't appropriate, with the parallel being drawn to the fact that you also need to use locale-dependent formatting to get a decimal separator other than ".".

One nice aspect of this suggestion is that supplying the comma would map directly to the "grouping" parameter in https://docs.python.org/3/library/locale.html#locale.format:

    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, "en_IN.utf8")
    'en_IN.utf8'
    >>> locale.format("%d", 10e9, grouping=True)
    '10,00,00,00,000'

Cheers,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
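
For readers trying the session above on a newer interpreter: `locale.format()` was deprecated in Python 3.7 and removed in 3.12, and `locale.format_string()` is the replacement. The sketch below wraps the same idea in a helper. The helper name `format_in_locale` is hypothetical, and it assumes the requested locale may simply not be installed, in which case it returns None.

```python
import locale


def format_in_locale(value, locale_name="en_IN.utf8"):
    """Return value grouped per locale_name, or None if that locale is unavailable.

    Note: setlocale() changes process-wide state, so this is not thread-safe.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)   # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None                               # locale not installed here
    try:
        return locale.format_string("%d", value, grouping=True)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # always restore


print(format_in_locale(10_000_000_000))
```

Locale names and their availability differ between operating systems, so the spelling "en_IN.utf8" is itself an assumption carried over from the session above.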

On 1/28/2018 6:51 AM, Nick Coghlan wrote:
If I recall correctly, we discussed this at the time, and the problem with locale is that it's not thread safe. I agree that if it were, it would be nice to be able to use it, either with 'n', or in some other mode just for grouping. The underlying C setlocale()/localeconv() just isn't very friendly to this use case. Eric.

On Sun, Jan 28, 2018 at 5:46 AM, Eric V. Smith <eric@trueblade.com> wrote:
POSIX.1-2008 added thread-local locales (say that 3x fast); see uselocale(3). This appears to be supported on Linux (since glibc 2.3, which is older than all supported enterprise distros), MacOS, and the BSDs, but not Windows. OTOH Windows, MacOS, and the BSDs all seem to provide the non-standard sprintf_l, which takes an explicit locale to use. So it looks like all mainstream OSes actually make it possible to use a specific locale to do arbitrary formatting in a thread-safe way. -n -- Nathaniel J. Smith -- https://vorpus.org

I actually didn't know about `locale.format("%d", 10e9, grouping=True)`. But it's still much less general than having the option in the f-string/.format() mini-language. This is really about the formatted string, not necessarily about the locale. So, e.g. I'd like to be able to write:
print(f"In European format x is {x:,.2f}, in Indian format it is {x:`.2f}")
I don't want the format necessarily to be some pseudo-global setting, even if it can get stored in thread-locals. That said, having a locale-aware symbol for delimiting numbers in the format mini-language would also not be a bad thing. On Sun, Jan 28, 2018 at 4:27 PM, Nathaniel Smith <njs@pobox.com> wrote:

On Sun, Jan 28, 2018 at 5:31 PM, David Mertz <mertz@gnosis.cx> wrote:
I don't understand the format mini-language well enough to know what would fit in, but maybe some way to (a) request localified formatting, (b) some way to explicitly say which locale you want to use? Like if "h" means "human friendly", it might be something like:

    f"In the current locale x is {x:h.2f}, in Indian format it is {x:h(en_IN).2f}"

-n

-- Nathaniel J. Smith -- https://vorpus.org
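
Per-field control without any global or thread-local setting is already possible today through a wrapper type with its own `__format__`. The following is only a sketch under assumptions: the class name `SouthAsian` is invented here, and it deliberately supports only specs without fill, alignment, or width (for example "d" or ".2f").

```python
class SouthAsian:
    """Hypothetical wrapper: format the value normally, then regroup 3-then-2."""

    def __init__(self, value):
        self.value = value

    def __format__(self, spec):
        text = format(self.value, spec)            # e.g. '1234567800000.00'
        sign = "-" if text.startswith("-") else ""
        int_part, dot, frac = text.lstrip("-").partition(".")
        if not int_part.isdigit():
            # Padding, existing separators, etc. are out of scope for this sketch.
            raise ValueError(f"unsupported format spec for grouping: {spec!r}")
        head, tail = int_part[:-3], int_part[-3:]
        pairs = []
        while head:
            pairs.append(head[-2:])
            head = head[:-2]
        grouped = ",".join(pairs[::-1] + [tail])
        return sign + grouped + dot + frac


x = 12_34_56_78_00_000
print(f"In European format x is {x:,.2f}, in Indian format it is {SouthAsian(x):.2f}")
```

The cost is that every call site has to wrap the value, which is exactly the inconvenience a mini-language option would avoid.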

On 29 January 2018 at 11:48, Nathaniel Smith <njs@pobox.com> wrote:
Given the example, I think a more useful approach would be to allow an optional digit grouping specifier after the comma separator, and allow the separator to be repeated to indicate non-uniform groupings in the lower order digits.

If we did that, then David's example could become:

    >>> print(f"In European format x is {x:,.2f}, in Indian format it is {x:,2,3.2f}")

The core elements of interpreting that would then be:

- digit group size specifiers are permitted for both "," (decimal display only) and "_" (all display bases)
- if no digit group size specifier is given, it defaults to 3 for decimal and 4 for binary, octal, and hexadecimal
- if multiple digit group specifiers are given, then the last one given is applied starting from the least significant integer digit

so "{x:,2,3.2f}" means:

- an arbitrary number of leading 2-digit groups
- 1 group of 3 digits
- 2 decimal places

It would then be reasonably straightforward to use this as a lower level primitive to implement locale dependent formatting, as follows:

- format in English using the locale's grouping rules [1] (either LC_NUMERIC.grouping or LC_MONETARY.mon_grouping, as appropriate)
- use str.translate() [2] to replace "," and "." with the locale's thousands_sep & decimal_point or mon_thousands_sep & mon_decimal_point

[1] https://docs.python.org/3/library/locale.html#locale.localeconv
[2] https://docs.python.org/3/library/stdtypes.html#str.translate

Cheers,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
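
The semantics described above can be emulated in plain Python to see how they would behave. This is a sketch of the proposed behaviour, not of any existing feature: `group_digits`, `format_grouped`, and `localize` are hypothetical helper names, and the group sizes are passed as a tuple rather than parsed from a format spec.

```python
def group_digits(digits, sizes=(3,), sep=","):
    """Group a digit string; sizes are read left to right as in ',2,3'.

    The last size applies to the least significant digits and the
    leftmost size repeats for all remaining leading digits.
    """
    if not sizes or any(size < 1 for size in sizes):
        raise ValueError("sizes must be a non-empty sequence of positive ints")
    pending = list(sizes)
    groups = []
    while digits:
        size = pending.pop() if len(pending) > 1 else pending[0]
        groups.append(digits[-size:])
        digits = digits[:-size]
    return sep.join(reversed(groups))


def format_grouped(x, sizes=(3,), precision=2):
    """Rough emulation of the proposed '{x:,2,3.2f}' for sizes=(2, 3)."""
    int_part, dot, frac = f"{abs(x):.{precision}f}".partition(".")
    return ("-" if x < 0 else "") + group_digits(int_part, sizes) + dot + frac


def localize(text, thousands_sep, decimal_point):
    """The str.translate() step: swap ',' and '.' for locale-specific symbols."""
    return text.translate(str.maketrans({",": thousands_sep, ".": decimal_point}))


x = 12_34_56_78_00_000
print(format_grouped(x, (3,)))      # 1,234,567,800,000.00
print(format_grouped(x, (2, 3)))    # 12,34,56,78,00,000.00
print(localize(format_grouped(1234567.891), ".", ","))   # 1.234.567,89
```

Because `str.translate()` maps every character in a single pass, swapping "," and "." in one call does not clobber one replacement with the other.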

Nick suggests: >>> print(f"In European format x is {x:,.2f}, in Indian format it is {x:,2,3.2f}") This looks very good and general. I only know of the "European" and South Asian conventions in widespread use, but we could give other grouping conventions using that little syntax and it definitely covers the ones I know about. There's not an issue about this giving the parser for the format mini-language hiccups over width specifier in there, is there? On Mon, Jan 29, 2018 at 2:13 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:

On 30 January 2018 at 01:43, David Mertz <mertz@gnosis.cx> wrote:
That's the part I haven't explicitly checked in the code, but I think it would be feasible based on https://docs.python.org/3/library/string.html#format-specification-mini-lang...

My proposal is essentially to replace the current:

    grouping_option ::= "_" | ","

with:

    grouping_option ::= underscore_grouping | comma_grouping
    underscore_grouping ::= "_" [group_width ("_" group_width)*]
    comma_grouping ::= "," [group_width ("," group_width)*]
    group_width ::= digit+

That's unambiguous, since the grouping field still always starts with "_" or ",", and the next field must be either the precision (which always starts with "."), the type (which is always a letter, and never a number or symbol), or the closing brace for the field specifier.

Cheers,
Nick.

P.S. While writing this I noticed that the current format mini-language docs are incorrect and say "integer" where they should be saying "digit+": https://bugs.python.org/issue32720

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
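
To check the "unambiguous" claim, the proposed grammar can be transcribed into a regular expression. The sketch below is an illustration only: `split_grouping` is a hypothetical helper, and it assumes it is handed just the tail of a format spec starting at the grouping field (that is, after any fill, alignment, sign, and width).

```python
import re

# "_" or ",", optionally followed by widths joined by that same separator.
_GROUPING = re.compile(r"(?P<sep>[_,])(?P<widths>\d+(?:(?P=sep)\d+)*)?")


def split_grouping(spec_tail):
    """Return (separator, [group widths], rest of spec) per the proposed grammar."""
    match = _GROUPING.match(spec_tail)
    if match is None:
        return None, [], spec_tail          # no grouping field present
    sep = match.group("sep")
    raw = match.group("widths")
    widths = [int(width) for width in raw.split(sep)] if raw else []
    return sep, widths, spec_tail[match.end():]


print(split_grouping(",2,3.2f"))   # (',', [2, 3], '.2f')
print(split_grouping(",.2f"))      # (',', [], '.2f')
print(split_grouping("_4x"))       # ('_', [4], 'x')
```

The remainder always starts with ".", a letter, or nothing, which is the property the message above relies on.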

On 1/29/2018 2:13 AM, Nick Coghlan wrote:
This just seems too complicated to me, and is overgeneralizing. How many of these different formats would ever really be used? Can you really expect someone to remember what that means by looking at it? If you are going to generalize it, at least go all the way and support the struct lconv "CHAR_MAX" behavior, too. However, I suggest just pick another character to use instead of ",", and have it mean the 2,3 format. With no evidence (and willing to be wrong), it seems like it's the next-most needed variety of this. Maybe use ";"? Eric

On 1 February 2018 at 08:14, Eric V. Smith <eric@trueblade.com> wrote:
Sure - "," and "_" both mean "digit grouping", the numbers tell you how large the groups are from left to right (with the leftmost group size repeated as needed), and a single "," means the same thing as ",3," for decimal digits, and the same thing as ",4," for binary, octal, and hexadecimal digits. Another advantage of this approach is that we'd be able to control the grouping of binary and hexadecimal numbers printed with "_", rather than the current approach where we're restricted to half-byte grouping for binary and 16-bit word grouping for hex.
If you are going to generalize it, at least go all the way and support the struct lconv "CHAR_MAX" behavior, too.
I'm not sure how common that convention is, but if we wanted to support it then I'd spell it by repeating the group separator without an intervening group size specifier: {x:,,2,3.2f} # Ungrouped leading digits, single group of 2, single group of 3
That's even more arbitrary and hard to interpret than listing out the grouping spec, though. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
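
For comparison, the struct lconv convention mentioned above lists group sizes starting from the least significant digits, where a terminating 0 means "repeat the previous size" and CHAR_MAX means "no further grouping". A sketch of applying such a list follows; `apply_lconv_grouping` is a hypothetical name, and the example lists are written by hand rather than read from a real locale.

```python
import locale


def apply_lconv_grouping(digits, grouping, sep=","):
    """Group a digit string using a localeconv()-style grouping list."""
    groups = []            # collected least significant group first
    size = None
    repeat = False
    for entry in grouping:
        if entry == locale.CHAR_MAX:
            break          # leave all remaining leading digits ungrouped
        if entry == 0:
            repeat = size is not None   # reuse the previous size from here on
            break
        size = entry
        if len(digits) <= size:
            break
        groups.append(digits[-size:])
        digits = digits[:-size]
    while repeat and len(digits) > size:
        groups.append(digits[-size:])
        digits = digits[:-size]
    groups.append(digits)
    return sep.join(reversed(groups))


print(apply_lconv_grouping("1234567800000", [3, 2, 0]))                # 12,34,56,78,00,000
print(apply_lconv_grouping("1234567800000", [3, 2, locale.CHAR_MAX]))  # 12345678,00,000
```

The second call matches the "ungrouped leading digits, single group of 2, single group of 3" reading of `{x:,,2,3.2f}` given above.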

On Jan 31, 2018 8:12 PM, "Nick Coghlan" <ncoghlan@gmail.com> wrote: On 1 February 2018 at 08:14, Eric V. Smith <eric@trueblade.com> wrote:
print(f"In European format x is {x:,.2f}, in Indian format it is {x:,2,3.2f}")
That's even more arbitrary and hard to interpret than listing out the grouping spec, though. I suggested a single character, although my thought of backtick was different from Eric's of semicolon. Neither of them would be obvious, but rather "something to look up the first few times." There is a lot in the format mini-language that is "have to look up" though. A single character South Asian number delimiter style wouldn't be different from a lot of features of that DSL. Albeit, most of it seems intuitive after you've used it a while... The symbols are somewhat iconic. I think if we only cared about decimal digit groups (which is all I initially thought of), Nick's would be excessive generalization. However, when you think of also grouping hex, octal, and binary, there genuinely are several conventions and different useful presentations. So overall I do like Nick's approach better than my initial suggestion or Eric's one that is similar to mine.

On 2/1/2018 12:05 AM, David Mertz wrote:
So overall I do like Nick's approach better than my initial suggestion or Eric's one that is similar to mine.
Oops, I'd forgotten that you (David) had proposed a single character in your original email. I'm not trying to claim the idea! The important part is that it's a single character, not what that character is, so I'll refer to it as "David's suggestion"! FWIW, PEP 378 also summarizes some of the discussion we're rehashing, except Nick's proposal (the chosen one) was simpler, and mine slightly more complex, but still not generalized to solve the problem being discussed here. Nine years on it might be worth doing some research to see if other languages have done anything since the PEP was written. For the languages that support picture-style formatting, I suspect not, but maybe there's something to learn from newer languages? Eric

On 1 February 2018 at 14:11, Nick Coghlan <ncoghlan@gmail.com> wrote:
Slight correction here, since the comma-separator is decimal only:

- "," would be short for ",3," with decimal digits
- "_" would be short for "_3_" with decimal digits
- "_" would be short for "_4_" with binary/octal/hexadecimal digits

Cheers,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
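
The defaults listed above correspond to what the existing mini-language already does, which can be checked directly. The snippet only uses current behaviour; the helper name `rejects` is made up for the demonstration.

```python
x = 0xDEADBEEF

print(f"{x:,d}")   # 3,735,928,559  (comma: groups of 3, decimal only)
print(f"{x:_d}")   # 3_735_928_559  (underscore: groups of 3 for decimal)
print(f"{x:_x}")   # dead_beef      (underscore: groups of 4 for hex)
print(f"{x:_b}")   # 1101_1110_1010_1101_1011_1110_1110_1111


def rejects(spec):
    """Return True if the current mini-language refuses this spec for an int."""
    try:
        format(x, spec)
    except ValueError:
        return True
    return False


print(rejects(",x"))   # True: "," cannot be combined with hex
print(rejects(",n"))   # True: "," cannot be combined with "n"
```

There is currently no way to ask for, say, byte-wise grouping such as de_ad_be_ef, which is one of the cases the generalized group widths would cover.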

What about something like:

    f"{x:₹d}"

₹ = Indian Rupees symbol

I realize it is not ASCII but ₹ would be, for the target audience, both easy to type (Ctrl-Alt-4 on Windows Indian English keyboard layout) and be mnemonic ("format number like you would format an amount in rupees").

Stephan

2018-02-01 6:11 GMT+01:00 Nick Coghlan <ncoghlan@gmail.com>:

On Feb 1, 2018 12:17 AM, "Stephan Houben" <stephanh42@gmail.com> wrote:

What about something like:

    f"{x:₹d}"

₹ = Indian Rupees symbol

I realize it is not ASCII but ₹ would be, for the target audience, both easy to type and be mnemonic ("format number like you would format an amount in rupees").

I like how iconic it is very much. However... there are two obstacles to this.

The main one is the BDFL's often-declared opposition to using non-ASCII in Python itself. The format mini-language is borderline between being part of the Python language and merely being strings you can quote (which strongly need to allow non-ASCII literals). It's a lot like regex or glob this way (but for historic reasons at least, both those are also pure ASCII in their syntax elements, but can obviously match non-ASCII literals or classes).

The other obstacle is that not all of South Asia is India.

U+09F3 <http://www.fileformat.info/info/unicode/char/09f3/index.htm> BENGALI RUPEE SIGN ৳
U+0AF1 <http://www.fileformat.info/info/unicode/char/0af1/index.htm> GUJARATI RUPEE SIGN ૱
U+0BF9 <http://www.fileformat.info/info/unicode/char/0bf9/index.htm> TAMIL RUPEE SIGN ௹

I believe U+20A8 ₨ is deprecated in India but still used in Pakistan.

However, the discussion also led me to find this on Wikipedia, which also urged me towards Nick's more general pattern specifier:

Outside of Taiwan, digits are sometimes grouped by myriads instead of thousands. Hence it is more convenient to think of numbers here as in groups of four, thus 1,234,567,890 is regrouped here as 12,3456,7890. Larger than a myriad, each number is therefore four zeroes longer than the one before it,

On 2 February 2018 at 08:23, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I wasn't even aware the restriction existed until this thread (it's one of those "I'd never tried it, so I didn't know it was prohibited" cases). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Someone would have to check, but presumably the CRT on Windows is converting the natively thread-local locale into a process-wide locale for POSIX compatibility, which means it can probably be easily bypassed without having to use specific overloads.

Top-posted from my Windows phone

From: Nathaniel Smith
Sent: Monday, January 29, 2018 11:29
To: Eric V. Smith
Cc: python-ideas
Subject: Re: [Python-ideas] Format mini-language for lakh and crore

On Sun, Jan 28, 2018 at 09:51:05PM +1000, Nick Coghlan wrote:
Or you could use string replacement:

    py> format(123456789.25, "0,.3f").replace(',', '_').replace('.', '·')
    '123_456_789·250'

which may not be quite as convenient or efficient, and it doesn't work where groups-of-three aren't appropriate. But on the other hand using locale has a number of disadvantages too:

- quoting PEP 378: "Finance users and non-professional programmers find the locale approach to be frustrating, arcane and non-obvious". https://www.python.org/dev/peps/pep-0378/
- the available locales and their spellings are OS dependent;
- it's not cheap, thread-safe or local to your library/function.

The documentation for locale warns: "It is generally a bad idea to call setlocale() in some library routine, since as a side effect it affects the entire program. Saving and restoring it is almost as bad: it is expensive and affects other threads that happen to run before the settings have been restored."

-- Steve

It’s my opinion that instead of adding syntax, we should encourage using number formatting library functions.

* You can replace the function or have the function dispatch differently depending on locale
* It means that syntax doesn’t need to be extended for every use case; it’s easier to replace a function than change syntax.

From: Python-ideas [mailto:python-ideas-bounces+tritium-list=sdamon.com@python.org] On Behalf Of David Mertz
Sent: Sunday, January 28, 2018 1:25 AM
To: python-ideas <python-ideas@python.org>
Subject: [Python-ideas] Format mini-language for lakh and crore

Hi David, Perhaps the "n" locale-dependent number formatting specifier should accept a , to have locale-appropriate formatting of thousand separators? f"{x:,n}" would Do The Right Thing(TM) depending on the locale. Today it is an error. Stephan 2018-01-28 7:25 GMT+01:00 David Mertz <mertz@gnosis.cx>:

On 28 January 2018 at 19:30, Stephan Houben <stephanh42@gmail.com> wrote:
Checking https://www.python.org/dev/peps/pep-0378/, we did suggest using the locale module for cases where the engineering style groups-of-three structure wasn't appropriate, with the parallel being drawn to the fact that you also need to use locale dependent formatting to get a decimal separator other than ".". One nice aspect of this suggestion is that supplying the comma would map directly to the "grouping" parameter in https://docs.python.org/3/library/locale.html#locale.format: >>> import locale >>> locale.setlocale(locale.LC_ALL, "en_IN.utf8") 'en_IN.utf8' >>> locale.format("%d", 10e9, grouping=True) '10,00,00,00,000' Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 1/28/2018 6:51 AM, Nick Coghlan wrote:
If I recall correctly, we discussed this at the time, and the problem with locale is that it's not thread safe. I agree that if it were, it would be nice to be able to use it, either with 'n', or in some other mode just for grouping. The underlying C setlocale()/localeconv() just isn't very friendly to this use case. Eric.

On Sun, Jan 28, 2018 at 5:46 AM, Eric V. Smith <eric@trueblade.com> wrote:
POSIX.1-2008 added thread-local locales (say that 3x fast); see uselocale(3). This appears to be supported on Linux (since glibc 2.3, which is older than all supported enterprise distros), MacOS, and the BSDs, but not Windows. OTOH Windows, MacOS, and the BSDs all seem to provide the non-standard sprintf_l, which takes an explicit locale to use. So it looks like all mainstream OSes actually make it possible to use a specific locale to do arbitrary formatting in a thread-safe way. -n -- Nathaniel J. Smith -- https://vorpus.org

I actually didn't know about `locale.format("%d", 10e9, grouping=True)`. But it's still much less general than having the option in the f-string/.format() mini-language. This is really about the formatted string, not necessarily about the locale. So, e.g. I'd like to be able to write:
print(f"In European format x is {x:,.2f}, in Indian format it is {x:`.2f}")
I don't want the format necessarily to be some pseudo-global setting, even if it can get stored in thread-locals. That said, having a locale-aware symbol for delimiting numbers in the format mini-language would also not be a bad thing. On Sun, Jan 28, 2018 at 4:27 PM, Nathaniel Smith <njs@pobox.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On Sun, Jan 28, 2018 at 5:31 PM, David Mertz <mertz@gnosis.cx> wrote:
I don't understand the format mini-language well enough to know what would fit in, but maybe some way to (a) request localified formatting, (b) some way to explicitly say which locale you want to use? Like if "h" means "human friendly", it might be something like: f"In the current locale x is {x:h.2f}, in Indian format it is {x:h(en_IN).2f}" -n -- Nathaniel J. Smith -- https://vorpus.org

On 29 January 2018 at 11:48, Nathaniel Smith <njs@pobox.com> wrote:
Given the example, I think a more useful approach would be to allow an optional digit grouping specifier after the comma separator, and allow the separator to be repeated to indicate non-uniform groupings in the lower order digits. If we did that, then David's example could become: >>> print(f"In European format x is {x:,.2f}, in Indian format it is {x:,2,3.2f}") The core elements of interpreting that would then be: - digit group size specifiers are permited for both "," (decimal display only) and "_" (all display bases) - if no digit group size specifier is given, it defaults to 3 for decimal and 4 for binary, octal, and hexadecimal - if multiple digit group specifiers are given, then the last one given is applied starting from the least significant integer digit so "{x:,2,3.2f}" means: - an arbitrary number of leading 2-digit groups - 1 group of 3 digits - 2 decimal places It would then be reasonably straightforward to use this as a lower level primitive to implement locale dependent formatting, as follows: - format in English using the locale's grouping rules [1] (either LC_NUMERIC.grouping or LC_MONETARY.mon_grouping, as appropriate) - use str.translate() [2] to replace "," and "." with the locale's thousands_sep & decimal_point or mon_thousands_sep & mon_decimal_point [1] https://docs.python.org/3/library/locale.html#locale.localeconv [2] https://docs.python.org/3/library/stdtypes.html#str.translate Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick suggests: >>> print(f"In European format x is {x:,.2f}, in Indian format it is {x:,2,3.2f}") This looks very good and general. I only know of the "European" and South Asian conventions in widespread use, but we could give other grouping conventions using that little syntax and it definitely covers the ones I know about. There's not an issue about this giving the parser for the format mini-language hiccups over width specifier in there, is there? On Mon, Jan 29, 2018 at 2:13 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.

On 30 January 2018 at 01:43, David Mertz <mertz@gnosis.cx> wrote:
That's the part I haven't explicitly checked in the code, but I think it would be feasible based on https://docs.python.org/3/library/string.html#format-specification-mini-lang... My proposal is essentially to replace the current: grouping_option ::= "_" | "," with: grouping_option ::= underscore_grouping | comma_grouping underscore_grouping ::= "_" [group_width ("_" group_width)*] comma_grouping ::= "," [group_width ("," group_width)*] group_width ::= digit+ That's unambiguous, since the grouping field still always starts with "_" or ",", and the next field must be either the precision (which always starts with "."), the type (which is always a letter, and never a number or symbol), or the closing brace for the field specifier. Cheers, Nick. P.S. While writing this I noticed that the current format mini-language docs are incorrect and say "integer" where they should be saying "digit+": https://bugs.python.org/issue32720 -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 1/29/2018 2:13 AM, Nick Coghlan wrote:
This just seems too complicated to me, and is overgeneralizing. How many of these different formats would ever really be used? Can you really expect someone to remember what that means by looking at it? If you are going to generalize it, at least go all the way and support the struct lconv "CHAR_MAX" behavior, too. However, I suggest just pick another character to use instead of ",", and have it mean the 2,3 format. With no evidence (and willing to be wrong), it seems like it's the next-most needed variety of this. Maybe use ";"? Eric

On 1 February 2018 at 08:14, Eric V. Smith <eric@trueblade.com> wrote:
Sure - "," and "_" both mean "digit grouping", the numbers tell you how large the groups are from left to right (with the leftmost group size repeated as needed), and a single "," means the same thing as ",3," for decimal digits, and the same thing as ",4," for binary, octal, and hexadecimal digits. Another advantage of this approach is that we'd be able to control the grouping of binary and hexadecimal numbers printed with "_", rather than the current approach where we're restricted to half-byte grouping for binary and 16-bit word grouping for hex.
If you are going to generalize it, at least go all the way and support the struct lconv "CHAR_MAX" behavior, too.
I'm not sure how common that convention is, but if we wanted to support it then I'd spell it by repeating the group separator without an intervening group size specifier: {x:,,2,3.2f} # Ungrouped leading digits, single group of 2, single group of 3
That's even more arbitrary and hard to interpret than listing out the grouping spec, though. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Jan 31, 2018 8:12 PM, "Nick Coghlan" <ncoghlan@gmail.com> wrote: On 1 February 2018 at 08:14, Eric V. Smith <eric@trueblade.com> wrote:
print(f"In European format x is {x:,.2f}, in Indian format it is {x:,2,3.2f}")
That's even more arbitrary and hard to interpret than listing out the grouping spec, though. I suggested a single character, although my thought of backtick was different from Eric's of semicolon. Neither of them would be obvious, but rather "something to look up the first few times." There is a lot in the format mini-language that is "have to look up" though. A single character South Asian number delimiter style wouldn't be different from a lot of features of that DSL. Albeit, most of it seems intuitive after you've used it a while... The symbols are somewhat iconic. I think if we only cared about decimal digit groups (which is all I initially thought of), Nick's would be excessive generalization. However, when you think of also grouping hex, octal, and binary, there genuinely are several conventions and different useful presentations. So overall I do like Nick's approach better than my initial suggestion or Eric's one that is similar to mine.

On 2/1/2018 12:05 AM, David Mertz wrote:
So overall I do like Nick's approach better than my initial suggestion or Eric's one that is similar to mine.
Oops, I'd forgotten that you (David) had proposed a single character in your original email. I'm not trying to claim the idea! The important part is that it's a single character, not what that character is, so I'll refer to it as "David's suggestion"! FWIW, PEP 378 also summarizes some of the discussion we're rehashing, except Nick's proposal (the chosen one) was simpler, and mine slightly more complex, but still not generalized to solve the problem being discussed here. Nine years on it might be worth doing some research to see if other languages have done anything since the PEP was written. For the languages that support picture-style formatting, I suspect not, but maybe there's something to learn from newer languages? Eric

On 1 February 2018 at 14:11, Nick Coghlan <ncoghlan@gmail.com> wrote:
Slight correction here, since the comma-separator is decimal only:

- "," would be short for ",3," with decimal digits
- "_" would be short for "_3_" with decimal digits
- "_" would be short for "_4_" with binary/octal/hexadecimal digits

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
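For reference, the shorthand in this correction lines up with what the format mini-language already accepts today. A small check follows, assuming Python 3.6 or later (where the "_" option exists); `accepts` is a helper name invented for this illustration only.

```python
def accepts(spec, value=255):
    """Return True if format() accepts `spec` for an int (helper invented here)."""
    try:
        format(value, spec)
    except ValueError:
        return False
    return True


# Comma: groups of three, decimal presentation only.
decimal_comma = format(1234567, ",d")
# Underscore: groups of three for decimal digits...
decimal_underscore = format(1234567, "_d")
# ...and groups of four for hex, binary and octal digits.
hex_underscore = format(0xDEADBEEF, "_x")
binary_underscore = format(0b11110000, "_b")
octal_underscore = format(0o1234567, "_o")

print(decimal_comma, decimal_underscore)
print(hex_underscore, binary_underscore, octal_underscore)
print("comma with hex accepted?", accepts(",x"))
```

The last line shows why the correction matters: the comma option is rejected for non-decimal presentation types, so only "_" has a groups-of-four meaning.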

What about something like:

f"{x:₹d}"

₹ = Indian Rupees symbol

I realize it is not ASCII but ₹ would be, for the target audience, both easy to type (Ctrl-Alt-4 on Windows Indian English keyboard layout) and be mnemonic ("format number like you would format an amount in rupees").

Stephan
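Until something like this exists in the mini-language, the effect can be approximated in user code. The sketch below is not part of any proposal in this thread: `LakhInt` is a hypothetical `int` subclass whose `__format__` treats a leading "₹" as a request for three-then-two grouping, and it only handles the plain decimal case ("₹" or "₹d").

```python
class LakhInt(int):
    """Hypothetical int subclass: a leading '₹' asks for 3-then-2 grouping."""

    def __format__(self, spec):
        if not spec.startswith("₹"):
            # Anything else is handled by the normal int formatting.
            return super().__format__(spec)
        if spec[1:] not in ("", "d"):
            raise ValueError("this sketch only supports '₹' and '₹d'")

        sign = "-" if self < 0 else ""
        digits = format(abs(int(self)), "d")

        # The rightmost group has three digits, every group left of it has two.
        head, tail = digits[:-3], digits[-3:]
        groups = []
        while head:
            groups.append(head[-2:])
            head = head[:-2]
        groups.reverse()
        return sign + ",".join(groups + [tail])


x = LakhInt(12_34_56_78_00_000)
print(f"{x:₹d}")   # three-then-two grouping
print(f"{x:,d}")   # the built-in comma option is unchanged
```

This keeps the spelling from the message above, but the cost is that values have to be wrapped in the subclass before formatting.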

On Feb 1, 2018 12:17 AM, "Stephan Houben" <stephanh42@gmail.com> wrote: What about something like: f"{x:₹d}" ₹ = Indian Rupees symbol I realize it is not ASCII but ₹ would be, for the target audience, both easy to type and be mnemonic ("format number like you would format an amount in rupees").

I like how iconic it is very much. However... There are two obstacles to this. The main one is the BDFL's oft-declared opposition to using non-ASCII in Python itself. The format mini-language is borderline between being part of the Python language and merely being strings you can quote (which strongly need to allow non-ASCII literals). It's a lot like regex or glob this way (but for historic reasons at least, both those are also pure ASCII in their syntax elements, but can obviously match non-ASCII literals or classes).

The other element is that not all of South Asia is India.

U+09F3 <http://www.fileformat.info/info/unicode/char/09f3/index.htm> BENGALI RUPEE SIGN ৳
U+0AF1 <http://www.fileformat.info/info/unicode/char/0af1/index.htm> GUJARATI RUPEE SIGN ૱
U+0BF9 <http://www.fileformat.info/info/unicode/char/0bf9/index.htm> TAMIL RUPEE SIGN ௹

I believe U+20A8 ₨ is deprecated in India but still used in Pakistan.

However, the discussion also led me to find this on Wikipedia, which also urged towards Nick's more general pattern specifier:

Outside of Taiwan, digits are sometimes grouped by myriads instead of thousands. Hence it is more convenient to think of numbers here as in groups of four, thus 1,234,567,890 is regrouped here as 12,3456,7890. Larger than a myriad, each number is therefore four zeroes longer than the one before it,
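To make the case for a general pattern concrete, here is a minimal sketch of a grouping helper that covers thousands, three-then-two, and myriads with one parameter. The function name `group_digits` and its `sizes` argument are assumptions made for this illustration; they are not the spec syntax discussed in the thread. Group sizes are listed from the right, and the last size repeats.

```python
def group_digits(digits, sizes, sep=","):
    """Group a string of digits from the right (hypothetical helper).

    `sizes` lists group widths starting with the rightmost group;
    the final entry repeats for all remaining groups.
    """
    if not sizes or any(size <= 0 for size in sizes):
        raise ValueError("sizes must be a non-empty sequence of positive ints")

    groups = []
    end = len(digits)
    index = 0
    while end > 0:
        # Once the listed sizes run out, keep reusing the last one.
        size = sizes[min(index, len(sizes) - 1)]
        start = max(end - size, 0)
        groups.append(digits[start:end])
        end = start
        index += 1
    return sep.join(reversed(groups))


number = "1234567890"
print(group_digits(number, (3,)))      # thousands
print(group_digits(number, (3, 2)))    # lakh and crore style
print(group_digits(number, (4,)))      # myriads
print(group_digits("deadbeef", (4,), sep="_"))
```

Because the helper works on a digit string rather than a number, the same code serves hex, octal and binary output, which is the case where a single decimal-only flag falls short.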

On 2 February 2018 at 08:23, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I wasn't even aware the restriction existed until this thread (it's one of those "I'd never tried it, so I didn't know it was prohibited" cases). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Someone would have to check, but presumably the CRT on Windows is converting the natively thread-local locale into a process-wide locale for POSIX compatibility, which means it can probably be easily bypassed without having to use specific overloads.

Top-posted from my Windows phone

From: Nathaniel Smith
Sent: Monday, January 29, 2018 11:29
To: Eric V. Smith
Cc: python-ideas
Subject: Re: [Python-ideas] Format mini-language for lakh and crore

On Sun, Jan 28, 2018 at 5:46 AM, Eric V. Smith <eric@trueblade.com> wrote:
POSIX.1-2008 added thread-local locales (say that 3x fast); see uselocale(3). This appears to be supported on Linux (since glibc 2.3, which is older than all supported enterprise distros), MacOS, and the BSDs, but not Windows. OTOH Windows, MacOS, and the BSDs all seem to provide the non-standard sprintf_l, which takes an explicit locale to use. So it looks like all mainstream OSes actually make it possible to use a specific locale to do arbitrary formatting in a thread-safe way. -n -- Nathaniel J. Smith -- https://vorpus.org

On Sun, Jan 28, 2018 at 09:51:05PM +1000, Nick Coghlan wrote:
Or you could use string replacement:

py> format(123456789.25, "0,.3f").replace(',', '_').replace('.', '·')
'123_456_789·250'

which may not be quite as convenient or efficient, and it doesn't work where groups-of-three aren't appropriate. But on the other hand using locale has a number of disadvantages too:

- quoting PEP 378: "Finance users and non-professional programmers find the locale approach to be frustrating, arcane and non-obvious". https://www.python.org/dev/peps/pep-0378/
- the available locales and their spellings are OS dependent;
- it's not cheap, thread-safe or local to your library/function.

The documentation for locale warns: "It is generally a bad idea to call setlocale() in some library routine, since as a side effect it affects the entire program. Saving and restoring it is almost as bad: it is expensive and affects other threads that happen to run before the settings have been restored."

-- Steve
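For comparison, this is roughly what the locale route looks like, as a sketch only. The names `temporary_locale` and `format_grouped` are invented here, and the locale name "en_IN.UTF-8" is an assumption that holds on some Linux systems but may be spelled differently, or be missing, elsewhere. The save-and-restore step is exactly the pattern the documentation warns about: it changes process-wide state and is not thread-safe.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(name, category=locale.LC_NUMERIC):
    """Switch the process-wide locale, then restore it (not thread-safe)."""
    saved = locale.setlocale(category)  # query the current setting
    try:
        locale.setlocale(category, name)
        yield
    finally:
        locale.setlocale(category, saved)


def format_grouped(number, locale_name):
    """Format an int with the locale's grouping, or None if the locale is unavailable."""
    try:
        with temporary_locale(locale_name):
            return locale.format_string("%d", number, grouping=True)
    except locale.Error:
        return None


# Prints a grouped string if the assumed locale is installed, otherwise None.
print(format_grouped(12_34_56_78_00_000, "en_IN.UTF-8"))
```

The `None` fallback illustrates the second bullet above: whether this prints anything useful depends on which locales the operating system happens to provide.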

It’s my opinion that instead of adding syntax, we should instead encourage using number formatting library functions.

* You can replace the function or have the function dispatch differently depending on locale
* It means that syntax doesn’t need to be extended for every use case – it’s easier to replace a function than change syntax.
participants (9)
- Alex Walters
- David Mertz
- Eric V. Smith
- Greg Ewing
- Nathaniel Smith
- Nick Coghlan
- Stephan Houben
- Steve Dower
- Steven D'Aprano