PEP 515: Underscores in Numeric Literals

Hey all,

based on the feedback so far, I revised the PEP. There is now a much simpler rule for allowed underscores, with no exceptions. This made the grammar simpler as well.

---------------------------------------------------------------------------

PEP: 515
Title: Underscores in Numeric Literals
Version: $Revision$
Last-Modified: $Date$
Author: Georg Brandl
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 10-Feb-2016
Python-Version: 3.6

Abstract and Rationale
======================

This PEP proposes to extend Python's syntax so that underscores can be used in integral, floating-point and complex number literals.

This is a common feature of other modern languages, and can aid readability of long literals, or literals whose value should clearly separate into parts, such as bytes or words in hexadecimal notation.

Examples::

    # grouping decimal numbers by thousands
    amount = 10_000_000.0

    # grouping hexadecimal addresses by words
    addr = 0xDEAD_BEEF

    # grouping bits into bytes in a binary literal
    flags = 0b_0011_1111_0100_1110

    # making the literal suffix stand out more
    imag = 1.247812376e-15_j

Specification
=============

The current proposal is to allow one or more consecutive underscores following digits and base specifiers in numeric literals.

The production list for integer literals would therefore look like this::

    integer: decimalinteger | octinteger | hexinteger | bininteger
    decimalinteger: nonzerodigit (digit | "_")* | "0" ("0" | "_")*
    nonzerodigit: "1"..."9"
    digit: "0"..."9"
    octinteger: "0" ("o" | "O") "_"* octdigit (octdigit | "_")*
    hexinteger: "0" ("x" | "X") "_"* hexdigit (hexdigit | "_")*
    bininteger: "0" ("b" | "B") "_"* bindigit (bindigit | "_")*
    octdigit: "0"..."7"
    hexdigit: digit | "a"..."f" | "A"..."F"
    bindigit: "0" | "1"

For floating-point and complex literals::

    floatnumber: pointfloat | exponentfloat
    pointfloat: [intpart] fraction | intpart "."
    exponentfloat: (intpart | pointfloat) exponent
    intpart: digit (digit | "_")*
    fraction: "." intpart
    exponent: ("e" | "E") ["+" | "-"] intpart
    imagnumber: (floatnumber | intpart) ("j" | "J")

Alternative Syntax
==================

Underscore Placement Rules
--------------------------

Instead of the liberal rule specified above, the use of underscores could be limited. Common rules are (see the "other languages" section):

* Only one consecutive underscore allowed, and only between digits.
* Multiple consecutive underscores allowed, but only between digits.

A less common rule would be to allow underscores only every N digits (where N could be 3 for decimal literals, or 4 for hexadecimal ones). This is unnecessarily restrictive, especially considering the separator placement is different in different cultures.

Different Separators
--------------------

A proposed alternate syntax was to use whitespace for grouping. Although strings are a precedent for combining adjoining literals, the behavior can lead to unexpected effects which are not possible with underscores. Also, no other language is known to use this rule, except for languages that generally disregard any whitespace.

C++14 introduces apostrophes for grouping, which is not considered due to the conflict with Python's string literals. [1]_

Behavior in Other Languages
===========================

Those languages that do allow underscore grouping implement a large variety of rules for allowed placement of underscores. This is a listing placing the known rules into three major groups. In cases where the language spec contradicts the actual behavior, the actual behavior is listed.

**Group 1: liberal**

This group is the least homogeneous: the rules vary slightly between languages. All of them allow trailing underscores. Some allow underscores after non-digits like the ``e`` or the sign in exponents.

* D [2]_
* Perl 5 (underscores basically allowed anywhere, although docs say it's more restricted) [3]_
* Rust (allows between exponent sign and digits) [4]_
* Swift (although textual description says "between digits") [5]_

**Group 2: only between digits, multiple consecutive underscores**

* C# (open proposal for 7.0) [6]_
* Java [7]_

**Group 3: only between digits, only one underscore**

* Ada [8]_
* Julia (but not in the exponent part of floats) [9]_
* Ruby (docs say "anywhere", in reality only between digits) [10]_

Implementation
==============

A preliminary patch that implements the specification given above has been posted to the issue tracker. [11]_

Open Questions
==============

This PEP currently only proposes changing the literal syntax. The following extensions are open for discussion:

* Allowing underscores in string arguments to the ``Decimal`` constructor. It could be argued that these are akin to literals, since there is no Decimal literal available (yet).

* Allowing underscores in string arguments to ``int()`` with base argument 0, ``float()`` and ``complex()``.

References
==========

.. [1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3499.html

.. [2] http://dlang.org/spec/lex.html#integerliteral

.. [3] http://perldoc.perl.org/perldata.html#Scalar-value-constructors

.. [4] http://doc.rust-lang.org/reference.html#number-literals

.. [5] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift...

.. [6] https://github.com/dotnet/roslyn/issues/216

.. [7] https://docs.oracle.com/javase/7/docs/technotes/guides/language/underscores-...

.. [8] http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#2.4

.. [9] http://docs.julialang.org/en/release-0.4/manual/integers-and-floating-point-...

.. [10] http://ruby-doc.org/core-2.3.0/doc/syntax/literals_rdoc.html#label-Numbers

.. [11] http://bugs.python.org/issue26331

Copyright
=========

This document has been placed in the public domain.
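As a reading aid, the draft production list in the Specification section can be hand-translated into regular expressions. This is only a sketch of the liberal rule as posted in this revision; the names `DRAFT_LITERAL` and `matches_draft_grammar` are made up for illustration, and the patterns are not the tokenizer patch referenced in the PEP, nor necessarily what any released Python accepts.

```python
import re

# Integer productions: underscores may follow digits and base specifiers.
DECIMAL = r"(?:[1-9][0-9_]*|0[0_]*)"
OCT = r"0[oO]_*[0-7][0-7_]*"
HEX = r"0[xX]_*[0-9a-fA-F][0-9a-fA-F_]*"
BIN = r"0[bB]_*[01][01_]*"
INTEGER = rf"(?:{OCT}|{HEX}|{BIN}|{DECIMAL})"

# Float and imaginary productions, built from "intpart" as in the draft.
INTPART = r"[0-9][0-9_]*"
FRACTION = rf"\.{INTPART}"
EXPONENT = rf"[eE][+-]?{INTPART}"
POINTFLOAT = rf"(?:(?:{INTPART})?{FRACTION}|{INTPART}\.)"
EXPONENTFLOAT = rf"(?:{POINTFLOAT}|{INTPART}){EXPONENT}"
FLOATNUMBER = rf"(?:{EXPONENTFLOAT}|{POINTFLOAT})"
IMAGNUMBER = rf"(?:{FLOATNUMBER}|{INTPART})[jJ]"

DRAFT_LITERAL = re.compile(rf"(?:{IMAGNUMBER}|{FLOATNUMBER}|{INTEGER})")


def matches_draft_grammar(text):
    """Return True if text is a numeric literal under the draft's liberal rule."""
    return DRAFT_LITERAL.fullmatch(text) is not None
```

Under this sketch all four examples from the Abstract match, as do trailing and doubled underscores such as `1_000_` and `1__000`, while `_1000`, `0x_` and `1e_5` do not, because an underscore there does not follow a digit or base specifier.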

On Thu, Feb 11, 2016 at 7:22 PM, Georg Brandl <g.brandl@gmx.net> wrote:
> * Allowing underscores in string arguments to the ``Decimal`` constructor. It could be argued that these are akin to literals, since there is no Decimal literal available (yet).
> * Allowing underscores in string arguments to ``int()`` with base argument 0, ``float()`` and ``complex()``.

I'm -0.5 on both of these, with the caveat that if either gets done, both should be. Decimal() shouldn't be different from int() just because there's currently no way to express a Decimal literal; if Python 3.7 introduces such a literal, there'd be this weird rule difference that has to be maintained for backward compatibility, and has no justification left.

(As a side point, I would be fully in favour of Decimal literals. I'd also be in favour of something like "from __future__ import fraction_literals" so 1/2 would evaluate to Fraction(1,2) rather than 0.5. Hence I'm inclined *not* to support underscores in Decimal().)

ChrisA

On 11 February 2016 at 11:12, Chris Angelico <rosuav@gmail.com> wrote:
> On Thu, Feb 11, 2016 at 7:22 PM, Georg Brandl <g.brandl@gmx.net> wrote:
>> The following extensions are open for discussion:
>> * Allowing underscores in string arguments to the ``Decimal`` constructor. It could be argued that these are akin to literals, since there is no Decimal literal available (yet).
>> * Allowing underscores in string arguments to ``int()`` with base argument 0, ``float()`` and ``complex()``.
> I'm -0.5 on both of these, with the caveat that if either gets done, both should be. Decimal() shouldn't be different from int() just because there's currently no way to express a Decimal literal; if Python 3.7 introduces such a literal, there'd be this weird rule difference that has to be maintained for backward compatibility, and has no justification left.

I would be weakly in favour of all relevant constructors being updated to match the new syntax. The main reason is just consistency, and that the documentation already kind of guarantees that the literal syntax is supported (definitely for int and float; for complex it is too vague). To be consistent, the following minor extensions of the syntax should be allowed, which are not legal Python literals: int("0_001"), int("J_00", 20), float("0_001"), complex("0_001"). Maybe also with non-ASCII digits. However, I tried writing Arabic-Indic digits (U+0660 etc.) and my web browser split the number apart when I inserted an underscore. Maybe a right-to-left thing. But writing it with Devanagari digits (U+0966, U+0967) was fine: int("१_०००") (= 1_000). Non-ASCII digits are apparently intentionally supported, but not documented: <https://bugs.python.org/issue10581>.
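To make the proposed constructor extensions concrete, here is a minimal sketch that emulates them on top of the existing constructors by deleting underscores first. The helper names are hypothetical and appear nowhere in the PEP or the patch, and the sketch does no validation of where the underscores sit.

```python
def int_allowing_underscores(text, base=10):
    """Hypothetical helper: drop underscores, then defer to the built-in int()."""
    return int(text.replace("_", ""), base)


def float_allowing_underscores(text):
    """Hypothetical helper: drop underscores, then defer to the built-in float()."""
    return float(text.replace("_", ""))


# The cases listed in the message above.
leading_zeros = int_allowing_underscores("0_001")        # not a legal literal
base_twenty = int_allowing_underscores("J_00", 20)       # J is the digit 19
as_float = float_allowing_underscores("0_001")
# Devanagari one followed by three Devanagari zeros, written with escapes.
devanagari = int_allowing_underscores("\u0967_\u0966\u0966\u0966")
```

Because the helpers only strip underscores, they would also accept strings such as `"_1"`; a real implementation would have to apply whichever placement rule the PEP settles on.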

> (As a side point, I would be fully in favour of Decimal literals. I'd also be in favour of something like "from __future__ import fraction_literals" so 1/2 would evaluate to Fraction(1,2) rather than 0.5. Hence I'm inclined *not* to support underscores in Decimal().)

Seems more like an argument to have the support in Decimal() consistent with float() etc, i.e. all or nothing.

On Feb 11, 2016, at 09:22 AM, Georg Brandl wrote:
> based on the feedback so far, I revised the PEP. There is now a much simpler rule for allowed underscores, with no exceptions. This made the grammar simpler as well.

I'd be +1, but there's something missing from the PEP: what the underscores *mean*. You describe the syntax nicely, but not the semantics.

From reading the examples, I'd guess that the underscores are semantically transparent, meaning that the resulting value is the same if you just removed the underscores and interpreted the resulting literal.

Right or wrong, could you please add a paragraph explaining the meaning of the underscores?

Cheers,
-Barry
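The "semantically transparent" reading guessed at above can be written down directly. This is a sketch of that guess only, not wording from the PEP; `value_without_underscores` is a made-up name, and the helper does not check whether the underscores are placed legally.

```python
import ast


def value_without_underscores(literal_text):
    """Evaluate a numeric literal after deleting every underscore in it."""
    return ast.literal_eval(literal_text.replace("_", ""))


# The four examples from the PEP's Abstract, evaluated under that reading.
amount = value_without_underscores("10_000_000.0")
addr = value_without_underscores("0xDEAD_BEEF")
flags = value_without_underscores("0b_0011_1111_0100_1110")
imag = value_without_underscores("1.247812376e-15_j")
```

If the guess is right, each name ends up bound to the same value the underscore-free literal would produce.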

On 11Feb2016 0651, Barry Warsaw wrote:
> On Feb 11, 2016, at 09:22 AM, Georg Brandl wrote:
>> based on the feedback so far, I revised the PEP. There is now a much simpler rule for allowed underscores, with no exceptions. This made the grammar simpler as well.
> I'd be +1, but there's something missing from the PEP: what the underscores *mean*. You describe the syntax nicely, but not the semantics.
> From reading the examples, I'd guess that the underscores are semantically transparent, meaning that the resulting value is the same if you just removed the underscores and interpreted the resulting literal.
> Right or wrong, could you please add a paragraph explaining the meaning of the underscores?

Glad I kept reading the thread this far - just pretend I also wrote exactly the same thing as Barry.

Cheers,
Steve

On 02/11/2016 05:52 PM, Steve Dower wrote:
> On 11Feb2016 0651, Barry Warsaw wrote:
>> On Feb 11, 2016, at 09:22 AM, Georg Brandl wrote:
>>> based on the feedback so far, I revised the PEP. There is now a much simpler rule for allowed underscores, with no exceptions. This made the grammar simpler as well.
>> I'd be +1, but there's something missing from the PEP: what the underscores *mean*. You describe the syntax nicely, but not the semantics.
>> From reading the examples, I'd guess that the underscores are semantically transparent, meaning that the resulting value is the same if you just removed the underscores and interpreted the resulting literal.
>> Right or wrong, could you please add a paragraph explaining the meaning of the underscores?
> Glad I kept reading the thread this far - just pretend I also wrote exactly the same thing as Barry.

D'oh :) I added (hopefully) clarifying wording.

Thanks,
Georg

On Feb 11, 2016, at 00:22, Georg Brandl <g.brandl@gmx.net> wrote:
> * Allowing underscores in string arguments to the ``Decimal`` constructor. It could be argued that these are akin to literals, since there is no Decimal literal available (yet).

I'm +1 on this. Partly for consistency (see below)--but also, one of the use cases for Decimal is when you need more precision than float, meaning you'll often have even more digits to separate.

> * Allowing underscores in string arguments to ``int()`` with base argument 0, ``float()`` and ``complex()``.

+1, because these are actually defined in terms of literals. For example, under int, "Base 0 means to interpret exactly as a code literal". This isn't actually quite true, because "-2" is not an integer literal but is accepted here--but see float for an example that *is* rigorously defined, and still defers to literal syntax and semantics.
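A short illustration of the base-0 behaviour described above, using only the built-in `int()`; the variable name `leading_zero_rejected` is just for this sketch.

```python
# With base 0, int() picks the base from the prefix, as in source code.
assert int("0x1f", 0) == 31
assert int("0b101", 0) == 5

# "-2" is a unary minus applied to the literal 2, not a literal itself,
# yet int() accepts it.
assert int("-2", 0) == -2

# A decimal string with a leading zero is rejected, matching the rule
# that such a spelling is not a valid integer literal.
try:
    int("010", 0)
except ValueError:
    leading_zero_rejected = True
else:
    leading_zero_rejected = False
```

So base 0 follows literal syntax for prefixes and leading zeros, while still tolerating a sign that the literal grammar itself does not include.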

On Thu, 11 Feb 2016 at 00:23 Georg Brandl <g.brandl@gmx.net> wrote:
> Hey all,
>
> based on the feedback so far, I revised the PEP. There is now a much simpler rule for allowed underscores, with no exceptions. This made the grammar simpler as well.
>
> ---------------------------------------------------------------------------
>
> PEP: 515
> Title: Underscores in Numeric Literals
> Version: $Revision$
> Last-Modified: $Date$
> Author: Georg Brandl
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 10-Feb-2016
> Python-Version: 3.6
>
> Abstract and Rationale
> ======================
>
> This PEP proposes to extend Python's syntax so that underscores can be used in integral, floating-point and complex number literals.
>
> This is a common feature of other modern languages, and can aid readability of long literals, or literals whose value should clearly separate into parts, such as bytes or words in hexadecimal notation.
>
> Examples::
>
>     # grouping decimal numbers by thousands
>     amount = 10_000_000.0
>
>     # grouping hexadecimal addresses by words
>     addr = 0xDEAD_BEEF
>
>     # grouping bits into bytes in a binary literal
>     flags = 0b_0011_1111_0100_1110
>
>     # making the literal suffix stand out more
>     imag = 1.247812376e-15_j
>
> Specification
> =============
>
> The current proposal is to allow one or more consecutive underscores following digits and base specifiers in numeric literals.

+1 from me. Nice and simple! And we can always update PEP 8 to disallow any usage that we deem ugly.

On 11.02.16 10:22, Georg Brandl wrote:
> Abstract and Rationale
> ======================
>
> This PEP proposes to extend Python's syntax so that underscores can be used in integral, floating-point and complex number literals.
>
> This is a common feature of other modern languages, and can aid readability of long literals, or literals whose value should clearly separate into parts, such as bytes or words in hexadecimal notation.

I have a strong preference for the stricter and simpler rule used by most other languages -- "only between two digits". Main arguments:

1. A simple rule is easier to understand, remember and recognize. I care not about the complexity of the implementation (there is no large difference), but about cognitive complexity.

2. Most languages use this rule. It is better to follow an informal standard than to invent a rule that differs from the rules in every other language. This will help programmers who use multiple languages.

I have provided an alternative patch and can provide an alternative PEP if it is needed.

> The production list for integer literals would therefore look like this::
>
>     integer: decimalinteger | octinteger | hexinteger | bininteger
>     decimalinteger: nonzerodigit (digit | "_")* | "0" ("0" | "_")*
>     nonzerodigit: "1"..."9"
>     digit: "0"..."9"
>     octinteger: "0" ("o" | "O") "_"* octdigit (octdigit | "_")*

    octinteger: "0" ("o" | "O") octdigit (["_"] octdigit)*

>     hexinteger: "0" ("x" | "X") "_"* hexdigit (hexdigit | "_")*

    hexinteger: "0" ("x" | "X") hexdigit (["_"] hexdigit)*

>     bininteger: "0" ("b" | "B") "_"* bindigit (bindigit | "_")*

    bininteger: "0" ("b" | "B") bindigit (["_"] bindigit)*

>     octdigit: "0"..."7"
>     hexdigit: digit | "a"..."f" | "A"..."F"
>     bindigit: "0" | "1"
>
> For floating-point and complex literals::
>
>     floatnumber: pointfloat | exponentfloat
>     pointfloat: [intpart] fraction | intpart "."
>     exponentfloat: (intpart | pointfloat) exponent
>     intpart: digit (digit | "_")*

    intpart: digit (["_"] digit)*

>     fraction: "." intpart
>     exponent: ("e" | "E") ["+" | "-"] intpart
>     imagnumber: (floatnumber | intpart) ("j" | "J")
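The practical difference between the two sets of productions is easiest to see side by side. The sketch below translates only `intpart` and `hexinteger` from each variant into regular expressions; the pattern and function names are invented for illustration and come from neither patch.

```python
import re

# Stricter alternative: at most one underscore, only between two digits.
STRICT_INTPART = re.compile(r"[0-9](?:_?[0-9])*")
STRICT_HEX = re.compile(r"0[xX][0-9a-fA-F](?:_?[0-9a-fA-F])*")

# Draft rule: any run of underscores after a digit or a base specifier.
LIBERAL_INTPART = re.compile(r"[0-9][0-9_]*")
LIBERAL_HEX = re.compile(r"0[xX]_*[0-9a-fA-F][0-9a-fA-F_]*")


def check_intpart(text):
    """Return (strict_ok, liberal_ok) for a run of decimal digits."""
    return (STRICT_INTPART.fullmatch(text) is not None,
            LIBERAL_INTPART.fullmatch(text) is not None)


def check_hex(text):
    """Return (strict_ok, liberal_ok) for a hexadecimal integer."""
    return (STRICT_HEX.fullmatch(text) is not None,
            LIBERAL_HEX.fullmatch(text) is not None)
```

Both variants accept `1_000` and `0xDEAD_BEEF`. Only the draft rule accepts `1__000`, `1000_` and `0x_DEAD`, and neither accepts a leading underscore such as `_1000`.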

> **Group 1: liberal**
>
> This group is the least homogeneous: the rules vary slightly between languages. All of them allow trailing underscores. Some allow underscores after non-digits like the ``e`` or the sign in exponents.
>
> * D [2]_
> * Perl 5 (underscores basically allowed anywhere, although docs say it's more restricted) [3]_
> * Rust (allows between exponent sign and digits) [4]_
> * Swift (although textual description says "between digits") [5]_
>
> **Group 2: only between digits, multiple consecutive underscores**
>
> * C# (open proposal for 7.0) [6]_
> * Java [7]_
>
> **Group 3: only between digits, only one underscore**
>
> * Ada [8]_
> * Julia (but not in the exponent part of floats) [9]_
> * Ruby (docs say "anywhere", in reality only between digits) [10]_

This classification is misleading. The difference between groups 2 and 3 is less than between different languages in group 1. To be fair, groups 2 and 3 should be united in one group. C++ should be included in this group. Perl 5 and Swift should be either included in both groups or excluded from any group, because they have inconsistencies between the documentation and the implementation or between different parts of the documentation.

With correct classification it is obvious what variant is the most popular.

On 02/11/2016 10:50 AM, Serhiy Storchaka wrote:
> I have a strong preference for the stricter and simpler rule used by most other languages -- "only between two digits". Main arguments:
>
> 2. Most languages use this rule. It is better to follow an informal standard than to invent a rule that differs from the rules in every other language. This will help programmers who use multiple languages.

If Python followed other languages in everything:

1) Python would not need to exist; and
2) Python would suck ;)

If our rule is more permissive than other languages then cross-language developers can still use the same style in both languages, without penalizing those who want to use the extra freedom in Python.

--
~Ethan~

On 2/11/2016 11:01 AM, Ethan Furman wrote:
> On 02/11/2016 10:50 AM, Serhiy Storchaka wrote:
>> I have a strong preference for the stricter and simpler rule used by most other languages -- "only between two digits". Main arguments:
>>
>> 2. Most languages use this rule. It is better to follow an informal standard than to invent a rule that differs from the rules in every other language. This will help programmers who use multiple languages.
>
> If Python followed other languages in everything:
>
> 1) Python would not need to exist; and
> 2) Python would suck ;)
>
> If our rule is more permissive than other languages then cross-language developers can still use the same style in both languages, without penalizing those who want to use the extra freedom in Python.

Ditto.

If people need an idea to shoot down, regarding literal constants, and because I couldn't find a Python-Non-Ideas list to post this in, here is one. Note that it is unambiguous, does not conflict with existing binary literals, but otherwise sucks. Please vote this idea down with emphasis:

Base 64 decoding literals:

    print( 0b64_CjMy_NTM0_Mjkw_NQ )
    325342905

On Thu, Feb 11, 2016 at 08:50:09PM +0200, Serhiy Storchaka wrote:
> I have a strong preference for the stricter and simpler rule used by most other languages -- "only between two digits". Main arguments:
>
> 1. A simple rule is easier to understand, remember and recognize. I care not about the complexity of the implementation (there is no large difference), but about cognitive complexity.
>
> 2. Most languages use this rule. It is better to follow an informal standard than to invent a rule that differs from the rules in every other language. This will help programmers who use multiple languages.
>
> I have provided an alternative patch and can provide an alternative PEP if it is needed.

I don't think an alternative PEP is needed, but I hope that your alternative gets a fair treatment in the PEP.

>> The production list for integer literals would therefore look like this::
>>
>>     integer: decimalinteger | octinteger | hexinteger | bininteger
>>     decimalinteger: nonzerodigit (digit | "_")* | "0" ("0" | "_")*
>>     nonzerodigit: "1"..."9"
>>     digit: "0"..."9"
>>     octinteger: "0" ("o" | "O") "_"* octdigit (octdigit | "_")*
>
>     octinteger: "0" ("o" | "O") octdigit (["_"] octdigit)*
>
>>     hexinteger: "0" ("x" | "X") "_"* hexdigit (hexdigit | "_")*
>
>     hexinteger: "0" ("x" | "X") hexdigit (["_"] hexdigit)*
>
>>     bininteger: "0" ("b" | "B") "_"* bindigit (bindigit | "_")*
>
>     bininteger: "0" ("b" | "B") bindigit (["_"] bindigit)*

To me, Serhiy's versions (starting with single > symbols) are not only simpler to learn, but have a simpler (or at least shorter) implementation too.

[...]

>> **Group 3: only between digits, only one underscore**
>>
>> * Ada [8]_
>> * Julia (but not in the exponent part of floats) [9]_
>> * Ruby (docs say "anywhere", in reality only between digits) [10]_
>
> This classification is misleading. The difference between groups 2 and 3 is less than between different languages in group 1. To be fair, groups 2 and 3 should be united in one group. C++ should be included in this group. Perl 5 and Swift should be either included in both groups or excluded from any group, because they have inconsistencies between the documentation and the implementation or between different parts of the documentation.
>
> With correct classification it is obvious what variant is the most popular.

It is not obvious to me what you think the correct classification is. If you disagree with Georg's classification, would you reclassify the languages, and if there is agreement that you are correct, he can update the PEP?

--
Steve

On 2/11/2016 12:22 AM, Georg Brandl wrote:
> Hey all,
>
> based on the feedback so far, I revised the PEP. There is now a much simpler rule for allowed underscores, with no exceptions. This made the grammar simpler as well.

+1 overall

> Examples::
>
>     # grouping decimal numbers by thousands
>     amount = 10_000_000.0
>
>     # grouping hexadecimal addresses by words
>     addr = 0xDEAD_BEEF
>
>     # grouping bits into bytes in a binary literal

nybbles, not bytes, are shown... which is more readable, and does group into bytes also.

>     flags = 0b_0011_1111_0100_1110

+1 on 0b_ and 0X_ and, especially, 0O_ (but why anyone would use uppercase base designators is beyond me, as it is definitely less readable)

>     # making the literal suffix stand out more
>     imag = 1.247812376e-15_j

+1 on _j

participants (11)

- Andrew Barnert
- Barry Warsaw
- Brett Cannon
- Chris Angelico
- Ethan Furman
- Georg Brandl
- Glenn Linderman
- Martin Panter
- Serhiy Storchaka
- Steve Dower
- Steven D'Aprano