[Python-3000] Updated PEP: Integer literal syntax and radices (was octal/binary discussion)

Mon Mar 19 06:14:12 CET 2007

The update includes issues discussed to date, plus the support of
uppercase on input of octal and binary, e.g. '0O123'.

It was pointed out to me that, since I suggested upper/lowercase was
an issue for another PEP, removal of uppercase octal/binary belonged
in that same PEP, if anybody cares enough to write it.  (It seems that
style guides and user preference will lead most people to write '0o'
instead of '0O', so perhaps there is no compelling need.)

Abstract
========

This PEP proposes changes to the Python core to rationalize
the treatment of string literal representations of integers
in different radices (bases).  These changes are targeted at
Python 3.0, but the backward-compatible parts of the changes
should be added to Python 2.6, so that all valid 3.0 integer
literals will also be valid in 2.6.

The proposal is that:

   a) octal literals must now be specified
      with a leading "0o" or "0O" instead of "0";

   b) binary literals are now supported via a
      leading "0b" or "0B"; and

   c) provision will be made for binary numbers in
      string formatting.
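As an illustration of the three points above, here is a minimal sketch, assuming an interpreter that already implements the proposal (any Python 3 release does).  The variable names are illustrative only.

```python
# (a) octal literal with the new "0o" prefix
mode = 0o755          # 7*64 + 5*8 + 5

# (b) binary literal with the "0b" prefix
flags = 0b1010        # 8 + 2

# (c) binary output through string formatting (PEP 3101 style)
flags_text = format(flags, "b")

print(mode, flags, flags_text)
```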

Motivation
==========

This PEP was motivated by two different issues:

    - The default octal representation of integers is silently confusing
      to people unfamiliar with C-like languages.  It is extremely easy
      to inadvertently create an integer object with the wrong value,
      because '013' means 'decimal 11', not 'decimal 13', to the Python
      language itself, which is not the meaning that most humans would
      assign to this literal.

    - Some Python users have a strong desire for binary support in
      the language.

Specification
=============

Grammar specification
---------------------

The grammar will be changed.  For Python 2.6, the changed and
new token definitions will be::

     integer        ::=     decimalinteger | octinteger | hexinteger |
                            bininteger | oldoctinteger

     octinteger     ::=     "0" ("o" | "O") octdigit+

     bininteger     ::=     "0" ("b" | "B") bindigit+

     oldoctinteger  ::=     "0" octdigit+

     bindigit       ::=     "0" | "1"

For Python 3.0, "oldoctinteger" will not be supported, and
an exception will be raised if a literal has a leading "0" and
a second character which is a digit.
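The 3.0 token rules can be approximated with regular expressions.  This is only a sketch: "classify" is a hypothetical helper, and the patterns for decimalinteger and hexinteger are assumed to keep their existing forms, since this PEP does not restate them.

```python
import re

# octinteger and bininteger follow the definitions above; the decimal and
# hexadecimal patterns are assumptions based on the existing grammar.
PATTERNS = [
    ("bininteger", re.compile(r"0[bB][01]+")),
    ("octinteger", re.compile(r"0[oO][0-7]+")),
    ("hexinteger", re.compile(r"0[xX][0-9a-fA-F]+")),
    ("decimalinteger", re.compile(r"0|[1-9][0-9]*")),
]


def classify(token):
    """Return the name of the matching token rule, or None if invalid."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return name
    return None


for token in ("0o17", "0B101", "0x1F", "123", "0", "017"):
    print(token, classify(token))
```

Under these patterns "017" matches no rule, which corresponds to the exception described for a leading "0" followed by a digit.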

For both versions, this will require changes to PyLong_FromString
as well as the grammar.

The documentation will have to be changed as well:  grammar.txt,
as well as the integer literal section of the reference manual.

PEP 306 should be checked for other issues, and that PEP should
be updated if the procedure described therein is insufficient.

int() specification
--------------------

int(s, 0) will also match the new grammar definition.

This should happen automatically with the changes to
PyLong_FromString required for the grammar change.
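A short sketch of the intended behavior, assuming a Python 3 interpreter; "parse" is a hypothetical wrapper used only to turn a rejected string into None.

```python
def parse(text, base):
    """Return int(text, base), or None if the string is rejected."""
    try:
        return int(text, base)
    except ValueError:
        return None


# With base 0 the string is read as a source-code literal.
print(parse("0o17", 0))    # new octal syntax
print(parse("0b101", 0))   # new binary syntax
print(parse("017", 0))     # old octal syntax is rejected

# Without a base, or with an explicit one, leading zeros are harmless.
print(int("017"), int("017", 10), int("017", 8))
```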

Also the documentation for int() should be changed to explain
that int(s) operates identically to int(s, 10), and the word
"guess" should be removed from the description of int(s, 0).

long() specification
--------------------

For Python 2.6, the long() implementation and documentation
should be changed to reflect the new grammar.

Tokenizer exception handling
----------------------------

If an invalid token contains a leading "0", the exception
error message should be more informative than the current
"SyntaxError: invalid token".  It should explain that decimal
numbers may not have a leading zero, and that octal numbers
require an "o" after the leading zero.

int() exception handling
------------------------

The ValueError raised for any call to int() with a string
should at least explicitly contain the base in the error
message, e.g.::

    ValueError: invalid literal for base 8 int(): 09
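The exact wording is up to the implementation.  The following sketch, assuming a Python 3 interpreter, only captures the message so that it can be inspected; it does not assert any particular phrasing beyond the presence of the base and the offending string.

```python
try:
    int("09", 8)
except ValueError as exc:
    message = str(exc)

print(message)
```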

oct() function
---------------

oct() should be updated to output '0o' in front of
the octal digits (for 3.0, and 2.6 compatibility mode).
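A minimal sketch of the round trip this enables, assuming a Python 3 interpreter: the output of oct() is itself a valid literal under the new grammar.

```python
text = oct(8)
print(text)

# The result can be fed back through int(s, 0) or eval().
round_trip = int(text, 0)
print(round_trip)
```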

Output formatting
-----------------

The string (and unicode in 2.6) % operator will have a
'b' format specifier added for binary, and the alternate
syntax of the 'o' option will need to be updated to
add '0o' in front, instead of '0'.

PEP 3101 already supports 'b' for binary output.
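The sketch below, assuming a Python 3 interpreter, shows the alternate form of 'o' with the % operator and the 'b' code through PEP 3101 formatting.  The 'b' specifier for the % operator is deliberately not exercised, since whether it is accepted depends on the interpreter.

```python
# Alternate form of 'o' with the % operator.
plain_octal = "%o" % 8
prefixed_octal = "%#o" % 8

# Binary output through PEP 3101 formatting.
plain_binary = format(10, "b")
prefixed_binary = "{:#b}".format(10)

print(plain_octal, prefixed_octal, plain_binary, prefixed_binary)
```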

Transition from 2.6 to 3.0
---------------------------

The 2to3 translator will have to insert 'o' into any
octal string literal.

The Py3K compatible option to Python 2.6 should cause
attempts to use oldoctinteger literals to raise an
exception.
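The rewrite itself is mechanical.  The following is a deliberately naive, regex-based sketch and not the actual 2to3 fixer: "rewrite_old_octal" is a hypothetical name, it does not handle the 2.x "L" suffix, and it would also touch digits inside string literals, which a real translator working on the token stream would avoid.

```python
import re

# A "0" followed by octal digits, not embedded in a name, a float,
# or a longer number.
_OLD_OCTAL = re.compile(r"(?<![\w.])0([0-7]+)(?![\w.])")


def rewrite_old_octal(source):
    """Insert 'o' after the leading zero of old-style octal literals."""
    return _OLD_OCTAL.sub(r"0o\1", source)


print(rewrite_old_octal("os.chmod(path, 0755)"))
```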

Rationale
=========

Most of the discussion on these issues occurred on the Python-3000
mailing list starting 14-Mar-2007, prompted by an observation that
the average human being would be completely mystified upon finding
that prepending a "0" to a string of digits changes the meaning of
that digit string entirely.

It was pointed out during this discussion that a similar, but shorter,
discussion on the subject occurred in January of 2006, prompted by a
discovery of the same issue.

Background
----------

For historical reasons, Python's string representation of integers
in different bases (radices), for string formatting and token
literals, borrows heavily from C.  [1]_ [2]_ Usage has shown that
the historical method of specifying an octal number is confusing,
and also that it would be nice to have additional support for binary
literals.

Throughout this document, unless otherwise noted, discussions about
the string representation of integers relate to these features:

    - Literal integer tokens, as used by normal module compilation,
      by eval(), and by int(token, 0).  (int(token) and int(token, 2-36)
      are not modified by this proposal.)

           * Under 2.6, long() is treated the same as int()

    - Formatting of integers into strings, either via the % string
      operator or the new PEP 3101 advanced string formatting method.

It is presumed that:

    - All of these features should have an identical set
      of supported radices, for consistency.

    - Python source code syntax and int(mystring, 0) should
      continue to share identical behavior.

Removal of old octal syntax
----------------------------

This PEP proposes that the ability to specify an octal number by
using a leading zero will be removed from the language in Python 3.0
(and the Python 3.0 preview mode of 2.6), and that a SyntaxError will
be raised whenever a leading "0" is immediately followed by another
digit.
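A sketch of the resulting behavior, assuming a Python 3 interpreter; "compiles" is a hypothetical helper that reports whether an expression is accepted by the compiler.

```python
def compiles(source):
    """Return True if source is accepted as an expression."""
    try:
        compile(source, "<literal>", "eval")
    except SyntaxError:
        return False
    return True


print(compiles("010"))    # leading zero followed by a digit
print(compiles("0o10"))   # explicit octal prefix
print(eval("0o10"))
```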

During the present discussion, it was almost universally agreed that::

    eval('010') == 8

should no longer be true, because that is confusing to new users.
It was also proposed that::

    eval('0010') == 10

should become true, but that is much more contentious, because it is so
inconsistent with usage in other computer languages that mistakes are
likely to be made.

Almost all currently popular computer languages, including C/C++,
Java, Perl, and JavaScript, treat a sequence of digits with a
leading zero as an octal number.  Proponents of treating these
numbers as decimal instead have a very valid point -- as discussed
in `Supported radices`_, below, the entire non-computer world uses
decimal numbers almost exclusively.  There is ample anecdotal
evidence that many people are dismayed and confused if they
are confronted with non-decimal radices.

However, in most situations, most people do not write gratuitous
zeros in front of their decimal numbers.  The primary exception is
when an attempt is being made to line up columns of numbers.  But
since PEP 8 specifically discourages the use of spaces to try to
align Python code, one would suspect the same argument should apply
to the use of leading zeros for the same purpose.

Finally, although the email discussion often focused on whether anybody
actually *uses* octal any more, and whether we should cater to those
old-timers in any case, that is almost entirely beside the point.

Assume the rare complete newcomer to computing who *does*, either
occasionally or as a matter of habit, use leading zeros for decimal
numbers.  Python could either:

    a) silently do the wrong thing with his numbers, as it does now;

    b) immediately disabuse him of the notion that this is viable syntax
       (and yes, the SyntaxWarning should be more gentle than it
       currently is, but that is a subject for a different PEP); or

    c) let him continue to think that computers are happy with
       multi-digit decimal integers which start with "0".

Some people passionately believe that (c) is the correct answer,
and they would be absolutely right if we could be sure that new
users will never blossom and grow and start writing AJAX applications.

So while a new Python user may (currently) be mystified at the
delayed discovery that his numbers don't work properly, we can
fix it by explaining to him immediately that Python doesn't like
leading zeros (hopefully with a reasonable message!), or we can
delegate this teaching experience to the JavaScript interpreter
in the Internet Explorer browser, and let him try to debug his
issue there.

Supported radices
-----------------

This PEP proposes that the supported radices for the Python
language will be 2, 8, 10, and 16.

Once it is agreed that the old syntax for octal (radix 8) representation
of integers must be removed from the language, the next obvious
question is "Do we actually need a way to specify (and display)
numbers in octal?"

This question is quickly followed by "What radices does the language
need to support?"  Because computers are so adept at doing what you
tell them to, a tempting answer in the discussion was "all of them."
This answer has obviously been given before -- the int() constructor
will accept an explicit radix with a value between 2 and 36, inclusive,
with the latter number bearing a suspicious arithmetic similarity to
the sum of the number of numeric digits and the number of same-case
letters in the ASCII alphabet.
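The arithmetic behind the upper limit can be checked directly.  This is a small sketch, assuming a Python 3 interpreter; "symbols" is an illustrative name.

```python
import string

# Ten digits plus twenty-six same-case letters give 36 symbols.
symbols = string.digits + string.ascii_lowercase
print(len(symbols))

# The highest symbol in base 36, and the first two-symbol number.
print(int("z", 36), int("10", 36))

# A radix outside 2..36 is rejected.
try:
    int("1", 37)
    accepted = True
except ValueError:
    accepted = False
print(accepted)
```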

But the best argument for inclusion will have a use-case to back
it up, so the idea of supporting all radices was quickly rejected,
and the only radices left with any real support were decimal,
hexadecimal, octal, and binary.

Just because a particular radix has a vocal supporter on the
mailing list does not mean that it really should be in the
language, so the rest of this section is a treatise on the
utility of these particular radices, vs. other possible choices.

Humans use other numeric bases constantly.  If I tell you that
it is 12:30 PM, I have communicated quantitative information
arguably composed of *three* separate bases (12, 60, and 2),
only one of which is in the "agreed" list above.  But the
*communication* of that information used two decimal digits
each for the base 12 and base 60 information, and, perversely,
two letters for information which could have fit in a single
decimal digit.

So, in general, humans communicate "normal" (non-computer)
numerical information either via names (AM, PM, January, ...)
or via use of decimal notation.  Obviously, names are
seldom used for large sets of items, so decimal is used for
everything else.  There are studies which attempt to explain
why this is so, typically reaching the expected conclusion
that the Arabic numeral system is well-suited to human
cognition. [3]_

There is even support in the history of the design of
computers to indicate that decimal notation is the correct
way for computers to communicate with humans.  One of
the first modern computers, ENIAC [4]_, computed in decimal,
even though there were already existing computers which
operated in binary.

Decimal computer operation was important enough
that many computers, including the ubiquitous PC, have
instructions designed to operate on "binary coded decimal"
(BCD) [5]_, a representation which devotes 4 bits to each
decimal digit.  These instructions date from a time when the
most strenuous calculations ever performed on many numbers
were the calculations actually required to perform textual
I/O with them.  It is possible to display BCD without having
to perform a divide/remainder operation on every displayed
digit, and this was a huge computational win when most
hardware didn't have fast divide capability.  Another factor
contributing to the use of BCD is that, with BCD calculations,
rounding will happen exactly the same way that a human would
do it, so BCD is still sometimes used in fields like finance,
despite the computational and storage superiority of binary.
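The display advantage can be sketched as follows.  Both function names are hypothetical, and the code models packed BCD with ordinary Python integers rather than any particular hardware instruction.

```python
def to_bcd(n):
    """Pack the decimal digits of a non-negative integer, 4 bits each."""
    bcd = 0
    shift = 0
    while True:
        n, digit = divmod(n, 10)
        bcd |= digit << shift
        shift += 4
        if n == 0:
            return bcd


def bcd_to_text(bcd):
    """Render packed BCD using only shifts and masks, with no division."""
    digits = []
    while True:
        digits.append(chr(ord("0") + (bcd & 0xF)))
        bcd >>= 4
        if bcd == 0:
            return "".join(reversed(digits))


packed = to_bcd(1234)
print(hex(packed), bcd_to_text(packed))
```

The division appears only when converting from binary to BCD; once a number is held in BCD, producing its text needs nothing more than a 4-bit mask and a shift per digit.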

So, if it weren't for the fact that computers themselves
normally use binary for efficient computation and data
storage, string representations of integers would probably
always be in decimal.

Unfortunately, computer hardware doesn't think like humans,
so programmers and hardware engineers must often resort to
thinking like the computer, which means that it is important
for Python to have the ability to communicate binary data
in a form that is understandable to humans.

The requirement that the binary data notation must be cognitively
easy for humans to process means that it should contain an integral
number of binary digits (bits) per symbol, while otherwise
conforming quite closely to the standard tried-and-true decimal
notation (position indicates power, larger magnitude on the left,
not too many symbols in the alphabet, etc.).

The obvious "sweet spot" for this binary data notation is
thus octal, which packs the largest integral number of bits
possible into a single symbol chosen from the Arabic numeral
alphabet.

In fact, some computer architectures, such as the PDP-8 and the
8080/Z80, were defined in terms of octal, in the sense of arranging
the bitfields of instructions in groups of three, and using
octal representations to describe the instruction set.

Even today, octal is important because of bit-packed structures
which consist of 3 bits per field, such as Unix file permission
masks.

But octal has a drawback when used for larger numbers.  The
number of bits per symbol, while integral, is not itself
a power of two.  This limitation (given that the word size
of most computers these days is a power of two) has resulted
in hexadecimal, which is more popular than octal despite the
fact that it requires a 60% larger alphabet than decimal,
because each symbol contains 4 bits.

Some numbers, such as Unix file permission masks, are easily
decoded by humans when represented in octal, but difficult to
decode in hexadecimal, while other numbers are much easier for
humans to handle in hexadecimal.
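The permission-mask case can be illustrated with a short sketch, assuming a Python 3 interpreter.  The mode value is arbitrary, and stat.filemode() is used only to show the conventional textual form.

```python
import stat

mode = 0o754

# Each octal digit is exactly one 3-bit permission field.
owner = (mode >> 6) & 0o7
group = (mode >> 3) & 0o7
other = mode & 0o7
print(owner, group, other)

# The same value in hexadecimal hides the field boundaries.
print(oct(mode), hex(mode))

# Textual form for a regular file with these permissions.
print(stat.filemode(stat.S_IFREG | mode))
```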

Unfortunately, there are also binary numbers used in computers
which are not very well communicated in either hexadecimal or
octal. Thankfully, fewer people have to deal with these on a
regular basis, but on the other hand, this means that several
people on the discussion list questioned the wisdom of adding
a straight binary representation to Python.

One example of where these numbers are very useful is in
reading and writing hardware registers.  Sometimes hardware
designers will eschew human readability and opt for address
space efficiency, by packing multiple bit fields into a single
hardware register at unaligned bit locations, and it is tedious
and error-prone for a human to reconstruct a 5 bit field which
consists of the upper 3 bits of one hex digit, and the lower 2
bits of the next hex digit.
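A sketch of such a register follows.  The layout is hypothetical: an 8-bit register whose 5-bit field occupies bits 1 through 5, that is, the upper 3 bits of the low hex digit and the lower 2 bits of the high hex digit.

```python
REG = 0b01101010          # register value; reads as 0x6a in hexadecimal
FIELD_MASK = 0b00111110   # bits 1..5
FIELD_SHIFT = 1

field = (REG & FIELD_MASK) >> FIELD_SHIFT

print(hex(REG))
print(format(field, "05b"), field)
```

Written in binary, the mask shows the field position at a glance; the equivalent hexadecimal mask, 0x3e, does not.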

Even if the ability of Python to communicate binary information
to humans is only useful for a small technical subset of the
population, it is exactly that population subset which contains
most, if not all, members of the Python core team, so even straight
binary, the least useful of these notations, has several enthusiastic
supporters and few, if any, staunch opponents, among the Python community.

Syntax for supported radices
-----------------------------

This proposal is to use a "0o" prefix with either uppercase
or lowercase "o" for octal, and a "0b" prefix with either
uppercase or lowercase "b" for binary.

There was strong support for not supporting uppercase, but
this is a separate subject for a different PEP, as 'j' for
complex numbers, 'e' for exponent, and 'r' for raw string
(to name a few) already support uppercase.
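The existing precedent, and the proposed additions, can be seen side by side in this sketch, which assumes a Python 3 interpreter.  Each pair holds an uppercase spelling and its lowercase equivalent.

```python
pairs = [
    (0X1F, 0x1f),        # hexadecimal prefix
    (1E3, 1e3),          # exponent
    (2J, 2j),            # complex suffix
    (R"\n", r"\n"),      # raw string prefix
    (0O17, 0o17),        # octal prefix (this proposal)
    (0B11, 0b11),        # binary prefix (this proposal)
]

print(all(upper == lower for upper, lower in pairs))
```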

The syntax for delimiting the different radices received a lot of
attention in the discussion on Python-3000.  There are several
(sometimes conflicting) requirements and "nice-to-haves" for
this syntax:

    - It should be as compatible with other languages and
      previous versions of Python as is reasonable, both
      for the input syntax and for the output (e.g. string
      % operator) syntax.

    - It should be as obvious to the casual observer as
      possible.

    - It should be easy to visually distinguish integers
      formatted in the different bases.

Proposed syntaxes included things like arbitrary radix prefixes,
such as 16r100 (256 in hexadecimal), and radix suffixes, similar
to the 100h assembler-style suffix.  The debate on whether the
letter "O" could be used for octal was intense -- an uppercase
"O" looks suspiciously similar to a zero in some fonts.  Suggestions
were made to use a "c" (the second letter of "oCtal"), or even
to use a "t" for "ocTal" and an "n" for "biNary" to go along
with the "x" for "heXadecimal".

For the string % operator, "o" was already being used to denote
octal, and "b" was not used for anything, so this works out
much better than, for example, using "c" (which means "character"
for the % operator).

At the end of the day, since uppercase "O" can look like a zero
and uppercase "B" can look like an 8, it was suggested that these
prefixes be written in lowercase only, but, like 'r' for raw string,
that can be a preference or style-guide issue.

Open Issues
===========

It was suggested in the discussion that lowercase should be used
for all numeric and string special modifiers, such as 'x' for
hexadecimal, 'r' for raw strings, 'e' for exponentiation, and
'j' for complex numbers.  This is an issue for a separate PEP.

This PEP takes no position on uppercase or lowercase for input,
just noting that, for consistency, if uppercase is not to be
removed from input parsing for other letters, it should be
added for octal and binary, and documenting the changes under
this assumption, as there is not yet a PEP about the case issue.

Output formatting may be a different story -- there is already
ample precedent for case sensitivity in the output format string,
and there would need to be a consensus that there is a valid
use-case for the "alternate form" of the string % operator
to support uppercase 'B' or 'O' characters for binary or
octal output.  Currently, PEP 3101 does not even support this
alternate capability, and the hex() function does not allow
the programmer to specify the case of the 'x' character.
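For reference, this is how case already behaves for hexadecimal output, sketched under the assumption of a Python 3 interpreter.  No uppercase binary or octal form is shown, since that is exactly the open question.

```python
# hex() always produces a lowercase prefix and lowercase digits.
from_hex = hex(255)

# The format string selects the case of both prefix and digits.
lower = "%#x" % 255
upper = "%#X" % 255
upper_new_style = format(255, "#X")

print(from_hex, lower, upper, upper_new_style)
```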

There are still some strong feelings that '0123' should be
allowed as a literal decimal in Python 3.0.  If this is the
right thing to do, this can easily be covered in an additional
PEP.  This proposal only takes the first step of making '0123'
not be a valid octal number, for reasons covered in the rationale.

Is there (or should there be) an option for the 2to3 translator
which only makes the 2.6 compatible changes?  Should this be
run on 2.6 library code before the 2.6 release?

Should a bin() function which matches hex() and oct() be added?
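For comparison, later Python releases do provide a bin() built-in alongside hex() and oct().  The sketch below assumes such an interpreter and only shows the output shape, which matches the proposed literal syntax.

```python
text = bin(5)
negative = bin(-5)
print(text, negative)

# Like oct() and hex(), the output is a valid literal for int(s, 0).
print(int(text, 0))
```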

Is hex() really that useful once we have advanced string formatting?

References
==========

.. [1] GNU libc manual printf integer format conversions
   (http://www.gnu.org/software/libc/manual/html_node/Integer-Conversions.html)

.. [2] Python string formatting operations
   (http://docs.python.org/lib/typesseq-strings.html)

.. [3] The Representation of Numbers, Jiajie Zhang and Donald A. Norman
    (http://acad88.sahs.uth.tmc.edu/research/publications/Number-Representation.pdf)

.. [4] ENIAC page at wikipedia
    (http://en.wikipedia.org/wiki/ENIAC)

.. [5] BCD page at wikipedia
    (http://en.wikipedia.org/wiki/Binary-coded_decimal)

Copyright
=========

This document has been placed in the public domain.