[Python-Dev] Shorter float repr in Python 3.1?

Tue Apr 7 16:39:47 CEST 2009

Executive summary (details and discussion points below)
=================

Some time ago, Noam Raphael pointed out that for a float x,
repr(x) can often be much shorter than it currently is, without
sacrificing the property that eval(repr(x)) == x, and proposed
changing Python accordingly.  See

http://bugs.python.org/issue1580

For example, instead of the current behaviour:

Python 3.1a2+ (py3k:71353:71354, Apr  7 2009, 12:55:16)
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 0.01
0.01
>>> 0.02
0.02
>>> 0.03
0.029999999999999999
>>> 0.04
0.040000000000000001
>>> 0.04 == eval(repr(0.04))
True

we'd have this:

Python 3.1a2+ (py3k-short-float-repr:71350:71352M, Apr  7 2009, )
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 0.01
0.01
>>> 0.02
0.02
>>> 0.03
0.03
>>> 0.04
0.04
>>> 0.04 == eval(repr(0.04))
True

Initial attempts to implement this encountered various
difficulties, and at some point Tim Peters pointed out
(I'm paraphrasing horribly here) that one can't have all
three of {fast, easy, correct}.

One PyCon 2009 sprint later, Eric Smith and I have
produced the py3k-short-float-repr branch, which implements
short repr of floats and also does some major cleaning
up of the current float formatting functions.
We've gone for the {fast, correct} pairing.
We'd like to get this into Python 3.1.

Any thoughts/objections/counter-proposals/...?

More details
============
Our solution is based on an adaptation of David Gay's
'perfect rounding' code for inclusion in Python.  To make
eval(repr(x)) roundtripping work, one needs to have
correctly rounded float -> decimal *and* decimal -> float
conversions:  Gay's code provides correctly rounded
dtoa and strtod functions for these two conversions.
His code is well-known and well-tested:  it's used as the
basis of the glibc strtod, and is also in OS X.  It's
available from

http://www.netlib.org/fp/dtoa.c
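
To make the "short repr" idea concrete, here is a brute-force sketch.
It is not Gay's algorithm, and the name shortest_roundtrip is made up
for illustration.  It tries 1 to 17 significant digits and keeps the
first string that converts back to the same float, so it relies on the
underlying '%g' formatting and float() both being correctly rounded.

```python
def shortest_roundtrip(x):
    """Return the shortest '%.{n}g' string that converts back to x exactly.

    Brute force for illustration only; this is not Gay's algorithm.
    """
    for ndigits in range(1, 18):
        # '%.*g' takes the number of significant digits from the tuple.
        candidate = '%.*g' % (ndigits, x)
        if float(candidate) == x:
            return candidate
    # Only a nan gets here: it never compares equal to itself.
    raise ValueError('no round-tripping string found for %r' % (x,))


for value in (0.03, 0.04):
    # Left: the 17-significant-digit form; right: the shortest form.
    print('%.17g' % value, '->', shortest_roundtrip(value))
```

Seventeen significant digits are always enough to round-trip an IEEE 754
double, which is why the loop stops there.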

So our branch contains a new file Python/dtoa.c,
which is a cut-down version of Gay's original file. (We've
removed stuff for VAX and IBM floating-point formats,
hex NaNs, hex floating-point formats, locale-aware
interpretation of the decimal separator, K&R headers,
code for correct setting of the inexact flag, and various
other bits and pieces that Python doesn't care about.)

Most of the rest of the work is in the existing file
Python/pystrtod.c.  Every float -> string or string -> float
conversion goes through a function in this file at
some point.

Gay's code also provides the opportunity to clean
up the current float formatting code, and Eric has
reworked a lot of the float formatting in the py3k-short-float-repr
branch.  This reworking should make finishing off the
implementation of things like thousands separators much
more straightforward.

One example of this:  the previous string -> float conversion
used the system strtod, which is locale-aware, so the code
had to first replace the '.' by the current locale's decimal
separator, *then* call strtod.  There was a similar dance in
the reverse direction when doing float -> string conversion.
Both of these steps are now unnecessary.
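
The string -> float half of that dance can be sketched roughly as
follows.  Both function names are hypothetical, and locale_aware_parse
is only a stand-in that mimics a locale-aware C strtod; it is not the
code that was in pystrtod.c.

```python
def locale_aware_parse(text, decimal_point):
    """Hypothetical stand-in for a locale-aware C strtod.

    It accepts only `decimal_point` as the decimal separator.
    """
    if decimal_point != '.' and '.' in text:
        raise ValueError('unexpected %r in %r' % ('.', text))
    return float(text.replace(decimal_point, '.'))


def old_style_string_to_float(text, decimal_point):
    """Swap '.' for the locale's separator, then call the locale-aware parser."""
    return locale_aware_parse(text.replace('.', decimal_point), decimal_point)


# Pretend the current locale uses ',' as its decimal separator.
print(old_style_string_to_float('1.5', ','))
```

With a parser that always treats '.' as the separator, the replace step
and the dependence on the current locale both disappear.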

The current code is pretty close to ready for merging
to py3k.  I've uploaded a patchset to Rietveld:

http://codereview.appspot.com/33084/show

Apart from the short float repr, and a couple of bugfixes,
all behaviour should be unchanged from before.  There
are a few exceptions:

 - format(1e200, '<') doesn't behave quite as it did
   before.  See item (3) below for details.

 - repr switches to using exponential notation at
   1e16 instead of the previous 1e17.  This avoids
   a subtle issue where the 'short float repr' result
   is padded with bogus zeros.

 - a similar change applies to str, which switches
   to exponential notation at 1e11, not 1e12.  This
   fixes the following minor annoyance, which goes
   back at least as far as Python 2.5 (and probably
   much further):

   >>> x = 1e11 + 0.5
   >>> x
   100000000000.5
   >>> print(x)
   100000000000.0

   That .0 seems wrong to me:  if we're going to
   go to the trouble of printing extra digits (str
   usually only gives 12 significant digits; here
   there are 13), they should be the *right* extra digits.
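
A small sketch of the last two items, meant to be run on an interpreter
that has the short float repr.  The str part assumes that the old str
amounted to '%.12g' formatting with '.0' appended to integer-looking
results; the name old_style_str is made up for illustration.

```python
# repr: fixed notation just below 1e16, exponential notation from 1e16 up.
print(repr(1e15))
print(repr(1e16))

# str: emulate the assumed old behaviour ('%.12g', then append '.0').
x = 1e11 + 0.5
twelve_digits = '%.12g' % x
old_style_str = twelve_digits + '.0' if twelve_digits.isdigit() else twelve_digits

# The appended '.0' is a 13th digit that does not match x.
print(old_style_str, float(old_style_str) == x)
```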

Discussion points
=================

(1) Any objections to including this in py3k?  If there's
controversy, then I guess we'll need a PEP.

(2) Should other Python implementations (Jython,
IronPython, etc.) be expected to use short float repr, or should
it just be considered an implementation detail of CPython?
I propose the latter, except that all implementations should
be required to satisfy eval(repr(x)) == x for finite floats x.
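
One way an implementation could spot-check that requirement is a
randomized test along these lines.  This is only a sketch: the helper
names, the seed and the sample size are arbitrary choices, not part of
the proposal.

```python
import math
import random
import struct


def random_finite_float(rng):
    """Draw a float from a uniformly random 64-bit pattern, skipping inf/nan."""
    while True:
        (x,) = struct.unpack('<d', struct.pack('<Q', rng.getrandbits(64)))
        if math.isfinite(x):
            return x


def roundtrips(x):
    """The required property: eval(repr(x)) == x."""
    return eval(repr(x)) == x


rng = random.Random(12345)  # fixed seed so the run is repeatable
samples = (random_finite_float(rng) for _ in range(10000))
failures = [x for x in samples if not roundtrips(x)]
print(len(failures))
```

Random bit patterns cover the whole exponent range, including
subnormals, which uniform sampling from [0, 1) would not.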

(3) There's a PEP 3101 line we don't know what to do with.
In py3k, we currently have:

>>> format(1e200, '<')
'1.0e+200'

but in our py3k-short-float-repr branch:

>>> format(1e200, '<')
'1e+200'

Which is correct? The py3k behaviour
comes from the 'Standard Format Specifiers' section of
PEP 3101, where it says:

"""
The available floating point presentation types are:

[... list of other format codes omitted here ...]

'' (None) - similar to 'g', except that it prints at least one
              digit after the decimal point.
"""

It's that 'at least one digit after the decimal point' bit
that's at issue.  I understood this to apply only to
floats converted to a string *without* an exponent;
this is the way that repr and str work, adding a .0
to floats formatted without an exponent, but leaving
the .0 out when the exponent is present.

Should the .0 always be added?  Or is it required
only when it would be necessary to distinguish
a float string from an integer string?

My preference is for the latter (i.e., format(x, '<')
should behave in the same way as repr and str
in this respect).  But I'm biased, not least because
the other behaviour would be a pain to implement.
Does anyone care?
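
For comparison, this snippet prints repr, the typeless format and 'g'
side by side.  It assumes an interpreter that behaves like the
py3k-short-float-repr branch, where the typeless format adds .0 only
when no exponent is present.

```python
for value in (1.0, 1e15, 1e200):
    # Columns: repr, typeless format (fill/align only), explicit 'g'.
    print(repr(value), format(value, '<'), format(value, 'g'))
```

On such an interpreter the first two columns agree for every value,
while 'g' drops the .0 from 1.0.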

This email is already too long.  I'll stop now.

Mark