[Python-Dev] PEP 461 updates

Thu Jan 16 17:42:58 CET 2014

Carl Meyer <carl at oddbird.net> wrote:
> I think the PEP could really use a rationale section summarizing _why_
> these formatting operations are being added to bytes

I agree.  My attempt at re-writing the PEP is below.

>> In order to avoid the problems of auto-conversion and
>> value-generated exceptions, all object checking will be done via
>> isinstance, not by values contained in a Unicode representation.
>> In other words::
>> 
>>   - duck-typing to allow/reject entry into a byte-stream
>>   - no value generated errors
>
> This seems self-contradictory; "isinstance" is type-checking, which is
> the opposite of duck-typing.

Again, I agree.  We should avoid isinstance checks if possible.

Abstract
========

This PEP proposes adding %-interpolation to the bytes object.

Rationale
=========

A disruptive but useful change introduced in Python 3.0 was the clean
separation of byte strings (i.e. the "bytes" object) from character
strings (i.e. the "str" object).  The benefit is that character
encodings must be explicitly specified and the risk of corrupting
character data is reduced.

Unfortunately, this separation has made writing certain types of
programs more complicated and verbose.  For example, programs that deal
with network protocols often manipulate ASCII-encoded strings.  Since
the "bytes" type does not support string formatting, extra encoding and
decoding to and from the "str" type is required.
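
As an illustration of the round trip described above, here is a
minimal sketch of building protocol header lines without bytes
formatting.  The function names and header values are hypothetical.

```python
def content_length_header(length):
    """Build an ASCII header line by formatting as str, then encoding."""
    # Format with str, because bytes has no formatting operation here.
    text = 'Content-Length: %d\r\n' % length
    # Encode the result to get the bytes that go on the wire.
    return text.encode('ascii')


def echo_header(name, value):
    """Combine existing bytes values: decode, format, encode again."""
    # name and value arrive as bytes, so both must be decoded first.
    text = '%s: %s\r\n' % (name.decode('ascii'), value.decode('ascii'))
    return text.encode('ascii')
```

The second function shows the extra verbosity most clearly: data that
is already bytes has to pass through "str" only to be formatted.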

For simplicity and convenience it is desirable to introduce formatting
methods to "bytes" that allow formatting of ASCII-encoded character
data.  This change would blur the clean separation of byte strings and
character strings.  However, it is felt that the practical benefits
outweigh the purity costs.  The implicit assumption of ASCII-encoding
would be limited to formatting methods.

One source of many problems with the Python 2 Unicode implementation is
the implicit coercion of Unicode character strings into byte strings
using the "ascii" codec.  If the character string contains only ASCII
characters, all is well.  However, if the string contains a non-ASCII
character then the coercion raises an exception.

The combination of implicit coercion and value-dependent failures has
proven to be a recipe for hard-to-debug errors.  A program may seem to
work correctly when tested (e.g. with string input that happened to be
ASCII only) but fail later, often with a traceback far from the source
of the real error.  The formatting methods for bytes should avoid this
problem by never implicitly encoding data when the encoding might fail
depending on the content of the data.
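
The failure mode can be sketched as follows.  The helper below is a
hypothetical emulation, written for Python 3, of the implicit "ascii"
coercion described above; it is not how Python 2 was implemented.

```python
def py2_style_concat(data, text):
    """Emulate an implicit 'ascii' coercion of text before joining."""
    # The encode step is hidden from the caller, as the coercion was.
    return data + text.encode('ascii')


# The same call succeeds or fails depending only on the value passed in.
for text in ('abc', 'caf\u00e9'):
    try:
        print(py2_style_concat(b'name=', text))
    except UnicodeEncodeError as exc:
        print('failed only for this value:', exc.reason)
```

A test suite that only ever passes 'abc' would not reveal the problem.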

Another desirable feature is to allow arbitrary user classes to be used
as formatting operands.  Generally this is done by introducing a special
method that can be implemented by the new class.

Proposed semantics for bytes formatting
=======================================

Special method __ascii__
------------------------

A new special method, analogous to __format__, is introduced.  This
method takes a single argument, a format specifier.  The return
value is a bytes object.  Objects that have an ASCII-only
representation can implement this method to allow them to be used as
format operands.  Objects with natural byte representations should
implement __bytes__ or the Py_buffer API.
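
A minimal sketch of a user class implementing the proposed method might
look like this.  The Temperature class is hypothetical, and since no
interpreter calls __ascii__ today, it is invoked here as an ordinary
method.

```python
class Temperature:
    """Hypothetical user class with an ASCII-only representation."""

    def __init__(self, kelvin):
        self.kelvin = kelvin

    def __ascii__(self, spec):
        # Proposed protocol: take a format specifier, return bytes.
        # Float formatting is reused; its output is ASCII for the
        # specifiers used here.
        return format(self.kelvin, spec).encode('ascii')


reading = Temperature(300.5)
print(reading.__ascii__('.2f'))
```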

%-interpolation
---------------

All the numeric formatting codes (such as %x, %o, %e, %f, %g, etc.)
will be supported, and will work as they do for str, including the
padding, justification and other related modifiers.  To avoid having to
introduce two special methods, the format specifications will be
translated to equivalent __format__ specifiers and the __ascii__ method
of each argument will be called.

Example::

   >>> b'%4x' % 10
   b'   a'
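
The translation step could be sketched as below.  This is only an
assumption about how such a translation might work, with hypothetical
helper names; it covers common flags, width and precision, and ignores
cases where the two mini-languages differ (for example, a precision
combined with an integer code).  Formatting via str and encoding
stands in for calling the proposed __ascii__ method.

```python
import re

# Matches a single %-style numeric conversion such as '%4x' or '%-8.3f'.
_PERCENT_SPEC = re.compile(r'%(?P<flags>[-+ 0#]*)(?P<width>\d*)'
                           r'(?P<precision>(?:\.\d+)?)'
                           r'(?P<code>[doxXeEfFgG])')


def percent_to_format_spec(percent_spec):
    """Translate one %-style numeric spec into a __format__ specifier."""
    match = _PERCENT_SPEC.fullmatch(percent_spec)
    if match is None:
        raise ValueError('unsupported conversion: %r' % percent_spec)
    flags = match.group('flags')
    # '%-4x' left-justifies; format() right-justifies numbers by default.
    align = '<' if '-' in flags else ''
    sign = '+' if '+' in flags else (' ' if ' ' in flags else '')
    alt = '#' if '#' in flags else ''
    # Zero padding is dropped when left-justifying, as %-formatting does.
    zero = '0' if '0' in flags and not align else ''
    return (align + sign + alt + zero + match.group('width')
            + match.group('precision') + match.group('code'))


def format_number_as_bytes(percent_spec, value):
    """Emulate e.g. b'%4x' % 10 for a single numeric operand."""
    # Stand-in for calling the proposed value.__ascii__(spec).
    spec = percent_to_format_spec(percent_spec)
    return format(value, spec).encode('ascii')
```

With this sketch, format_number_as_bytes('%4x', 10) gives the same
bytes as the example above.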

%c will insert a single byte, either from an int in range(256), or from
a bytes argument of length 1.

Example:

    >>> b'%c' % 48
    b'0'

    >>> b'%c' % b'a'
    b'a'

%s is restricted in what it will accept::

  - input type supports Py_buffer or has __bytes__?
    use it to collect the necessary bytes (may contain non-ASCII
    characters)

  - input type is something else?
    use its __ascii__ method; if there isn't one, raise TypeError
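
The two rules above can be sketched in pure Python.  The function name
is hypothetical, and because built-in numbers have no __ascii__ method
today, the sketch emulates one for int and float by encoding their
str() form; that isinstance check is part of the emulation only, not
of the proposal.

```python
def interpolate_s(value):
    """Sketch of how %s might pick the bytes for one operand."""
    # Rule 1a: objects exporting the buffer interface (bytes, bytearray).
    try:
        return bytes(memoryview(value))
    except TypeError:
        pass
    # Rule 1b: objects whose type defines __bytes__.
    if hasattr(type(value), '__bytes__'):
        return bytes(value)
    # Rule 2: anything else must provide the proposed __ascii__ method.
    ascii_method = getattr(type(value), '__ascii__', None)
    if ascii_method is not None:
        return ascii_method(value, '')
    # Emulation only: stand in for the __ascii__ that numbers would get.
    if isinstance(value, (int, float)):
        return str(value).encode('ascii')
    raise TypeError('%r has no __ascii__ method, perhaps you need to '
                    'encode it?' % (value,))
```

Note that a str operand is rejected regardless of its contents, so the
failure does not depend on the value.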

Examples:

    >>> b'%s' % b'abc'
    b'abc'

    >>> b'%s' % 3.14
    b'3.14'

    >>> b'%4s' % 12
    b'  12'

    >>> b'%s' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: 'hello world!' has no __ascii__ method, perhaps you need to encode it?

.. note::

   Because the str type does not have an __ascii__ method, attempts to
   directly use 'a string' as a bytes interpolation value will raise an
   exception.  To use 'string' values, they must be encoded or otherwise
   transformed into a bytes sequence::

      'a string'.encode('latin-1')

Unsupported % format codes
^^^^^^^^^^^^^^^^^^^^^^^^^^

%r (which calls __repr__) is not supported.

format
------

The format() method will not be implemented at this time but may be
added in a later Python release.  The __ascii__ method is designed
to make adding it later simpler.

Open Questions
==============

Do we need to support the complete set of format codes?  For complicated
formatting perhaps using the str object to do the formatting and
encoding the result is sufficient.

Should Python check that the bytes returned by __ascii__ are in
the range 0-127 (i.e. ASCII)?  That seems of little utility since
the error would be similar to a unicode-to-str coercion failure in
Python 2 and the traceback would normally be far removed from the
real error.  Built-in types would be designed to never return
non-ASCII characters from the __ascii__ method.
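
For concreteness, the check under discussion could look like the sketch
below.  The helper name is hypothetical.  Note that it can only fail
once a particular value produces a non-ASCII byte, which is the
value-dependent behaviour questioned above.

```python
def require_ascii(data):
    """Return data unchanged, or raise if any byte is outside 0-127."""
    for index, byte in enumerate(data):
        if byte > 127:
            raise ValueError('non-ASCII byte 0x%02x at position %d'
                             % (byte, index))
    return data
```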

Proposed variations
===================

Instead of introducing a new special method, have numeric types
implement __bytes__.

  - Adding __bytes__ to the int object is not backwards compatible.
    bytes(<int>) already has an incompatible meaning.
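
The existing meaning is easy to demonstrate; this short snippet relies
only on current Python 3 behaviour.

```python
# bytes(n) for an int n already means "n zero bytes", not the digits of n.
zero_filled = bytes(3)
print(zero_filled)

# Getting the ASCII digits currently requires going through str.
digits = str(3).encode('ascii')
print(digits)
```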

It has been suggested to use %b for bytes instead of %s.

  - Rejected, using %s will make porting code from Python 2 easier.

It was suggested to disallow %s from accepting numbers.

  - Rejected, to ease porting of Python 2 code, %s should accept
    number operands.

It has been proposed to automatically use .encode('ascii','strict') for str
arguments to %s.

  - Rejected as this would lead to intermittent failures.  Better to have the
    operation always fail so the trouble-spot can be correctly fixed.

It has been proposed to have %s return the ascii-encoded repr when the value
is a str (b'%s' % 'abc'  --> b"'abc'").

  - Rejected as this would lead to hard-to-debug failures far from the problem
    site.  Better to have the operation always fail so the trouble-spot can be
    easily fixed.

Instead of having %-interpolation call __ascii__, introduce a second
special method analogous to __str__ and have %s call it.

  - Rejected, __ascii__ is both necessary for implementing format()
    and sufficient for %-interpolation.  While implementing an
    __ascii__ method is more complicated due to the specifier
    argument, the number of classes which will do so is limited.

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: