[Python-ideas] Bytes formatting (was Re: Adding 'bytes' as alias for 'latin_1' codec)

Mon May 30 22:04:36 CEST 2011

Changing the subject to what it has actually become.

On 5/27/2011 5:27 AM, Nick Coghlan wrote:

> We can almost certainly do better when it comes to constructing byte
> sequences from component parts, but simply saying "oh, just add a
> format() method to bytes objects" doesn't cut it, since the associated
> magic methods for str.format are all string based,

STRING FORMATTING

From a modern Python viewpoint, string formatting is about 
interpolating text representations of objects into a text template. By 
default, the text representation is str(object).

Exception 1. str.format has an optional conversion specifier "!s/r/a" to 
specify repr(object) or ascii(object) instead of str(object). (It can 
also be used to override exception 2.) This is not relevant to bytes 
formatting.
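
As a small illustration of the conversion specifier, using only current 
str.format behavior (the variable names are just for this example):

```python
word = 'caf\u00e9'

# The conversion specifier selects which text representation is interpolated.
as_str = '{!s}'.format(word)     # uses str(word)
as_repr = '{!r}'.format(word)    # uses repr(word)
as_ascii = '{!a}'.format(word)   # uses ascii(word)

# Without a conversion, an int gets number processing (right-aligned).
number_default = '{:5}'.format(42)
# With !s, the int is converted to text first, so it is left-aligned like a string.
number_as_text = '{!s:5}'.format(42)

print(as_str, as_repr, as_ascii)
print(repr(number_default), repr(number_as_text))
```

The last two lines show the sense in which a conversion can override the 
special processing of numbers.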

Exception 2. str.format, like % formatting, does special processing of 
numbers. Electronic computing was originally used only to compute 
numbers and text formatting was originally about formatting numbers, 
usually in tables, with optional text decoration. That is why the 
maximum field size for string interpolation is still called 'precision'. 
There are numerous variations in number formatting and most of the 
complication of format specifications arises therefrom.
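
To make the 'precision' point concrete, here is a short sketch with 
current str.format; the same ".3" means a maximum field size for a string 
but a number of digits for a float:

```python
# For a string argument, precision truncates: it is a maximum field size.
truncated = '{:.3}'.format('abcdef')

# For a float, the same field controls how many digits are shown.
significant = '{:.3}'.format(3.14159)   # three significant digits
fixed = '{:.3f}'.format(3.14159)        # three digits after the point

print(truncated, significant, fixed)
```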

BYTES FORMATTING

If the desired result consists entirely of text encoded with one 
encoding, the current recommended method is to construct the text and 
encode. I think this is the proper method and do not think that anything 
we add should be aimed at this use case.
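
For reference, the construct-then-encode method looks like this (the 
status line is only an example value):

```python
status = 200
reason = 'OK'

# Build the whole message as text, then encode once at the end.
line = 'HTTP/1.1 {} {}\r\n'.format(status, reason).encode('ascii')

print(line)
```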

There are two other current methods to assemble bytes from pieces. One 
is concatenation; it has the same advantages and disadvantages as string 
concatenation. Another, overlooked in the current discussion so far, is 
in-place editing of a bytearray by index and slice assignment. It has 
the disadvantage of having to know the correct indexes and slice points.
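
A minimal sketch of both existing methods; the byte values and the 
positions 4:6 are made up for the example:

```python
header = b'\x89PNG'
payload = b'\x00\x01'

# Concatenation: simple, but each + creates a new bytes object.
packet = header + payload
joined = b''.join([header, payload])   # the usual alternative for many pieces

# In-place editing: the caller must already know where the field lives.
buf = bytearray(b'LEN=00;DATA')
buf[4:6] = b'42'        # slice assignment takes a bytes-like object
buf[0] = ord('l')       # index assignment takes an int, not a bytes object

print(packet, bytes(buf))
```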

If we add another bytes formatting function or method, I think it should 
be about interpolating bytes into a bytes template. The use cases would 
be anything other than mono-encoded text -- text with multiple encodings 
or non-text bytes possibly intermixed with encoded text.

> and bytes interpolation also needs to address encoding issues
> for anything that isn't already a byte sequence.

As indicated above, I disagree if 'encoding' means 'text encoding'.
Let .encode handle encoding issues.

PROPOSAL

A bytes template uses b'{' and b'}' to mark interpolation fields and 
other ascii bytes within as needed. It uses the ascii equivalent of the 
string field_name spec. It does not have a conversion spec. The 
format_spec should have the minimum needed for existing public 
protocols. How much more is up for discussion. We need use cases.
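
A rough sketch of the baseline behavior, to have something concrete to 
discuss. The function bytes_format is hypothetical (nothing like it 
exists in the stdlib), it handles only empty and numeric field names, and 
it has no format_spec at all:

```python
import re

# Matches an escaped brace pair, or a field with an optional positional index.
_FIELD = re.compile(rb'\{\{|\}\}|\{(\d*)\}')

def bytes_format(template, *args):
    """Hypothetical helper: interpolate bytes args into b'{}' or b'{N}' fields."""
    auto = iter(range(len(args)))   # automatic numbering for empty fields

    def replace(match):
        text = match.group(0)
        if text == b'{{':
            return b'{'
        if text == b'}}':
            return b'}'
        index = match.group(1)
        position = int(index) if index else next(auto)
        value = args[position]
        # Baseline rule: no implicit encoding, every argument must be bytes.
        if not isinstance(value, (bytes, bytearray)):
            raise TypeError('bytes fields accept only bytes-like arguments')
        return bytes(value)

    return _FIELD.sub(replace, template)

request = bytes_format(b'{} {}\r\n', b'GET', b'/index.html')
swapped = bytes_format(b'{1}{0}', b'a', b'b')
escaped = bytes_format(b'{{literal}}')
print(request, swapped, escaped)
```

Passing a str or an int raises TypeError here, which corresponds to 
leaving all encoding issues to .encode.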

One possibility to keep in mind is that a bytes template could be 
constructed by an ascii-compatible encoding of formatted text. Specs for 
bytes fields can be protected in a text template by doubling the braces.

 >>> '{} {{byte-field-spec}}'.format(1).encode()
b'1 {byte-field-spec}'

A major issue is what to do with numbers. Sometimes they need to be 
ascii encoded, sometimes binary encoded. The baseline is to do nothing 
extra and require all args to be bytes. I think this may be appropriate 
for floats as they are seldom specifically used in protocols. I think 
the same may be true for ints with signs. So I think we mainly need to 
consider counts (unsigned ints) for possible exceptional processing.

Option 0. As stated, no special number specs.

Option 1. Use a subset of the current int spec to produce ascii 
encodings; use struct.pack for binary encodings. (How many of the 
current integer presentation types would be needed?)
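
What the two halves of option 1 produce can be shown with tools that 
exist today; the count 258 and the chosen specs are only examples:

```python
import struct

count = 258   # 0x0102

# Ascii encodings, using presentation types from the current int spec.
ascii_decimal = format(count, 'd').encode('ascii')
ascii_hex = format(count, '04x').encode('ascii')

# Binary encoding, left to struct.pack (big-endian unsigned 16-bit here).
binary_be = struct.pack('>H', count)

print(ascii_decimal, ascii_hex, binary_be)
```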

Option 2. Use an adaptation of the struct.pack mini-language to produce 
binary encodings; use encoded str.format for ascii encodings. (The 
latter might be done as part of a text-to-bytes-template process as 
indicated above.)
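
One way option 2 might look, as a sketch only. The field syntax b'{:>H}' 
and the function pack_fields are assumptions made up for illustration; 
the ascii count is handled earlier, in the text stage:

```python
import re
import struct

# Hypothetical binary field: b'{:' + optional byte order + one struct code + b'}'
_BINARY_FIELD = re.compile(rb'\{:([<>!=@]?[bBhHiIlLqQ])\}')

def pack_fields(template, *counts):
    """Hypothetical helper: replace each binary field with struct.pack output."""
    values = iter(counts)

    def replace(match):
        return struct.pack(match.group(1).decode('ascii'), next(values))

    return _BINARY_FIELD.sub(replace, template)

# Text stage: the ascii number is formatted by str.format, and the doubled
# braces protect the bytes field so it survives into the bytes template.
template = 'Content-Length: {}\r\n\r\n{{:>H}}'.format(2).encode('ascii')

# Bytes stage: the protected field is filled with a binary-encoded count.
message = pack_fields(template, 513)
print(template, message)
```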

Option 3. Combine options 1 and 2. This might best be done by replacing 
the omitted 'conversion' field with a 'number-encoding' field, b'!a' or 
b'!b', to indicate ascii or binary conversion and corresponding 
interpretation of the format spec. (In other words, do not try to 
combine the number to text and number to binary mini-languages, but add 
a 'prefix' to specify which is being used.)
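
A sketch of how the prefix could select the mini-language. The function 
format_counts and the exact field syntax are hypothetical; the point is 
only that b'!a' sends the spec to the int format spec and b'!b' sends it 
to struct.pack:

```python
import re
import struct

# Hypothetical number field: b'{!a:<int format spec>}' or b'{!b:<struct format>}'
_NUMBER_FIELD = re.compile(rb'\{!([ab]):([^{}]*)\}')

def format_counts(template, *counts):
    """Hypothetical helper: encode each count as ascii or binary, per its prefix."""
    values = iter(counts)

    def replace(match):
        kind = match.group(1)
        spec = match.group(2).decode('ascii')
        value = next(values)
        if kind == b'a':
            # ascii conversion: spec is read as an int format spec
            return format(value, spec).encode('ascii')
        # binary conversion: spec is read as a struct format string
        return struct.pack(spec, value)

    return _NUMBER_FIELD.sub(replace, template)

result = format_counts(b'{!a:04d}|{!b:>H}', 7, 7)
print(result)
```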

-- 
Terry Jan Reedy