[Python-ideas] Bytes formatting (was Re: Adding 'bytes' as alias for 'latin_1' codec)
Terry Reedy
tjreedy at udel.edu
Mon May 30 22:04:36 CEST 2011
Changing the subject to what it has actually become.
On 5/27/2011 5:27 AM, Nick Coghlan wrote:
> We can almost certainly do better when it comes to constructing byte
> sequences from component parts, but simply saying "oh, just add a
> format() method to bytes objects" doesn't cut it, since the associated
> magic methods for str.format are all string based,
STRING FORMATTING
From a modern and Python viewpoint, string formatting is about
interpolating text representations of objects into a text template. By
default, the text representation is str(object).
Exception 1. str.format has an optional conversion specifier "!s/r/a" to
specify repr(object) or ascii(object) instead of str(object). (It can
also be used to override exception 2.) This is not relevant to bytes
formatting.
Exception 2. str.format, like % formatting, does special processing of
numbers. Electronic computing was originally used only to compute
numbers and text formatting was originally about formatting numbers,
usually in tables, with optional text decoration. That is why the
maximum field size for string interpolation is still called 'precision'.
There are numerous variations in number formatting and most of the
complication of format specifications arises therefrom.
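To make the two exceptions concrete, here is a small illustration using only
existing str.format behavior. The variable names are arbitrary; nothing here
is part of any proposal.

```python
# Exception 1: a conversion specifier picks the text representation.
value = 'caf\u00e9'
as_str = '{}'.format(value)       # uses str(value)
as_repr = '{!r}'.format(value)    # uses repr(value)
as_ascii = '{!a}'.format(value)   # uses ascii(value)

# Exception 2: numbers get special processing. For a float, 'precision'
# is the number of digits after the decimal point ...
number_field = '{:8.3f}'.format(3.14159)

# ... while for a string the same field is the maximum field size.
string_field = '{:.3}'.format('abcdef')

# '!s' overrides exception 2: the int is converted to text first, so the
# string meaning of precision applies.
overridden = '{!s:.2}'.format(12345)

print(as_str, as_repr, as_ascii)
print(repr(number_field), repr(string_field), repr(overridden))
```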
BYTES FORMATTING
If the desired result consists entirely of text encoded with one
encoding, the current recommended method is to construct the text and
encode. I think this is the proper method and do not think that anything
we add should be aimed at this use case.
There are two other current methods to assemble bytes from pieces. One
is concatenation; it has the same advantages and disadvantages as string
concatenation. Another, overlooked in the current discussion so far, is
in-place editing of a bytearray by index and slice assignment. It has
the disadvantage of having to know the correct indexes and slice points.
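For reference, the three current methods can be sketched as follows. The
header line and the 'HDR' record layout are made up for illustration only;
they do not come from any real protocol.

```python
# 1. Mono-encoded text: construct the text, then encode once.
line = 'Content-Length: {}\r\n'.format(42).encode('ascii')

# 2. Concatenation of bytes pieces (here: tag, 2-byte length, payload).
payload = b'\x00\x01binary'
message = b'HDR' + len(payload).to_bytes(2, 'big') + payload

# 3. In-place editing of a bytearray by index and slice assignment.
#    The indexes 0 and 3:5 have to be known in advance.
buf = bytearray(b'HDR\x00\x00')
buf[3:5] = len(payload).to_bytes(2, 'big')   # fill in the length slot
buf[0] = ord('h')                            # overwrite a single byte
buf += payload

print(line)
print(message)
print(bytes(buf))
```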
If we add another bytes formatting function or method, I think it should
be about interpolating bytes into a bytes template. The use cases would
be anything other than mono-encoded text -- text with multiple encodings
or non-text bytes possibly intermixed with encoded text.
> and bytes interpolation also needs to address encoding issues
> for anything that isn't already a byte sequence.
As indicated above, I disagree if 'encoding' means 'text encoding'.
Let .encode handle encoding issues.
PROPOSAL
A bytes template uses b'{' and b'}' to mark interpolation fields and
other ascii bytes within as needed. It uses the ascii equivalent of the
string field_name spec. It does not have a conversion spec. The
format_spec should have the minimum needed for existing public
protocols. How much more is up for discussion. We need use cases.
One possibility to keep in mind is that a bytes template could be
constructed by an ascii-compatible encoding of formatted text. Specs for
bytes fields can be protected in a text template by doubling the braces.
>>> '{} {{byte-field-spec}}'.format(1).encode()
b'1 {byte-field-spec}'
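A minimal sketch of what such interpolation might look like, under heavy
assumptions: format_bytes is a hypothetical helper, not an existing or
agreed-upon API. It handles only b'{}', b'{N}' and doubled braces, accepts
only bytes-like arguments (the baseline described below), and ignores
format_spec entirely.

```python
import re

# Matches escaped braces or a positional field such as b'{}' or b'{2}'.
_FIELD = re.compile(rb'\{\{|\}\}|\{(\d*)\}')

def format_bytes(template, *args):
    """Hypothetical helper: interpolate bytes arguments into a bytes template."""
    auto_index = 0

    def replace(match):
        nonlocal auto_index
        token = match.group(0)
        if token == b'{{':
            return b'{'
        if token == b'}}':
            return b'}'
        index_text = match.group(1)
        if index_text:
            index = int(index_text)      # explicit position, e.g. b'{1}'
        else:
            index = auto_index           # automatic numbering for b'{}'
            auto_index += 1
        arg = args[index]
        if not isinstance(arg, (bytes, bytearray)):
            raise TypeError('format_bytes arguments must be bytes-like')
        return bytes(arg)

    return _FIELD.sub(replace, template)

# Two-stage use: format and encode the text part, leaving a protected
# bytes field behind, then interpolate raw bytes into the result.
template = '{} {{}}'.format('LEN').encode('ascii')
result = format_bytes(template, b'\x00\x08')
print(template, result)
```

Unlike str.format, this sketch does not reject mixing automatic and explicit
field numbering; a real design would have to decide that.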
A major issue is what to do with numbers. Sometimes they need to be
ascii encoded, sometimes binary encoded. The baseline is to do nothing
extra and require all args to be bytes. I think this may be appropriate
for floats as they are seldom specifically used in protocols. I think
the same may be true for ints with signs. So I think we mainly need to
consider counts (unsigned ints) for possible exceptional processing.
Option 0. As stated, no special number specs.
Option 1. Use a subset of the current int spec to produce ascii
encodings; use struct.pack for binary encodings. (How many of the
current integer presentation types would be needed?)
Option 2. Use an adaptation of the struct.pack mini-language to produce
binary encodings; use encoded str.format for ascii encodings. (The
latter might be done as part of a text-to-bytes-template process as
indicated above.)
Option 3. Combine options 1 and 2. This might best be done by replacing
the omitted 'conversion' field with a 'number-encoding' field, b'!a' or
b'!b', to indicate ascii or binary conversion and corresponding
interpretation of the format spec. (In other words, do not try to
combine the number to text and number to binary mini-languages, but add
a 'prefix' to specify which is being used.)
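To show what the two kinds of number encoding produce, and how an Option 3
style dispatch might work, here is a rough sketch. The format() and
struct.pack calls are existing behavior; encode_count and its 'a'/'b'
conversion argument are hypothetical and only illustrate the idea of a
prefix selecting the mini-language.

```python
import struct

count = 258

# ASCII encodings of a count: format as text, then encode.
ascii_decimal = format(count, 'd').encode('ascii')
ascii_hex = format(count, 'x').encode('ascii')

# Binary encodings of the same count with the struct.pack mini-language.
binary_be16 = struct.pack('>H', count)   # big-endian unsigned 16-bit
binary_le32 = struct.pack('<I', count)   # little-endian unsigned 32-bit

def encode_count(value, conversion, spec):
    """Hypothetical Option 3 dispatch: conversion selects the mini-language."""
    if conversion == b'a':
        # Interpret spec as an int format spec and ascii-encode the result.
        return format(value, spec.decode('ascii')).encode('ascii')
    if conversion == b'b':
        # Interpret spec as a struct format string.
        return struct.pack(spec.decode('ascii'), value)
    raise ValueError('unknown number-encoding conversion')

print(ascii_decimal, ascii_hex, binary_be16, binary_le32)
print(encode_count(count, b'a', b'05d'), encode_count(count, b'b', b'>H'))
```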
--
Terry Jan Reedy