[Python-ideas] Adding 'bytes' as alias for 'latin_1' codec.

Thu May 26 17:56:48 CEST 2011

On 2011-05-26, at 16:55 , Nick Coghlan wrote:

> On Thu, May 26, 2011 at 9:17 PM, Masklinn <masklinn at masklinn.net> wrote:
>> Considering the original use case, which seems to be mostly about being able to use .format, would it make more sense to be able to create "byte patterns", with formats similar to those of str.format but not identical (e.g. better control on layout would be nice, something similar to Erlang's bit syntax for putting binaries together).
>> 
>> This would be useful to put together byte sequences from existing values to e.g. output binary formats.
> 
> We already have an entire module dedicated to the task of handling
> binary formats: http://docs.python.org/py3k/library/struct
Sure, but:

1. It does not matter overly much: there are many cases where an existing solution did not stop the core team from agreeing that a problem was insufficiently well solved (latest instance: string formatting, where the current builtin solution was preceded by another builtin and at least one stdlib solution).

2. struct suffers from a bunch of issues:
  - it ranks low in discoverability: people who have not bit-twiddled much in C may not realize that a struct (in C) is just an interpretation pattern over a byte string, and the module is advertised as an interaction between Python and C structs, not as a tool for building arbitrary byte patterns
  - struct format strings are "wonky" (in that they're nothing like those of str.format)
  - struct format strings simply can't mix literal "character bytes" with format specs, which makes formats with fixed ASCII structures significantly less readable
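
To illustrate that last point, here is a minimal sketch. The record layout (a 4-byte ASCII tag, a big-endian 32-bit length, then the payload) and the "DATA" tag are made up for the example; only struct.pack itself is real:

```python
import struct

# Hypothetical record: 4-byte ASCII tag, big-endian 32-bit length, payload.
payload = b"hello"

# The literal tag cannot appear in the format string itself, so it has to
# be passed as just another argument and matched by a "4s" spec.
record = struct.pack(">4sI", b"DATA", len(payload)) + payload

print(record)
```

The fixed "DATA" marker ends up in the argument list, away from the pattern that describes the layout, which is the readability problem I mean.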

> "format(n, '6d').encode('ascii')" is the right way to get the string
> representation of a number as ASCII bytes. However, the programmer
> needs to be aware that concatenating those bytes with an encoding that
> is not ASCII compatible (such as UTF-16, UTF-32, or many of the Asian
> encodings) will result in a sequence of unusable garbage. It is far,
> far safer to transform everything into the text domain, work with it
> there, then encode back when the manipulation is complete.
Sure, but as you noted this is not even always done in the stdlib, so why would third-party developers be expected to be in a better situation?

And between jumping through a semi-arbitrary decode/encode cycle whose semantics are completely ignored and being able to just specify a bytes pattern, which seems stranger?
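
For concreteness, a small sketch of the encode step and the pitfall described in the quoted text. The "size:" label and the choice of UTF-16-LE are assumptions picked purely for illustration:

```python
n = 42

# The approach from the quoted text: format as text, then encode to ASCII bytes.
field = format(n, '6d').encode('ascii')
assert field == b'    42'

# Pitfall: appending those bytes to data in an ASCII-incompatible encoding.
# (The "size:" label and UTF-16-LE are made up for this example.)
header = 'size:'.encode('utf-16-le')
mixed = header + field
# The six ASCII bytes get read as three unrelated UTF-16 code units.
assert mixed.decode('utf-16-le') != 'size:    42'

# Staying in the text domain and encoding once at the end round-trips correctly.
safe = ('size:' + format(n, '6d')).encode('utf-16-le')
assert safe.decode('utf-16-le') == 'size:    42'
```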

And I'm probably overstating its importance, but Erlang seems to do rather well with its bit syntax, which is much closer to str.format than to struct.pack (in API, in looks, in complexity, …).