[Python-ideas] a new bytestring type?

Mon Jan 6 12:52:33 CET 2014

I didn't receive Stephen's email, so forgive me for replying through a reply…

From: Geert Jansen <geertj at gmail.com>
Sent: Monday, January 6, 2014 3:19 AM

> On Mon, Jan 6, 2014 at 11:57 AM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> 
>>   > I'm not missing a new type, but I am missing the format method on the
>>   > binary types.
>> 
>>  I'm curious about precisely what your use cases are, and just what
>>  formatting they need.

Besides Geert's chunked HTTP example, there are tons of Internet protocols and file formats (including Python source code!) that have ASCII headers (which in some way define an encoding for the actual payload). So things like b'Content-Length: {}'.format(len(payload)) or even b'Content-Type: text/html; charset={}'.format(encoding) are useful.

>> … I would consider
>> 
>>      b'{0:d}'.format(27) == b'27'             # insert the ASCII representation
>> 
>>  to be an abomination since there's no reason to suppose that any given
>>  bytestring is encoded in an ASCII-compatible way, or bigendian for
>>  that matter.  Ditto everything else that involves representing a
>>  number as a string of numeric characters.

Endianness isn't relevant here; b'{}'.format(32768) is b'32768', not b'\x80\x00' or b'\x00\x80'. That's what the d format means.
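
To make the distinction concrete, here is a small sketch. Since bytes has no format method today, the d formatting goes through str and an explicit ASCII encode; the variable names are just for illustration.

```python
n = 32768

# The "d" presentation type produces decimal digits, so byte order never comes up.
as_text = "{:d}".format(n).encode("ascii")

# Endianness only matters when packing the integer into its binary representation.
big = n.to_bytes(2, "big")
little = n.to_bytes(2, "little")

print(as_text)  # b'32768'
print(big)      # b'\x80\x00'
print(little)   # b'\x00\x80'
```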

As for assuming that it's ASCII-compatible: again, there are all kinds of protocols that work with any ASCII-compatible charset but don't work otherwise. Yeah, this can be a problem if you want to create an HTTP page or a Python source file in EBCDIC or UTF-16-LE, but even then, if the headers are interpreted as pure ASCII and then the payload is extracted and decoded separately, it still works. In fact, it works better than if people try to construct everything as text and then encode, which gives you illegal/unreadable EBCDIC headers; that's a common incorrect workaround among people familiar with Python 2 when they're forced to deal with Python 3.
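
A rough illustration of that failure mode, using UTF-16-LE rather than EBCDIC and a made-up one-header message (the names and the payload are assumptions, not anyone's real code):

```python
payload_text = "<p>hi</p>"
encoding = "utf-16-le"
payload = payload_text.encode(encoding)

header = "Content-Length: {}\r\n\r\n".format(len(payload))

# Incorrect workaround: build everything as text, then encode the whole thing.
# The header ends up in UTF-16-LE too, so an ASCII header parser can't read it.
wrong = (header + payload_text).encode(encoding)

# Headers as pure ASCII, payload encoded separately and appended as bytes.
right = header.encode("ascii") + payload

print(wrong[:8])   # b'C\x00o\x00n\x00t\x00'
print(right[:18])  # b'Content-Length: 18'
```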

Obviously you could solve most of the same problems by formatting the headers as text, encoding them to ASCII, then concatenating the payload. And I'm not really worried about performance issues with that. But I am worried about convenience and readability—compare the desired and actual versions of Geert's code.
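
For reference, that workaround looks roughly like the sketch below. build_message is a hypothetical helper, not Geert's actual code, and the header set is only an example.

```python
def build_message(payload: bytes, encoding: str) -> bytes:
    """Format the headers as text, encode them to ASCII, then append the payload."""
    headers = (
        "Content-Type: text/html; charset={}\r\n"
        "Content-Length: {}\r\n"
        "\r\n"
    ).format(encoding, len(payload))
    # The payload is already bytes in its own encoding, so it is appended untouched.
    return headers.encode("ascii") + payload


message = build_message("hi".encode("utf-16-le"), "utf-16-le")
print(message)
```

It works, but every header has to make a round trip through str before it can meet the bytes it describes.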

As I said in my other email, I might be happy assuming ASCII-strict for everything that isn't a buffer, and copying bytes as-is for everything that is. That _might_ be more of an attractive nuisance than a useful feature, but… it definitely is attractive, and I'm not sure it's a nuisance.
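
A minimal sketch of what those semantics could look like, written as an ordinary function. bformat is purely hypothetical: it handles only bare {} fields, and it treats bytes, bytearray and memoryview as "buffers", which is my assumption about where the line would be drawn.

```python
def bformat(template: bytes, *args) -> bytes:
    """Hypothetical helper: fill bare b'{}' fields in a bytes template.

    Buffers are copied as-is; everything else is formatted as text and
    encoded as strict ASCII, so non-ASCII text raises UnicodeEncodeError.
    """
    parts = template.split(b"{}")
    if len(parts) - 1 != len(args):
        raise ValueError("number of {} fields does not match number of arguments")
    out = [parts[0]]
    for arg, tail in zip(args, parts[1:]):
        if isinstance(arg, (bytes, bytearray, memoryview)):
            out.append(bytes(arg))  # buffer: copy the raw bytes unchanged
        else:
            out.append(format(arg).encode("ascii"))  # non-buffer: ASCII-strict
        out.append(tail)
    return b"".join(out)


payload = b"\xff\x00\x01"
print(bformat(b"Content-Length: {}\r\n\r\n{}", len(payload), payload))
```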