[Python-ideas] a new bytestring type?

Mon Jan 6 12:19:08 CET 2014

On Mon, Jan 6, 2014 at 11:57 AM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

>  > I'm not missing a new type, but I am missing the format method on the
>  > binary types.
>
> I'm curious about precisely what your use cases are, and just what
> formatting they need.

One use case I came across was creating chunks for the HTTP chunked
transfer encoding. Each chunk contains an ASCII header, a raw/encoded
chunk body, and an ASCII trailer. With a bytes.format, it would look
like this:

  chunk = b'{0:X}\r\n{1}\r\n'.format(len(buf), buf)

This is what I am using now:

  chunk = bytearray()
  chunk.extend('{0:X}\r\n'.format(len(buf)).encode('ascii'))
  chunk.extend(buf)
  chunk.extend(b'\r\n')
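
Wrapped up as a helper, the workaround could be sketched roughly as
follows. The function name make_chunk and the sample payload are only
illustrative; the framing (hex length, CRLF, body, CRLF) is the same
as in the snippet above, and only the length goes through str
formatting.

```python
def make_chunk(buf):
    """Frame buf as one chunk: hex length, CRLF, body, CRLF."""
    # Only the length is formatted as text; it is plain ASCII by construction.
    header = '{0:X}\r\n'.format(len(buf)).encode('ascii')
    # The body is passed through untouched, whatever its encoding.
    return b''.join([header, buf, b'\r\n'])


chunk = make_chunk(b'hello world')
```

The body never gets decoded or re-encoded here, which is the part that
a bytes.format would make less clumsy to write.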

Regards,
Geert

>
> The problem that Python 2 code has over and over imposed on me is that
> the temptation to avoid the overhead of conversion to and then from
> unicode when processing text by just using str results in the
> equivalent of
>
>     bs1 = returns_a_bytestring_encoded_in_utf8()
>     bs2 = returns_a_bytestring_encoded_in_koi8()
>
>     bs3 = b'{0} {1}'.format(bs1, bs2)
>     # and lose big when something expects valid UTF-8 in bs3
>
> In low-level code, the assignments to bs1, bs2, and bs3 are likely to
> be in three separate contexts, even three separate modules.  I
> understand about consenting adults, but it's just too hard to enforce
> good practice here if you make it easy to pass around and operate on
> encoded bytestrings.  I don't see how you avoid this pitfall, except
> by making it easier to pass around Unicode than encoded strings.  And
> given that encoding and decoding are unavoidable, that means making
> use of bytestrings with text semantics painful.
>
> So to answer my question from my own point of view, for example, I
> would have no problem at all with
>
>     b'{0:c}'.format(27) == b'\x1b'           # insert an ASCII ESC character
>
> I would be leery of
>
>     b'{0:s}'.format(b'\x1b[M') == b'\x1b[M'  # insert an ANSI control sequence
>
> for the reason given above (for this use case, I would prefer
>
>     blue_code = ord('M')                    # Or b'M', doesn't matter!
>     b'\x1b[{0:c}'.format(blue_code) == b'\x1b[M'
>
> -- and forgive me for not looking up my ANSI color sequences, it's
> only luck if that's close) and I would consider
>
>     b'{0:d}'.format(27) == b'27'             # insert the ASCII representation
>
> to be an abomination since there's no reason to suppose that any given
> bytestring is encoded in an ASCII-compatible way, or big-endian for
> that matter.  Ditto everything else that involves representing a
> number as a string of numeric characters.
>
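
For what it's worth, the mixed-encoding pitfall quoted above is easy to
reproduce with plain concatenation, no format method needed. This is
only a sketch: the sample strings are made up, and the variable names
just follow the quoted example.

```python
# Two bytestrings that arrive from different places with different encodings.
bs1 = 'привет'.encode('utf-8')
bs2 = 'мир'.encode('koi8-r')

# Joining them works without complaint, since bytes carry no encoding tag.
bs3 = bs1 + b' ' + bs2

# The problem only shows up later, when something expects valid UTF-8.
try:
    bs3.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

The join itself never fails; the error surfaces only at the decode
step, which may well be in a different module.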