[Python-3000] Thoughts on new I/O library and bytecode

Sun Mar 4 00:52:39 CET 2007

On 3/3/07, Gareth McCaughan <gareth.mccaughan at pobox.com> wrote:
> On Tuesday 27 February 2007 00:39, Greg Ewing wrote:
>
> > I can't help feeling the people arguing for b"..." as the
> > repr format haven't really accepted the fact that text and
> > binary data will be distinct things in py3k, and are thinking
> > of bytes as being a replacement for the old string type. But
> > that's not true -- most of the time, *unicode* will be the
> > replacement for str when it is used to represent characters,
> > and bytes will mostly be used only for non-text.
> [etc.]
>
> ... but Guido prefers to use b"..." as the repr format,
> on the grounds that byte-sequences quite often are
> lightly encoded text, and that when that's true it
> can be *much* better to report them as such.

I agree with Guido here. As someone who has written a lot of protocol
implementations and parser/generators for a few strange binary
formats, I would prefer the literal syntax that lets me use ASCII.

I would have to say that most protocols these days are lightly encoded
text anyway, so it's most beneficial to optimize for the ASCII case.

> Here's an ugly, impure, but possibly practical answer:
> give each bytes object a single-bit flag meaning something
> like "mostly textual"; make the bytes([1,2,3,4]) constructor
> set it to false, the b"abcde" constructor set it to true,
> and arbitrary operations on bytes objects do ... well,
> something plausible :-). (Textuality/non-textuality is
> generally preserved; combining textual and non-textual
> yields non-textual.) Then repr() can look at that flag
> and decide what to do on the basis of it.

That sounds like a generally bad idea... Even if a protocol is "mostly
binary", a dump of single-byte decimal integers is likely to be *less*
useful than b"\x01\x02\x03\x04". Almost all protocols deal in integers
larger than one byte, so a sequence of decimal bytes is really the
worst thing to see in those cases.
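
A minimal sketch of that comparison, assuming the bytes type and repr
as they exist in released Python 3 (not necessarily what was on the
table at the time of this thread). The four-byte payload is
hypothetical and chosen only for illustration:

```python
import struct

# Hypothetical 4-byte payload, used only to compare the two displays.
data = bytes([1, 2, 3, 4])

decimal_view = list(data)   # one decimal integer per byte
escaped_view = repr(data)   # bytes literal with hex escapes

# Read the same four bytes as one big-endian unsigned 32-bit integer.
(value,) = struct.unpack(">I", data)

print(decimal_view)
print(escaped_view)
print(hex(value))
```

The hex escapes line up digit for digit with the hex form of the
multi-byte integer, which the decimal list does not.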

Erlang is in general really good about dealing with bytes (their
binary type) but the printed representation is suboptimal because it
behaves kinda like that.

1> Chunk = <<11:16/big, "foo bar baz">>.
<<0,11,102,111,111,32,98,97,114,32,98,97,122>>
2> <<Length:16/big, Rest/binary>> = Chunk.
<<0,11,102,111,111,32,98,97,114,32,98,97,122>>
3> <<String:Length/binary, Extra/binary>> = Rest.
<<"foo bar baz">>
4> {Length, String, Extra}.
{11,<<"foo bar baz">>,<<"">>}
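
For comparison, here is a rough Python rendering of the same
length-prefixed parse. It is only a sketch: it assumes the struct
module and bytes slicing from released Python 3, and the variable
names simply mirror the Erlang session.

```python
import struct

# Build the chunk: a 16-bit big-endian length followed by the payload.
chunk = struct.pack(">H", 11) + b"foo bar baz"

# Split off the 2-byte length prefix, then slice the payload by length.
(length,) = struct.unpack_from(">H", chunk, 0)
rest = chunk[2:]
string, extra = rest[:length], rest[length:]

print(repr(chunk))
print((length, string, extra))
```

Here the repr of the whole chunk keeps the payload readable even
though the first two bytes are not printable.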

When Erlang prints the "repr" of a list or binary term to the shell,
it first checks whether every item is a printable ASCII integer. If
so, it prints the term as an ASCII string. Otherwise, it prints it as
a list of decimal integers. That doesn't work out well in these kinds
of situations: one non-printable byte is enough to turn the whole term
into decimal. If it were printed as ASCII with hex escapes, it would
make a lot more sense at a glance.
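
The two display strategies can be sketched in Python as follows. This
is a simplified model, not Erlang's actual implementation: the
function names are hypothetical, "printable" is assumed to mean bytes
0x20 through 0x7E, and quoting inside the Erlang-style string is not
handled.

```python
def erlang_style(data: bytes) -> str:
    """All-or-nothing: a quoted string only if every byte is printable."""
    if all(0x20 <= b <= 0x7E for b in data):
        return '<<"' + data.decode("ascii") + '">>'
    return "<<" + ",".join(str(b) for b in data) + ">>"


def ascii_with_hex_escapes(data: bytes) -> str:
    """Per-byte: printable ASCII stays as-is, everything else is \\xNN."""
    parts = []
    for b in data:
        # Escape the double quote and backslash too, so output is unambiguous.
        if 0x20 <= b <= 0x7E and b not in (0x22, 0x5C):
            parts.append(chr(b))
        else:
            parts.append("\\x%02x" % b)
    return 'b"' + "".join(parts) + '"'


chunk = b"\x00\x0bfoo bar baz"
print(erlang_style(chunk))
print(ascii_with_hex_escapes(chunk))
```

With the chunk from the session above, the first function falls back
to decimal for the whole term, while the second escapes only the two
length bytes.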

-bob