[Python-ideas] Stop displaying elements of bytes objects as printable ASCII characters in CPython 3

Wed Sep 17 13:48:51 CEST 2014

On 09/17/2014 06:57 AM, Nick Coghlan wrote:
> The point on the issue tracker was that while this is a good way to
> obtain the flexibility, adhering too closely to the "standard format
> syntax" as I did likely isn't a good idea. Instead, we'd be better
> going for the strftime model where a type specific format (e.g. as an
> argument to the new *.hex() methods being discussed in
> http://bugs.python.org/issue) is *also* supported via __format__.

One thing strftime supports that I'd rather not support here: arbitrary
pass-through text in the format string. It's useful for date/time, but
not here. Your examples below don't allow it; I just wanted to be clear.
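
For readers unfamiliar with the strftime behaviour being referred to, a
small illustration using the standard datetime module (the variable
names are only for this example):

```python
import datetime

sent = datetime.date(2014, 9, 17)

# Anything that is not a % directive is copied through unchanged.
via_strftime = sent.strftime("Sent on %Y-%m-%d")

# date.__format__ passes a non-empty spec on to strftime, so format()
# accepts the same pass-through text.
via_format = format(sent, "Sent on %Y-%m-%d")

print(via_strftime)
print(via_format)
```

This literal-text pass-through is the part that would not carry over to
a bytes format specifier.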

> For example, inspired directly by the way hex editors work, you could
> envision an approach where you had a base format character (chosen to
> be orthogonal to the default format characters):
> 
>     "h": lowercase hex
>     "H": uppercase hex
>     "A": ASCII (using "." for unprintable & extended ASCII)
> 
>     format(b"xyz", "A") -> 'xyz'
>     format(b"xyz", "h") -> '78797a'
>     format(b"xyz", "H") -> '78797A'
> 
> Followed by a separator and "chunk size":
> 
>     format(b"xyz", "h 1") -> '78 79 7a'
>     format(b"abcdwxyz", "h 4") -> '61626364 7778797a'
> 
>     format(b"xyz", "h,1") -> '78,79,7a'
>     format(b"abcdwxyz", "h,4") -> '61626364,7778797a'
> 
>     format(b"xyz", "h:1") -> '78:79:7a'
>     format(b"abcdwxyz", "h:4") -> '61626364:7778797a'
> 
> In the "h" and "H" cases, you could request a preceding "0x" on the chunks:
> 
>     format(b"xyz", "h#") -> '0x78797a'
>     format(b"xyz", "h# 1") -> '0x78 0x79 0x7a'
>     format(b"abcdwxyz", "h# 4") -> '0x61626364 0x7778797a'
> 
> The section before the format character would use the standard string
> formatting rules: alignment, fill character, width, precision

I think that's too confusing. For example, '#' is also allowed before
the format character:
[[fill]align][sign][#][0][width][,][.precision][type]

And precision doesn't make sense for bytes (and is currently not allowed
for int). So you'd instead have the complete format specifier be:
[[fill]align][sign][#][0][width][type][#][internal-fill][chunksize]
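
As a quick check of the standard behaviour referred to above, using int
as the example type. This only exercises the existing built-in
format(); nothing here is part of the bytes proposal, and the exact
error message may differ between versions:

```python
# '#' in the standard position (before the width) adds the 0x prefix.
assert format(255, "#x") == "0xff"

# With zero padding, the prefix counts towards the total width.
assert format(255, "#010x") == "0x000000ff"

# ',' is the only separator the standard specifier knows about.
assert format(1234567, ",") == "1,234,567"

# A precision is rejected for integer presentation types.
try:
    format(255, ".2x")
except ValueError as exc:
    precision_error = str(exc)
else:
    precision_error = None

print(precision_error)
```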

I think "sign" might have to go: it doesn't make sense. Not sure about
"0". Let's say they both go, and we're left with:
[[fill]align][width][type][#][internal-fill][chunksize]

I'm not completely opposed to this, but I think we can do better.
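
To make the quoted examples concrete, here is a rough sketch of the
type-first layout in plain Python. The function name
format_bytes_typefirst and the regular expression are hypothetical, and
only the "type, optional '#', optional separator and chunk size" part
is handled; fill, alignment and width are left out. Restricting the
separator to space, comma and colon is an assumption taken from the
examples.

```python
import re

# Hypothetical helper, not an existing or proposed API.
# Layout handled: type, optional '#', optional separator + chunk size.
_TYPEFIRST_RE = re.compile(
    r"(?P<type>[hHA])(?P<alt>#)?(?:(?P<sep>[ ,:])(?P<chunk>\d+))?\Z"
)


def format_bytes_typefirst(data, spec):
    m = _TYPEFIRST_RE.match(spec)
    if m is None:
        raise ValueError("invalid format specifier: %r" % (spec,))
    kind, alt, sep, chunk = m.group("type", "alt", "sep", "chunk")

    if kind == "A":
        if alt:
            raise ValueError("'#' is not allowed with 'A'")
        # "." stands in for unprintable and extended-ASCII bytes.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in data)
        chars_per_byte = 1
    else:
        text = data.hex()
        if kind == "H":
            text = text.upper()
        chars_per_byte = 2

    if chunk is None:
        pieces = [text]
    else:
        step = int(chunk) * chars_per_byte  # chunk size counts bytes
        if step == 0:
            raise ValueError("chunk size must be positive")
        pieces = [text[i:i + step] for i in range(0, len(text), step)]

    prefix = "0x" if alt else ""
    return (sep or "").join(prefix + piece for piece in pieces)


print(format_bytes_typefirst(b"abcdwxyz", "h# 4"))
```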

I see basically 3 options for byte format specifiers:

1. Support exactly what the standard types (int, str, float, etc.)
support, but give slightly different semantics to it. This is what Nick
originally proposed on the issue tracker.

2. Support a slightly different format specifier. This is what Nick
proposes above, and I discuss more below. The downside of this is that
it might be confusing to some users, who see the printf-like formatting
as some universal standard. It's also hard to document.

3. Do something radically different. I gave an example on the issue
tracker, but I'm not totally serious about this.

Here's my proposal for #2: The format specifier becomes:
[[fill]align][#][width][separator][/chunksize][type]

Apart from dropping "sign" and "0" as discussed above, the only
difference from the standard format specifier is that I've substituted
"/chunksize" for ".precision", and generalized the separator. I think
"/" reads well as "divide into chunks this size". We might want to
restrict "separator" to a few characters, maybe one of space, colon,
dot, comma. I think Nick's usage of 'A', 'H', and 'h' for the "type"
character is good, although I'd really prefer 'a'. It's possible 'x'
and 'X' would be less confusing (because they're more familiar), but
they might instead increase confusion. Let's keep 'h' and 'H' for now,
just for discussion purposes.

So, Nick's examples become:

    format(b"xyz", "a") -> 'xyz'
    format(b"xyz", "h") -> '78797a'
    format(b"xyz", "H") -> '78797A'

With a separator and "chunk size":

    format(b"xyz", "/1h") -> '78 79 7a'
    format(b"abcdwxyz", "/4h") -> '61626364 7778797a'

    format(b"xyz", ",/1h") -> '78,79,7a'
    format(b"abcdwxyz", ",/4h") -> '61626364,7778797a'

    format(b"xyz", ":/1h") -> '78:79:7a'
    format(b"abcdwxyz", ":/4h") -> '61626364:7778797a'

    format(b"xyz", "#h") -> '0x78797a'
    format(b"xyz", "#/1h") -> '0x78 0x79 0x7a'
    format(b"abcdwxyz", "#/4h") -> '0x61626364 0x7778797a'

I really haven't thought through parsing this format specifier.
Obviously "separator" will have some restrictions, like it can't be a
slash. I'll have to give it some more thought.
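
One possible way to parse such a specifier is a single regular
expression, sketched below. Everything here is an assumption made for
illustration: the names BytesSpec, parse_bytes_spec and format_bytes
are hypothetical, the separator is limited to space, colon, dot and
comma, the default separator is taken to be a space (inferred from the
"/1h" example), and width is treated as a minimum total width padded
with fill/align as for str.

```python
import re
from collections import namedtuple

# Hypothetical sketch of [[fill]align][#][width][separator][/chunksize][type].
BytesSpec = namedtuple("BytesSpec", "fill align alt width sep chunksize type")

_SPEC_RE = re.compile(
    r"(?:(?P<fill>.)?(?P<align>[<>^]))?"
    r"(?P<alt>#)?"
    r"(?P<width>\d+)?"
    r"(?P<sep>[ :.,])?"        # '/' can never be a separator here
    r"(?:/(?P<chunk>\d+))?"
    r"(?P<type>[ahH])\Z",
    re.DOTALL,
)

_PAD = {"<": str.ljust, ">": str.rjust, "^": str.center}


def parse_bytes_spec(spec):
    m = _SPEC_RE.match(spec)
    if m is None:
        raise ValueError("invalid format specifier: %r" % (spec,))
    if m.group("type") == "a" and m.group("alt"):
        raise ValueError("'#' is not allowed with 'a'")
    chunk = m.group("chunk")
    if chunk is not None and int(chunk) == 0:
        raise ValueError("chunk size must be positive")
    return BytesSpec(
        fill=m.group("fill") or " ",
        align=m.group("align") or "<",
        alt=bool(m.group("alt")),
        width=int(m.group("width") or 0),
        sep=m.group("sep") or " ",   # assumed default separator
        chunksize=int(chunk) if chunk is not None else None,
        type=m.group("type"),
    )


def format_bytes(data, spec):
    s = parse_bytes_spec(spec)
    if s.type == "a":
        # "." stands in for unprintable and extended-ASCII bytes.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in data)
        chars_per_byte = 1
    else:
        text = data.hex()
        if s.type == "H":
            text = text.upper()
        chars_per_byte = 2

    if s.chunksize is None:
        pieces = [text]
    else:
        step = s.chunksize * chars_per_byte  # chunk size counts bytes
        pieces = [text[i:i + step] for i in range(0, len(text), step)]

    prefix = "0x" if s.alt else ""
    body = s.sep.join(prefix + piece for piece in pieces)
    return _PAD[s.align](body, s.width, s.fill)


print(format_bytes(b"abcdwxyz", "#/4h"))
```

The optional fill/align group is tried first, so a leading '#', ',' or
':' is only read as a fill character when an alignment character
follows it. The sketch also rejects '#' with 'a' while still letting
'a' take a chunk size.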

As with the standard format specifiers, there are some restrictions. 'a'
couldn't have '#', for example. But I don't see why it couldn't have
'chunksize'.

Eric.