On 09/17/2014 06:57 AM, Nick Coghlan wrote:
The point on the issue tracker was that while this is a good way to obtain the flexibility, adhering too closely to the "standard format syntax" as I did likely isn't a good idea. Instead, we'd be better going for the strftime model where a type specific format (e.g. as an argument to the new *.hex() methods being discussed in http://bugs.python.org/issue) is *also* supported via __format__.
One thing I'd like to not support here that strftime does: arbitrary pass-through text in the format string. It's useful for date/time, but not here. And your examples below don't allow it, I just wanted to be clear.
For example, inspired directly by the way hex editors work, you could envision an approach where you had a base format character (chosen to be orthogonal to the default format characters):
"h": lowercase hex "H": uppercase hex "A": ASCII (using "." for unprintable & extended ASCII) format(b"xyz", "A") -> 'xyz' format(b"xyz", "h") -> '78797a' format(b"xyz", "H") -> '78797A'
Followed by a separator and "chunk size":
format(b"xyz", "h 1") -> '78 79 7a'
format(b"abcdwxyz", "h 4") -> '61626364 7778797a'

format(b"xyz", "h,1") -> '78,79,7a'
format(b"abcdwxyz", "h,4") -> '61626364,7778797a'

format(b"xyz", "h:1") -> '78:79:7a'
format(b"abcdwxyz", "h:4") -> '61626364:7778797a'
In the "h" and "H" cases, you could request a preceding "0x" on the chunks:
format(b"xyz", "h#") -> '0x78797a'
format(b"xyz", "h# 1") -> '0x78 0x79 0x7a'
format(b"abcdwxyz", "h# 4") -> '0x61626364 0x7778797a'
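To make the intended behaviour of these examples concrete, here is a minimal sketch using keyword arguments rather than a format string. The function name `dump` and all of its parameters are hypothetical, and it assumes chunks are counted in bytes from the left, with a short final chunk allowed:

```python
import binascii


def dump(data, kind="h", sep=" ", size=None, prefix=False):
    """Render bytes as in the examples above (hypothetical helper).

    kind:   "h" lowercase hex, "H" uppercase hex, "A" printable ASCII
    size:   chunk size in bytes, or None for a single chunk
    prefix: put "0x" in front of every hex chunk (the "#" flag)
    """
    if size is None:
        chunks = [data]
    else:
        chunks = [data[i:i + size] for i in range(0, len(data), size)]

    rendered = []
    for chunk in chunks:
        if kind == "A":
            # Unprintable and extended-ASCII bytes become "."
            rendered.append(
                "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
            )
        else:
            text = binascii.hexlify(chunk).decode("ascii")
            if kind == "H":
                text = text.upper()
            rendered.append(("0x" if prefix else "") + text)
    return sep.join(rendered)


print(dump(b"xyz", size=1))                  # 78 79 7a
print(dump(b"abcdwxyz", sep=":", size=4))    # 61626364:7778797a
print(dump(b"xyz", size=1, prefix=True))     # 0x78 0x79 0x7a
```

Whether the "0x" prefix should follow the case of "H" is left open by the examples; the sketch always uses lowercase "0x".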
The section before the format character would use the standard string formatting rules: alignment, fill character, width, precision
I think that's too confusing. For example, '#' is also allowed before the format character:
[[fill]align][sign][#][0][width][,][.precision][type]
And precision doesn't make sense for bytes (and is currently not allowed for int). So you'd instead have the complete format specifier be:
[[fill]align][sign][#][0][width][type][#][internal-fill][chunksize]
I think "sign" might have to go: it doesn't make sense. Not sure about "0". Let's say they both go, and we're left with:
[[fill]align][width][type][#][internal-fill][chunksize]
I'm not completely opposed to this, but I think we can do better.
I see basically 3 options for byte format specifiers:
1. Support exactly what the standard types (int, str, float, etc.) support, but give slightly different semantics to it. This is what Nick originally proposed on the issue tracker.
2. Support a slightly different format specifier. This is what Nick proposes above, and I discuss more below. The downside of this is that it might be confusing to some users, who see the printf-like formatting as some universal standard. It's also hard to document.
3. Do something radically different. I gave an example on the issue tracker, but I'm not totally serious about this.
Here's my proposal for #2: The format specifier becomes:
[[fill]align][#][width][separator][/chunksize][type]
The only difference here (from the standard format specifier) is that I've substituted "/chunksize" for ".precision", and generalized the separator. I think "/" reads well as "divide into chunks this size". We might want to restrict "separator" to a few characters, maybe one of space, colon, dot, comma. I think Nick's usage of 'A', 'H', and 'h' for the "type" character is good, although I'd really prefer 'a'. And it's possible 'x' and 'X' would be less confusing (because they're more familiar), but maybe they'd increase confusion instead. Let's keep 'h' and 'H' for now, just for discussion purposes.
So, Nick's examples become:
format(b"xyz", "a") -> 'xyz'
format(b"xyz", "h") -> '78797a'
format(b"xyz", "H") -> '78797A'
Followed by a separator and "chunk size":
format(b"xyz", "/1h") -> '78 79 7a'
format(b"abcdwxyz", "/4h") -> '61626364 7778797a'

format(b"xyz", ",/1h") -> '78,79,7a'
format(b"abcdwxyz", ",/4h") -> '61626364,7778797a'

format(b"xyz", ":/1h") -> '78:79:7a'
format(b"abcdwxyz", ":/4h") -> '61626364:7778797a'

format(b"xyz", "#h") -> '0x78797a'
format(b"xyz", "#/1h") -> '0x78 0x79 0x7a'
format(b"abcdwxyz", "#/4h") -> '0x61626364 0x7778797a'
I really haven't thought through parsing this format specifier. Obviously "separator" will have some restrictions, like it can't be a slash. I'll have to give it some more thought.
As with the standard format specifiers, there are some restrictions. 'a' couldn't have '#', for example. But I don't see why it couldn't have 'chunksize'.
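As a rough idea of how parsing might go, here is a sketch built on a regular expression. It is not a worked-out design: the names `SPEC` and `format_bytes` are hypothetical, the separator is assumed to be limited to space, colon, dot, or comma (defaulting to space), the type character is assumed to be required, and a separator given without "/chunksize" is simply ignored.

```python
import binascii
import re

# Hypothetical grammar:
#   [[fill]align][#][width][separator][/chunksize][type]
SPEC = re.compile(
    r"(?:(?P<fill>.)?(?P<align>[<>^]))?"
    r"(?P<alt>#)?"
    r"(?P<width>\d+)?"
    r"(?P<sep>[ :.,])?"
    r"(?:/(?P<chunk>\d+))?"
    r"(?P<type>[ahH])",
    re.DOTALL,
)


def format_bytes(data, spec):
    m = SPEC.fullmatch(spec)
    if m is None:
        raise ValueError("invalid format specifier: %r" % spec)
    kind = m.group("type")
    alt = m.group("alt") is not None
    if kind == "a" and alt:
        raise ValueError("'#' not allowed with 'a'")

    # Split into chunks only when /chunksize was given.
    if m.group("chunk") is None:
        chunks = [data]
    else:
        size = int(m.group("chunk"))
        if size < 1:
            raise ValueError("chunksize must be at least 1")
        chunks = [data[i:i + size] for i in range(0, len(data), size)]

    def render(chunk):
        if kind == "a":
            return "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        text = binascii.hexlify(chunk).decode("ascii")
        if kind == "H":
            text = text.upper()
        return ("0x" if alt else "") + text

    body = (m.group("sep") or " ").join(render(c) for c in chunks)

    # Apply fill, align and width to the whole result, as str does.
    pad = {"<": str.ljust, ">": str.rjust, "^": str.center}
    align = m.group("align") or "<"
    return pad[align](body, int(m.group("width") or 0), m.group("fill") or " ")


print(format_bytes(b"xyz", ":/1h"))        # 78:79:7a
print(format_bytes(b"abcdwxyz", "#/4h"))   # 0x61626364 0x7778797a
```

With this pattern a slash can never be taken as the separator, and the examples above come out as written.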
Eric.