[Python-ideas] Binary f-strings

Mon Sep 28 03:23:30 CEST 2015

Now that f-strings are in the 3.6 branch, I'd like to turn my attention
to binary f-strings (fb'' or bf'').

The idea is that:

>>> bf'datestamp:{datetime.datetime.now():%Y%m%d}\r\n'

Might be translated as:

>>> import datetime
>>> (b'datestamp:' +
...  bytes(format(datetime.datetime.now(),
...               str(b'%Y%m%d', 'ascii')),
...        'ascii') +
...  b'\r\n')

Which would result in:
b'datestamp:20150927\r\n'
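
As a rough sketch, the translation above can be emulated today with a
small helper. The name bf_field and its parameters are hypothetical,
not part of any proposal, and a fixed date is used so the result is
reproducible:

```python
import datetime

def bf_field(value, format_spec=b'', encoding='ascii'):
    """Hypothetical helper: format one replacement field, then encode it."""
    # The format spec is written as bytes in the literal, but format()
    # needs str, so decode it first (assumed to be ASCII here).
    spec = str(format_spec, 'ascii')
    # __format__ returns str; convert the result back to bytes.
    return bytes(format(value, spec), encoding)

stamp = datetime.datetime(2015, 9, 27, 21, 23, 30)
result = b'datestamp:' + bf_field(stamp, b'%Y%m%d') + b'\r\n'
print(result)  # b'datestamp:20150927\r\n'
```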

The only real question is: what encoding to use for the second parameter
to bytes()? Since an object must return unicode from __format__(), I
need to convert that to bytes in order to join everything together. But how?

Here I suggest 'ascii'. Unfortunately, this would give an error if
__format__ returned anything containing a character above U+007F. I
think we've learned that an API that raises an exception only for
certain specific inputs is fragile.
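
To make the fragility concrete, here is a minimal sketch in which
whether the call succeeds depends only on the value being formatted.
The helper name encode_field is an assumption used for illustration:

```python
def encode_field(value, encoding='ascii'):
    """Hypothetical helper: format a value and encode the result."""
    return bytes(format(value, ''), encoding)

# Pure-ASCII data works fine.
ok = encode_field('cafe')

# The same code path fails as soon as the data contains U+00E9.
try:
    encode_field('caf\u00e9')
    raised = False
except UnicodeEncodeError:
    raised = True

print(ok, raised)
```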

Guido has suggested using 'utf-8' as the encoding. That has some appeal,
but if we're designing this for wire protocols, not all protocols will
be using utf-8.
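
A short illustration of why the choice matters on the wire: the same
text yields different bytes under different encodings, so a protocol
expecting Latin-1 would receive the wrong data if 'utf-8' were applied
silently. Latin-1 is only an example of a non-utf-8 protocol encoding
here:

```python
text = format('caf\u00e9', '')

utf8_bytes = text.encode('utf-8')      # b'caf\xc3\xa9'
latin1_bytes = text.encode('latin-1')  # b'caf\xe9'

# No exception is raised in either case; the payloads simply differ.
print(utf8_bytes, latin1_bytes, utf8_bytes == latin1_bytes)
```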

Another idea would be to extend the "conversion char" from just 's',
'r', or 'a', which don't make much sense for bytes, to instead be a
string that specifies the encoding. The default could be ascii, and if
you want to specify something else:
bf'datestamp:{datetime.datetime.now()!utf-8:%Y%m%d}\r\n'

That would work for any encoding that doesn't have ':', '{', or '}' in
its name, which seems like a reasonable restriction.

And I might be over-generalizing here, but you'd presumably want to
allow the encoding to be non-constant:
bf'datestamp:{datetime.datetime.now()!{encoding}:%Y%m%d}\r\n'
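
The encoding-as-conversion idea could be sketched roughly as follows.
Both split_field and render_field are hypothetical names, and the
splitting is deliberately naive: it assumes the expression itself
contains no '!' or ':', which a real parser could not assume:

```python
import datetime

def split_field(field):
    """Split 'expr!encoding:spec' into (expr, encoding, spec)."""
    # Everything after the first ':' is the format spec, so the
    # encoding name itself cannot contain ':'.
    head, _, spec = field.partition(':')
    expr, _, encoding = head.partition('!')
    # Fall back to the proposed default when no conversion is given.
    return expr, encoding or 'ascii', spec

def render_field(value, encoding, spec):
    """Format the value, then encode it with the chosen encoding."""
    return format(value, spec).encode(encoding)

expr, encoding, spec = split_field('now!utf-8:%Y%m%d')
now = datetime.datetime(2015, 9, 27)  # fixed value for reproducibility
rendered = b'datestamp:' + render_field(now, encoding, spec) + b'\r\n'
print(expr, encoding, spec, rendered)
```

A non-constant encoding would then just mean evaluating the part after
'!' as a nested field before calling render_field.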

I think my initial proposal will be to use 'ascii', and not support any
conversion characters at all for fb-strings, not even 's', 'r', and 'a'.
In the future, if we want to support encodings other than 'ascii', we
could then add !conversions mapping to encodings.

My reasoning for using 'ascii' is that 'utf-8' could easily be an error
for non-utf-8 protocols. And by using 'ascii', at least we'd give a
runtime error and not put possibly bogus data into the resulting binary
string. Granted, the tradeoff is that we now have a case where whether
or not the code raises an exception is dependent upon the values being
formatted. If 'ascii' is the default, we could later switch to 'utf-8',
but we couldn't go the other way.
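
The one-way compatibility comes from ASCII being a subset of UTF-8:
anything that encodes under 'ascii' produces identical bytes under
'utf-8', so widening the default later would not change existing
output. A small check, using made-up sample strings:

```python
samples = ['datestamp:20150927', 'Content-Length: 42', '']

# Every ASCII-encodable string encodes to the same bytes as UTF-8.
same = all(s.encode('ascii') == s.encode('utf-8') for s in samples)

# Going the other way would break: this encodes under 'utf-8' today
# but would start raising if the default were narrowed to 'ascii'.
try:
    'caf\u00e9'.encode('ascii')
    narrowing_safe = True
except UnicodeEncodeError:
    narrowing_safe = False

print(same, narrowing_safe)
```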

The only place this is likely to be a problem is when formatting unicode
string values. No other built-in type is going to produce a non-ascii
character from its __format__, unless you do tricky things with
datetime format_specs. Of course user-defined types can return any
unicode chars from __format__.
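
For instance, a user-defined type along these lines would trip the
'ascii' default. The Temperature class is invented purely for
illustration:

```python
class Temperature:
    """Hypothetical type whose __format__ appends a degree sign."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        # U+00B0 (degree sign) is outside the ASCII range.
        return format(self.degrees, spec) + '\u00b0C'

text = format(Temperature(21.5), '.1f')

try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_bytes = text.encode('utf-8')  # b'21.5\xc2\xb0C'
print(text, ascii_ok, utf8_bytes)
```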

Once we make a decision, I can apply the same logic to b''.format(), if
that's desirable.

I'm open to suggestions on this.

Thanks for reading.

-- 
Eric.