byteformat() proposal: please critique
The following function interpolates bytes, bytearrays, and formatted strings, the latter two auto-converted to bytes, into a bytes (or auto-converted bytearray) format. This function automates much of what some people have recommended for combining ASCII text and binary blobs. The test passes on 2.7.6 as well as 3.3.3, though a 2.7-only version would be simpler.

===============
# bf.py -- Terry Jan Reedy, 2014 Jan 11
"Define byteformat(): a bytes version of str.format as a function."
import re

def byteformat(form, obs):
    '''Return bytes-formatted objects interpolated into bytes format.

    The bytes or bytearray format has two types of replacement fields.
    b'{}' and b'{:}': The object can be any raw bytes or bytearray object.
    b'{:<format_spec>}': The object can be any object ob that can be
    string-formatted with <format_spec>. Bytearrays are converted to bytes.

    The text encoding is the default (encoding="utf-8", errors="strict").
    Users should explicitly encode to bytes for any other encoding.
    The struct module can be used to produce bytes, such as binary-formatted
    integers, that are not encoded text.

    Test passes on both 2.7.6 and 3.3.3.
    '''
    if isinstance(form, bytearray):
        form = bytes(form)
    fields = re.split(b'{:?([^}]*)}', form)
    # print(fields)
    if len(fields) != 2*len(obs)+1:
        raise ValueError('Number of replacement fields not same as len(obs)')
    j = 1  # index into fields
    for ob in obs:
        if isinstance(ob, bytearray):
            ob = bytes(ob)
        field = fields[j]
        fields[j] = format(ob, field.decode()).encode() if field else ob
        j += 2
    return b''.join(fields)

# test code
bformat = b"bytes: {}; bytearray: {:}; unicode: {:s}; int: {:5d}; float: {:7.2f}; end"
objects = (b'abc', bytearray(b'def'), u'ghi', 123, 12.3)
result = byteformat(bformat, objects)
result2 = byteformat(bytearray(bformat), objects)
strings = (ob.decode() if isinstance(ob, (bytes, bytearray)) else ob
           for ob in objects)
expect = bformat.decode().format(*strings).encode()
#print(result)
#print(result2)
print(expect)
assert result == result2 == expect
=====

This has been edited from what I posted to issue 3982 to expand the docstrings and to work the same with both bytes and bytearrays on both 2.7 and 3.3. When I posted before, I thought of it merely as a proof-of-concept prototype. After reading the seemingly endless discussion of possible variations of byte formatting with % and .format, I now present it as a real, concrete proposal.

There are, of course, details that could be tweaked. The encoding uses the default, which on 3.x is (encoding='utf-8', errors='strict'). This could be changed to an explicit encoding='ascii'. If that were done, the encoding could be made a parameter that defaults to 'ascii'. The joiner could be defined as type(form)() so the output type matches the input form type. I did not do that because it complicates the test.

The coercion of interpolated bytearray objects to bytes is needed for 2.7 because in 2.7, str/bytes.join raises TypeError for bytearrays in the input sequence. A 3.x-only version could drop this.

One objection to the function is that it is neither % nor .format. To me, this is an advantage in that a new function will not be expected to exactly match the % or .format behavior in either 2.x or 3.x. It eliminates the 'matching the old' arguments so we can focus on what actual functionality is needed.

There is no need to convert true binary bytes to text with either latin-1 or surrogates. There is no need to add anything to bytes. The code above uses the built-in facilities that we already have, which to me should be the first thing to try, not the last.

One new feature that does not match old behavior is that {} and {:} are changed (in 3.x) to indicate bytes whereas {:s} continues to indicate (in 3.x) unicode text. ({:s} might be changed to mean unicode for 2.7 also, but I did not explore that idea.) Similarly, a new function is free to borrow only the format_spec part of replacement fields and use format(ob, format_spec) to format each object. Anyone who needs the full power of str.format is free to use it explicitly. I think format_specs cover most of what people have asked for.

For future releases, the function could go in the string module. It could otherwise be added to existing or future 2&3 porting packages.

--
Terry Jan Reedy
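The tweaks listed in the post (an encoding parameter that defaults to 'ascii', and a joiner built from type(form)() so the output type follows the format type) might look roughly like the sketch below. It is 3.x-only and is not the posted code: the name byteformat_enc, the precompiled _FIELD pattern, and the explicit TypeError for non-bytes objects in an empty field are assumptions made for illustration.

```python
import re
import struct

# Same field pattern as the posted code, with the braces escaped.
_FIELD = re.compile(br'\{:?([^}]*)\}')

def byteformat_enc(form, obs, encoding='ascii', errors='strict'):
    """Hypothetical byteformat() variant with an explicit text encoding."""
    parts = _FIELD.split(bytes(form))
    if len(parts) != 2 * len(obs) + 1:
        raise ValueError('Number of replacement fields not same as len(obs)')
    for j, ob in enumerate(obs):
        k = 2 * j + 1  # odd indexes hold the captured format_specs
        spec = parts[k]
        if spec:
            # Non-empty format_spec: format as text, then encode explicitly.
            text = format(ob, spec.decode('ascii'))
            parts[k] = text.encode(encoding, errors)
        elif isinstance(ob, (bytes, bytearray)):
            # Empty spec: raw bytes pass through unchanged.
            parts[k] = bytes(ob)
        else:
            raise TypeError('{} and {:} fields require bytes or bytearray')
    # The joiner has the type of the format: bytearray in, bytearray out.
    return type(form)().join(parts)

# struct produces the binary (non-text) piece; the function only splices it in.
packet = byteformat_enc(b'id={:04d} raw={}', (7, struct.pack('>H', 513)))
print(packet)
```

With the 'ascii' default, non-ASCII text in a {:s} field raises UnicodeEncodeError unless the caller passes another encoding, which is the behavior the explicit-encoding tweak is meant to provide.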
On Sat, Jan 11, 2014 at 8:20 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> This has been edited from what I posted to issue 3982 to expand the docstrings and to work the same with both bytes and bytearrays on both 2.7 and 3.3. When I posted before, I thought of it merely as a proof-of-concept prototype. After reading the seemingly endless discussion of possible variations of byte formatting with % and .format, I now present it as a real, concrete proposal.
> There are, of course, details that could be tweaked. The encoding uses the default, which on 3.x is (encoding='utf-8', errors='strict'). This could be changed to an explicit encoding='ascii'. If that were done, the encoding could be made a parameter that defaults to 'ascii'. The joiner could be defined as type(form)() so the output type matches the input form type. I did not do that because it complicates the test.
With that flexibility this matches what I have been mulling in the back of my head all day. Basically everything that goes in is assumed to be bytes unless {:s} says to expect something which can be passed to str() and then use some specified encoding in all instances (stupid example following as it might be easier with bytes.join, but it gets the point across)::

    formatter = format_bytes('latin1', 'strict')
    http_response = formatter(b'Content-Type: {:s}\r\n\r\nContent-Length: {:s}\r\n\r\n{}',
                              'image/jpeg', len(data), data)

Nothing fancy, just an easy way to handle having to call str.encode() on every text argument that is to end up as bytes as Terry is proposing (and I'm fine with defaulting to ASCII/strict with no arguments). Otherwise you do what R. David Murray suggested and just have people rely on their own API which accepts what they want and then spits out what they want behind the scenes.

It basically comes down to how much tweaking of existing Python 2.7 %/.format() calls people will be expected to make. I'm fine with asking people to call a function like what Terry is proposing as it can do away with baking in that ASCII is reasonable as well as not require a bunch of work without us having to argue over what bytes.format() should or should not do. Personally I say bytes.format() is fine but it shouldn't do any text encoding which makes its usefulness rather minor (much like the other text-like methods that got carried forward in hopes that they would be useful to people porting code; maybe we should consider taking them out in Python 4 or something if we find out no one is using them).
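One way the factory idea above could be prototyped in pure Python is sketched here, for 3.x only. format_bytes is the name used in the message, but everything inside it is an assumption: the field pattern is borrowed from Terry's byteformat(), {:s} is taken to mean "call str(), then encode", and any other format_spec goes through format() before encoding. The data value is a made-up placeholder, and the headers here are separated by a single CRLF.

```python
import re

_FIELD = re.compile(br'\{:?([^}]*)\}')

def format_bytes(encoding, errors='strict'):
    """Hypothetical factory: return a formatter bound to one text encoding."""
    def formatter(form, *obs):
        parts = _FIELD.split(bytes(form))
        if len(parts) != 2 * len(obs) + 1:
            raise ValueError('Number of replacement fields not same as '
                             'number of arguments')
        for j, ob in enumerate(obs):
            k = 2 * j + 1  # odd indexes hold the captured format_specs
            spec = parts[k].decode('ascii')
            if not spec:
                # {} and {:} accept only raw bytes.
                if not isinstance(ob, (bytes, bytearray)):
                    raise TypeError('empty field requires bytes or bytearray')
                parts[k] = bytes(ob)
            elif spec == 's':
                # {:s}: anything str() accepts, encoded with the bound codec.
                parts[k] = str(ob).encode(encoding, errors)
            else:
                parts[k] = format(ob, spec).encode(encoding, errors)
        return b''.join(parts)
    return formatter

data = b'\xff\xd8\xff\xe0'  # placeholder payload, not real image data
formatter = format_bytes('latin1', 'strict')
http_response = formatter(
    b'Content-Type: {:s}\r\nContent-Length: {:s}\r\n\r\n{}',
    'image/jpeg', len(data), data)
print(http_response)
```

The encoding is chosen once, when the formatter is created, so individual calls never pick a codec implicitly.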
> The coercion of interpolated bytearray objects to bytes is needed for 2.7 because in 2.7, str/bytes.join raises TypeError for bytearrays in the input sequence. A 3.x-only version could drop this.
> One objection to the function is that it is neither % nor .format. To me, this is an advantage in that a new function will not be expected to exactly match the % or .format behavior in either 2.x or 3.x. It eliminates the 'matching the old' arguments so we can focus on what actual functionality is needed.
Agreed.
> There is no need to convert true binary bytes to text with either latin-1 or surrogates. There is no need to add anything to bytes. The code above uses the built-in facilities that we already have, which to me should be the first thing to try, not the last.
I think we are all losing sight of the fact that we are talking about Python 3.5 here. Even with an accelerated release schedule of a year that is still a year away! I think any proposal being made should be prototyped in pure Python and tried on a handful of real-world examples to see what the results end up looking like, to measure how useful they are on their own and how much work it is to port to using it. I think the goal should be a balance and not going to an extreme to minimize porting work from Python 2.7 at the cost of polluting the bytes/string separation and letting people entirely ignore encoding of strings.
> One new feature that does not match old behavior is that {} and {:} are changed (in 3.x) to indicate bytes whereas {:s} continues to indicate (in 3.x) unicode text. ({:s} might be changed to mean unicode for 2.7 also, but I did not explore that idea.) Similarly, a new function is free to borrow only the format_spec part of replacement fields and use format(ob, format_spec) to format each object. Anyone who needs the full power of str.format is free to use it explicitly. I think format_specs cover most of what people have asked for.
> For future releases, the function could go in the string module. It could otherwise be added to existing or future 2&3 porting packages.
I don't think the string module is the right place since this is meant to operate on bytes, but then again I don't know where it would end up if it went into the stdlib. If we have it take the string encoding arguments it could be a method on the bytes type by being a factory method::

    formatter = bytes.formatter('latin1', 'strict')
    ...

I would be willing to go as far as making 'strict' the default 'errors' argument, but I would say it's still good to make people specify even 'ascii', otherwise people lose sight that bytes([ord('1')]) == b'1' == '1'.encode('ascii') != 1 .to_bytes(1, 'big') and that is a key thing to grasp.
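To spell that comparison out under Python 3 (taking ord('1') as the intended call, since ord() needs a one-character string), the snippet below shows the three spellings of the ASCII digit next to the binary encoding of the integer. The variable names are only for illustration.

```python
# Three ways to get the ASCII digit "1" as bytes.
digit_from_ord = bytes([ord('1')])
digit_literal = b'1'
digit_encoded = '1'.encode('ascii')

# The integer 1 serialized as one big-endian byte.
int_as_binary = (1).to_bytes(1, 'big')

print(digit_from_ord, digit_literal, digit_encoded, int_as_binary)
```

The first three values are the single byte 0x31, while the last is the single byte 0x01, which is why the choice between a text encoding and a binary serialization has to be made explicitly.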
On 12 January 2014 12:13, Brett Cannon <brett@python.org> wrote:
> It basically comes down to how much tweaking of existing Python 2.7 %/.format() calls people will be expected to make. I'm fine with asking people to call a function like what Terry is proposing as it can do away with baking in that ASCII is reasonable as well as not require a bunch of work without us having to argue over what bytes.format() should or should not do. Personally I say bytes.format() is fine but it shouldn't do any text encoding which makes its usefulness rather minor (much like the other text-like methods that got carried forward in hopes that they would be useful to people porting code; maybe we should consider taking them out in Python 4 or something if we find out no one is using them).
There are several that are useful for manipulating binary data *as binary data*, including some of those that assume ASCII compatibility. Even some of the odd ones (like bytes.title) which we considered deprecating around 3.2 or so (if I recall correctly) were left because they're useful for HTTP-style headers.

The thing about them all is that even though they do assume ASCII compatibility, they don't do any implicit conversions between raw bytes and other formats - they're all purely about transforming binary data.

PEP 460 as it currently stands is in the same vein - it doesn't blur the lines between binary data and other formats, but it *does* make binary data easier to work with, and in a way that is a subset of what Python 2 8-bit strings allowed, further increasing the size of the Python 2/3 source compatible subset.

The line that is crossed by suggestions like including number formatting in PEP 460 is that those suggestions *do* introduce implicit encoding from structured semantic data (a numeric value) to a serialised format (the ASCII text representation of that number). Implicitly encoding text (even with the ASCII codec and strict error handling) similarly blurs the line between binary and text data again, and is the kind of change that gets rejected as attempting to reintroduce the Python 2 text model back into the Python 3 core types.

That said, while I don't think such a hybrid type is appropriate as part of the *core* text model, I agree that such a type *could* be useful when implementing protocol handling code. That's why I suggested "asciicompat" to Benno as the package name for the home of asciistr - I think it could be a good home for various utilities designed for working with ASCII compatible binary protocols using a more text-like API than that offered by the bytes type in Python 3.

I actually see much of this debate as akin to that over the API changes between Google's original ipaddr module and the ipaddress API in the standard library. The original ipaddr API is fine *if you already know how IP networks work* - it plays fast and loose with terminology, but in a way that you can deal with if you already know the real meaning of the underlying concepts. However, anyone attempting to go the other way (learning IP networking concepts from the ipaddr API) will be hopelessly, hopelessly confused because the terminology is used in *very* loose ways. So ipaddress tightened things up and made the names more formally correct, aiming to make it usable both as an address manipulation library *and* as a way of learning the underlying IP addressing concepts.

I see the Python 2 str type as similar to the ipaddr API - if you already know what you're doing when it comes to Unicode, then it's pretty easy to work with. However, if you're trying to use it to *learn* Unicode concepts, then you're pretty much stuffed, as you get lost in a maze of twisty values, as the same data type is used with very different semantics, depending on which end of a data transformation you're on (although sometimes you'll get a different data type, depending on the data *values* involved).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
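As a small illustration of the distinction drawn in this message, the Python 3 snippet below shows an ASCII-aware bytes method that stays entirely within binary data, next to number formatting, which needs an explicit encoding step. The header name and value are made up for the example.

```python
header = b'content-length'

# bytes.title() assumes ASCII letters but returns bytes; nothing is decoded.
titled = header.title()

# Bytes outside the ASCII range are left untouched by these methods.
shouted = b'\xe9tat'.upper()

# A number is structured data: bytes will not serialise it implicitly.
try:
    b'Content-Length: ' + 42
    implicit_ok = True
except TypeError:
    implicit_ok = False

# The explicit route: pick the text form, then pick the codec.
line = titled + b': ' + str(42).encode('ascii') + b'\r\n'
print(titled, shouted, implicit_ok, line)
```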
participants (4)
- Brett Cannon
- Ethan Furman
- Nick Coghlan
- Terry Reedy