[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Steven D'Aprano steve at pearwood.info
Sat Jan 11 19:36:32 CET 2014


On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
> On 01/11/2014 07:38 AM, Steven D'Aprano wrote:
> >
> >The point that I am making is that many people want to add formatting
> >operations to bytes so they can put ASCII strings inside bytes. But (as
> >far as I can tell) they don't need to do this, because they can treat
> >Unicode strings containing code points U+0000 through U+00FF (i.e. the
> >same range as handled by Latin-1) as if they were bytes.
> 
> So instead of blurring the line between bytes and text, you're blurring the 
> line between text and bytes (with a few extra seat belts thrown in).  

I'm not blurring anything. The people who designed the file format that 
mixes textual data and binary data did the blurring. Given that such 
formats exist, it is inevitable that we need to put text into bytes, or 
bytes into text. The situation is already blurred, we just have to 
decide how to handle it. There are three broad strategies:

1) Make bytes more string-like, so that we can process our data as 
bytes, but still do string operations on the bits that are ASCII.

2) Make strings more byte-like, so that we can process our data as 
strings, but do byte operations (like bit mask operations) on the parts 
that are binary data.

3) Don't do either. Keep the text parts of your data as text, and the 
binary parts of your data as bytes. Do your text operations on text, and 
your byte operations on bytes.
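To make strategy 3 concrete, here is a minimal sketch (the field names are just illustrative): keep the text and binary parts in their natural types, and only join them as bytes at the boundary.

```python
# A sketch of strategy 3: text stays str, binary stays bytes,
# and they are combined exactly once, at the output boundary.
header = b'\xde\xad\xbe\xef'            # binary part stays bytes
body = "Name: %s\n" % "George"          # text part stays str

record = header + body.encode('ascii')  # combine at the boundary
assert record == b'\xde\xad\xbe\xefName: George\n'
```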

At some point, of course, they need to be combined. We have a choice:

* Right now, we can use text as the base, and combine bytes into the 
  text using Latin-1, and it Just Works.

* Or we can wait until (maybe) Python 3.5, when (perhaps) bytes objects 
  will be more text-like, and then use bytes as the base, and (with 
  luck) it Should Just Work.


There's another disadvantage with the second choice: treating bytes as 
if they were ASCII by default reinforces the same old harmful paradigm, 
that text is ASCII, which we're trying to get away from. That's a bad, 
painful idea that causes a lot of problems and buggy code, and it 
should be resisted.

On the other hand, embedding arbitrary binary data in Unicode text 
doesn't reinforce any common or harmful paradigms. It just requires the 
programmer to forget about characters and concentrate on code points, 
since Latin-1 maps bytes to code points in a very convenient way:

Byte 0x00 maps to code point U+0000
Byte 0x01 maps to code point U+0001
Byte 0x02 maps to code point U+0002
...
Byte 0xFF maps to code point U+00FF


So to embed the binary data 0xDEADBEEF in your string, you can just use 
'\xDE\xAD\xBE\xEF' regardless of what characters those code points 
happen to represent.
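The round trip described above can be sketched in a few lines (the name `blob` is just illustrative): every byte value 0x00 through 0xFF maps to the code point with the same number, and back again, losslessly.

```python
# Latin-1 maps each byte to the code point of the same numeric value.
blob = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# bytes -> str: each byte becomes the code point with the same number
text = blob.decode('latin-1')
assert text == '\xDE\xAD\xBE\xEF'
assert [ord(c) for c in text] == [0xDE, 0xAD, 0xBE, 0xEF]

# str -> bytes: the round trip is lossless for code points <= U+00FF
assert text.encode('latin-1') == blob
```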

If we are manipulating data *as if it were text*, then we ought to treat 
it as text, not add methods to bytes that makes bytes text-like. If we 
are manipulating data *as if it were bytes*, doing byte-manipulation 
operations like bit-masking, then we ought to treat it as numeric bytes, 
not add numeric methods to text. Is that really a controversial opinion?


> Besides being a bit awkward, this also means that any encoded text (even 
> the plain ASCII stuff) is now being transformed three times instead of one:
> 
>   unicode to bytes
>   bytes to unicode using latin1
>   unicode to bytes

Where do you get this from? I don't follow your logic. Start with a text 
template:

template = """\xDE\xAD\xBE\xEF
Name:\0\0\0%s
Age:\0\0\0\0%d
Data:\0\0\0%s
blah blah blah
"""

data = template % ("George", 42, blob.decode('latin-1'))

Only the binary blobs need to be decoded. We don't need to encode the 
template to bytes, and the textual data doesn't get encoded until we're 
ready to send it across the wire or write it to disk. And when we do, 
since all the code points are in the range U+0000 to U+00FF, encoding it 
to Latin-1 ought to be a fast, efficient operation, possibly even just a 
memory copy.
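Putting the whole workflow together, a sketch might look like this (a shortened `template`, and `blob` standing in for some binary field): all the formatting happens on str, and there is exactly one encode at the very end.

```python
# Shortened version of the template above; all formatting is done on str.
template = "Name:\0\0\0%s\nAge:\0\0\0\0%d\nData:\0\0\0%s\n"
blob = b'\xde\xad\xbe\xef'   # a stand-in binary field

# Only the binary blob needs decoding; the text fields are already text.
record = template % ("George", 42, blob.decode('latin-1'))

# One encode, at the wire/disk boundary. Every code point in `record`
# is <= U+00FF, so encoding to Latin-1 cannot fail here.
payload = record.encode('latin-1')
assert payload == (b'Name:\x00\x00\x00George\n'
                   b'Age:\x00\x00\x00\x0042\n'
                   b'Data:\x00\x00\x00\xde\xad\xbe\xef\n')
```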

It's true that the individual binary data fields will need to be 
decoded from bytes, but unless you want Python to guess an encoding 
(which is the old, broken Python 2 model), you're going to have to do 
that regardless.


> Even if the cost of moving those bytes around is cheap, it's not free.  
> When you're creating hundreds of PDFs at a time that's going to make a 
> difference.

You've profiled it? Unless you've measured it, it doesn't exist. I'm not 
going to debate performance penalties of code you haven't written yet.



-- 
Steven

