[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
steve at pearwood.info
Sat Jan 11 19:36:32 CET 2014
On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
> On 01/11/2014 07:38 AM, Steven D'Aprano wrote:
> >The point that I am making is that many people want to add formatting
> >operations to bytes so they can put ASCII strings inside bytes. But (as
> >far as I can tell) they don't need to do this, because they can treat
> >Unicode strings containing code points U+0000 through U+00FF (i.e. the
> >same range as handled by Latin-1) as if they were bytes.
> So instead of blurring the line between bytes and text, you're blurring the
> line between text and bytes (with a few extra seat belts thrown in).
I'm not blurring anything. The people who designed the file format that
mixes textual data and binary data did the blurring. Given that such
formats exist, it is inevitable that we need to put text into bytes, or
bytes into text. The situation is already blurred; we just have to
decide how to handle it. There are three broad strategies:
1) Make bytes more string-like, so that we can process our data as
bytes, but still do string operations on the bits that are ASCII.
2) Make strings more byte-like, so that we can process our data as
strings, but do byte operations (like bit mask operations) on the parts
that are binary data.
3) Don't do either. Keep the text parts of your data as text, and the
binary parts of your data as bytes. Do your text operations on text, and
your byte operations on bytes.
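As a rough sketch of the third strategy (the field names and values here are made up for illustration), the text part stays a str and gets string operations, while the binary part stays bytes and gets bit operations:

```python
# Hypothetical record: one textual field and one binary field.
name = "george"                      # text: use str methods
flags = bytes([0b10110101, 0xFF])    # binary: use integer/bit operations

# Text operation on text.
display_name = name.title()

# Byte operation on bytes: keep only the low nibble of each byte.
low_nibbles = bytes(b & 0x0F for b in flags)
```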
At some point, of course, they need to be combined. We have a choice:
* Right now, we can use text as the base, and combine bytes into the
text using Latin-1, and it Just Works.
* Or we can wait until (maybe) Python 3.5, when (perhaps) bytes objects
will be more text-like, and then use bytes as the base, and (with
luck) it Should Just Work.
There's another disadvantage with the second: treating bytes as if they
were ASCII by default reinforces the old, harmful "text is ASCII"
paradigm that we're trying to get away from. That's a bad, painful idea
that causes a lot of problems and buggy code, and should be resisted.
On the other hand, embedding arbitrary binary data in Unicode text
doesn't reinforce any common or harmful paradigms. It just requires the
programmer to forget about characters and concentrate on code points,
since Latin-1 maps bytes to code points in a very convenient way:
    Byte 0x00 maps to code point U+0000
    Byte 0x01 maps to code point U+0001
    Byte 0x02 maps to code point U+0002
    ...
    Byte 0xFF maps to code point U+00FF
So to embed the binary data 0xDEADBEEF in your string, you can just use
'\xDE\xAD\xBE\xEF' regardless of what character those code points happen
to be.
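A minimal sketch of that mapping, using only the built-in 'latin-1' codec (the variable names are illustrative): decoding sends byte N to code point U+00NN, and encoding reverses it without loss.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as Latin-1 maps byte N to code point U+00NN.
text = all_bytes.decode('latin-1')
code_points = [ord(ch) for ch in text]

# Encoding the text back recovers the original bytes exactly.
round_trip = text.encode('latin-1')

# The 0xDEADBEEF example written directly as a string literal.
marker = '\xDE\xAD\xBE\xEF'
marker_bytes = marker.encode('latin-1')
```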
If we are manipulating data *as if it were text*, then we ought to treat
it as text, not add methods to bytes that make bytes text-like. If we
are manipulating data *as if it were bytes*, doing byte-manipulation
operations like bit-masking, then we ought to treat it as numeric bytes,
not add numeric methods to text. Is that really a controversial opinion?
> Besides being a bit awkward, this also means that any encoded text (even
> the plain ASCII stuff) is now being transformed three times instead of one:
> unicode to bytes
> bytes to unicode using latin1
> unicode to bytes
Where do you get this from? I don't follow your logic. Start with a text
template:
template = """\xDE\xAD\xBE\xEF
%s %d %s
blah blah blah
"""
data = template % ("George", 42, blob.decode('latin-1'))
Only the binary blobs need to be decoded. We don't need to encode the
template to bytes, and the textual data doesn't get encoded until we're
ready to send it across the wire or write it to disk. And when we do,
since all the code points are in the range U+0000 to U+00FF, encoding it
to Latin-1 ought to be a fast, efficient operation, possibly even just a
memory copy.
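Here is one way the whole round trip might look. The blob contents and the field layout are assumptions made up for this sketch, not a real file format:

```python
# Hypothetical binary field; any byte values work.
blob = bytes([0x00, 0x7F, 0x80, 0xFF])

# Hypothetical text template: a binary marker followed by three fields.
template = "\xDE\xAD\xBE\xEFName: %s\nAge: %d\nData: %s\n"

# Only the binary blob is decoded; the other values are already text or numbers.
data = template % ("George", 42, blob.decode('latin-1'))

# Encode once, when the result is ready to be written out.
wire = data.encode('latin-1')
```

This only works while every code point stays in the range U+0000 to U+00FF. If a textual field contained anything above U+00FF, the final encode would raise UnicodeEncodeError rather than silently corrupt the output.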
It's true that the individual binary data fields will need to be decoded
from bytes, but unless you want Python to guess an encoding (which is
the old broken Python 2 model), you're going to have to do that
anyway.
> Even if the cost of moving those bytes around is cheap, it's not free.
> When you're creating hundreds of PDFs at a time that's going to make a
> difference.
You've profiled it? Unless you've measured it, it doesn't exist. I'm not
going to debate performance penalties of code you haven't written yet.
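For anyone who does want to measure it, a rough sketch with the standard timeit module might look like this. The payload size and repeat count are arbitrary assumptions, and the result will vary by machine and Python version:

```python
import timeit

# Hypothetical 1 MiB payload covering all byte values.
payload = bytes(range(256)) * 4096

def round_trip(data=payload):
    # Decode bytes to text, then encode back to bytes, both via Latin-1.
    return data.decode('latin-1').encode('latin-1')

# Time 100 round trips; the number is machine-dependent.
seconds = timeit.timeit(round_trip, number=100)
print(f"100 round trips of {len(payload)} bytes: {seconds:.4f} s")
```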