[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Steven D'Aprano
steve at pearwood.info
Sat Jan 11 16:38:39 CET 2014
On Sat, Jan 11, 2014 at 01:56:56PM +0100, Juraj Sukop wrote:
> On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> > If you consider PDF as binary with occasional pieces of ASCII text, then
> > working with bytes makes sense. But I wonder whether it might be better
> > to consider PDF as mostly text with some binary bytes. Even though the
> > bulk of the PDF will be binary, the interesting bits are text. E.g. your
> > example:
> >     10 0 obj
> >     << /Type /XObject
> >     /Width 100
> >     /Height 100
> >     /Alternates 15 0 R
> >     /Length 2167
> >     >>
> >     stream
> >     ...binary image data...
> >     endstream
> >     endobj
> > Even though the binary image data is probably much, much larger in
> > length than the text shown above, it's (probably) trivial to deal with:
> > convert your image data into bytes, decode those bytes into Latin-1,
> > then concatenate the Latin-1 string into the text above.
>
> This is similar to what Chris Barker suggested. I also don't try to be
> difficult here but please explain to me one thing. To treat bytes as if
> they were Latin-1 is bad idea,
Correct. Bytes are not Latin-1. Here are some bytes which represent a
word I extracted from a text file on my computer:
    b'\x8a\x75\xa7\x65\x72\x73\x74'
If you imagine that they are Latin-1, you might think that the word
is a C1 control character ("VTS", or Vertical Tabulation Set) followed
by "u§erst", but it is not. It is actually the German word "äußerst"
("extremely"), and the text file was generated on a 1990s vintage
Macintosh using the MacRoman "extended ASCII" code page.
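
As a quick check of the point above, here is a minimal sketch that decodes the same seven bytes under both encodings. It assumes Python's standard 'latin-1' and 'mac_roman' codecs; the variable names are made up for illustration.

```python
# The same seven bytes, interpreted under two different encodings.
raw = b'\x8a\x75\xa7\x65\x72\x73\x74'

# Wrong guess: Latin-1 never fails, so you silently get a C1 control
# character (U+008A) followed by "u§erst".
as_latin1 = raw.decode('latin-1')

# The encoding the file was actually written in.
as_macroman = raw.decode('mac_roman')

print(ascii(as_latin1))            # shows the escaped code points
print(as_macroman == "äußerst")    # compares against the intended word
```

Neither decode raises an error, which is exactly why guessing an encoding is dangerous: only one of the two results is the word the author typed.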
> that's why "%f" got dropped in the first
> place, right? How is it then alright to put an image inside an Unicode
> string?
The point that I am making is that many people want to add formatting
operations to bytes so they can put ASCII strings inside bytes. But (as
far as I can tell) they don't need to do this, because they can treat
Unicode strings containing code points U+0000 through U+00FF (i.e. the
same range as handled by Latin-1) as if they were bytes. This gives you:
- convenient syntax, no need to prefix strings with b;
- mostly avoid needing to decode and encode strings, except at a
  few points in your code;
- the full set of string methods;
- can easily include arbitrary octal or hex byte values, using \ooo
  and \xhh escapes;
- error checking: when you finally encode the text to bytes before
  writing to a file, or sending over a wire, any code-point greater
  than U+00FF will give you an exception unless explicitly silenced.
No need to wait for Python 3.5 to come out, you can do this *right now*.
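
To make the list above concrete, here is a small sketch of treating byte values as text. The record layout and the names (record, header, payload, wire) are hypothetical, chosen only to illustrate the idea.

```python
# Byte values written directly as text, using octal (\ooo) and hex (\xhh)
# escapes; no b prefix is needed.
record = "Tag:4\000s\x14q\xd9yzi\xf3"

# Ordinary str methods work on the whole thing.
header, _, payload = record.partition("\0")
count = int(header[len("Tag:"):])

# Encode only at the boundary, when real bytes are needed.
wire = record.encode('latin-1')

print(count, len(payload), wire)
```

Everything before the final encode is plain str handling, so the only places that need to know about bytes are the points where data enters or leaves the program.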
Of course, this is a little bit "unclean", it breaks the separation of
text and bytes by treating bytes *as if* they were Unicode code points,
which they are not, but I believe that this is a practical technique
which is not too hard to deal with. For instance, suppose I have a
mixed format which consists of an ASCII tag, a number written in ASCII,
a NUL separator, and some binary data:
    # Using bytes
    import struct
    values = [29460, 29145, 31098, 27123]
    # Pack each value as a big-endian signed 16-bit integer.
    blob = b"".join(struct.pack(">h", n) for n in values)
    data = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob

    => gives data = b'Tag:4\x00s\x14q\xd9yzi\xf3'
That's a bit ugly, but not too ugly. I could write code like that. But
if bytes had % formatting, I might write this instead:
    data = b"Tag:%d\0%s" % (len(values), blob)
This is a small improvement, but I can't use it until Python 3.5 comes
out. Or I could do this right now:
    # Using text
    values = [29460, 29145, 31098, 27123]
    blob = b"".join(struct.pack(">h", n) for n in values)
    data = "Tag:%d\0%s" % (len(values), blob.decode('latin-1'))

    => gives data = 'Tag:4\x00s\x14qÙyzió'
When I'm ready to transmit this over the wire, or write to disk, then I
encode, and get:
    data.encode('latin-1')

    => b'Tag:4\x00s\x14q\xd9yzi\xf3'
which is exactly the same as I got in the first place. In this case, I'm
not using Latin-1 for the semantics of bytes to characters (e.g. byte
\xf3 = char ó), but for the useful property that all 256 distinct bytes
are valid in Latin-1. Any other encoding with the same property will do.
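
The property being relied on can be checked directly. This is a minimal sketch; the names every_byte, text and utf8_ok are illustrative only, and UTF-8 is used merely as an example of an encoding that lacks the property.

```python
# All 256 possible byte values, in order.
every_byte = bytes(range(256))

# Latin-1 maps byte N to code point N, so decoding never fails...
text = every_byte.decode('latin-1')
assert [ord(c) for c in text] == list(range(256))

# ...and encoding gets back exactly the bytes we started with.
assert text.encode('latin-1') == every_byte

# By contrast, UTF-8 rejects many byte sequences, so it cannot be
# used for this trick.
try:
    every_byte.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_ok)
```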
It is a little unfortunate that struct gives bytes rather than a str,
but you can hide that with a simple helper function:
    def b2s(raw):
        # Latin-1 maps each byte to the code point with the same value.
        return raw.decode('latin1')

    data = "Tag:%d\0%s" % (len(values), b2s(blob))
> Also, apart from the in/out conversions, do any other difficulties come to
> your mind?
No. If you accidentally introduce a non-Latin1 code point, when you
encode you'll get an exception.
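
A short sketch of that failure mode, assuming a stray EURO SIGN (U+20AC) sneaks into the data. The names good, bad, caught and silenced are hypothetical.

```python
good = "Tag:%d\0%s" % (1, "\xf3")      # every code point is <= U+00FF
bad = "Tag:%d\0%s" % (1, "\u20ac")     # EURO SIGN is outside Latin-1

# The well-formed string encodes to one byte per code point.
good_bytes = good.encode('latin-1')

# The stray code point is caught at the boundary, not silently written.
try:
    bad.encode('latin-1')
    caught = False
except UnicodeEncodeError:
    caught = True

# The check can be silenced explicitly, at the cost of losing data.
silenced = bad.encode('latin-1', errors='replace')

print(good_bytes, caught, silenced)
```

The error handler is shown only to illustrate "unless explicitly silenced"; for binary formats you almost certainly want the exception.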
--
Steven