RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

(Sorry if this messes up the thread order, it is meant as a reply to the original RFC.)

Dear list,

newbie here. After much hesitation I decided to put forward a use case which bothers me about the current proposal. Disclaimer: I happen to write a library which is directly influenced by this.

As you may know, PDF operates over bytes, and an integer or floating-point number is written down as-is, for example "100" or "1.23". However, the proposal drops the "%d", "%f" and "%x" formats, and the suggested workaround for writing down a number is to use ".encode('ascii')", which I think has two problems:

One is that it needs to construct one additional object per formatting operation, as opposed to Python 2; it is not uncommon for a PDF file to contain millions of numbers.

The second problem is that, in my eyes, it is very counter-intuitive to require the use of str only to get formatting on bytes. Consider the case where a large bytes object is created out of many smaller bytes objects. If I want to format a part, I have to use str instead. For example:

    content = b''.join([
        b'header',
        b'some dictionary structure',
        b'part 1 abc',
        ('part 2 %.3f' % number).encode('ascii'),
        b'trailer'])

In the case of PDF, the embedding of an image into PDF looks like:

    10 0 obj
    << /Type /XObject /Width 100 /Height 100 /Alternates 15 0 R /Length 2167 >>
    stream
    ...binary image data...
    endstream
    endobj

Because of the image it makes sense to store such a structure inside bytes. On the other hand, there may well be another "obj" which contains the coordinates of Bezier paths:

    11 0 obj
    ...
    stream
    0.5 0.1 0.2 RG
    300 300 m 300 400 400 400 400 300 c b
    endstream
    endobj

To summarize, there are cases which mix "binary" and "text" and, in my opinion, dropping the bytes-formatting of numbers makes it more complicated than it was. I would appreciate any explanation of how:

    b'%.1f %.1f %.1f RG' % (r, g, b)

is more confusing than:

    b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b)))

A similar situation exists for HTTP ("Content-Length: 123") and ASCII STL ("vertex 1.0 0.0 0.0").

Thanks and have a nice day,

Juraj Sukop

PS: In case the proposal does not include number formatting, it would be nice to list there a set of guidelines or examples on how to proceed with porting Python 2 formats to Python 3.
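A minimal sketch of the two idioms being contrasted here (illustrative values, not from the original message):

    # Python 2 allowed direct interpolation:  b'%.1f %.1f %.1f RG' % (r, g, b)
    # The suggested Python 3 workaround: format as text, then encode to bytes.
    r, g, b = 0.5, 0.1, 0.2   # illustrative values
    op = ('%.1f %.1f %.1f RG' % (r, g, b)).encode('ascii')
    assert op == b'0.5 0.1 0.2 RG'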

On 1/10/2014 12:17 PM, Juraj Sukop wrote:
(Sorry if this messes-up the thread order, it is meant as a reply to the original RFC.)
Dear list,
newbie here. After much hesitation I decided to put forward a use case which bothers me about the current proposal. Disclaimer: I happen to write a library which is directly influenced by this.
As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example "100" or "1.23".
However, the proposal drops "%d", "%f" and "%x" formats and the suggested workaround for writing down a number is to use ".encode('ascii')", which I think has two problems:
One is that it needs to construct one additional object per formatting as opposed to Python 2; it is not uncommon for a PDF file to contain millions of numbers.
The second problem is that, in my eyes, it is very counter-intuitive to require the use of str only to get formatting on bytes. Consider the case where a large bytes object is created out of many smaller bytes objects. If I want to format a part, I have to use str instead. For example:
    content = b''.join([
        b'header',
        b'some dictionary structure',
        b'part 1 abc',
        ('part 2 %.3f' % number).encode('ascii'),
        b'trailer'])
I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. Since converting int and float to strings generates a very small range of ASCII characters ([0-9a-fx.-=], plus the uppercase versions), what problem is introduced by allowing int and float? The original str.format() work relied on this fact in its stringlib implementation. Eric.
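A quick illustrative check of that claim (a sketch):

    # Formatting int and float only ever produces ASCII characters:
    for text in ('%d' % -100, '%x' % 255, '%f' % 1.23, '%g' % 2.5e-10, '%e' % 10**30):
        assert all(ord(c) < 128 for c in text)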

On 10.01.2014 18:56, Eric V. Smith wrote:
On 1/10/2014 12:17 PM, Juraj Sukop wrote:
(Sorry if this messes-up the thread order, it is meant as a reply to the original RFC.)
Dear list,
newbie here. After much hesitation I decided to put forward a use case which bothers me about the current proposal. Disclaimer: I happen to write a library which is directly influenced by this.
As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example "100" or "1.23".
However, the proposal drops "%d", "%f" and "%x" formats and the suggested workaround for writing down a number is to use ".encode('ascii')", which I think has two problems:
One is that it needs to construct one additional object per formatting as opposed to Python 2; it is not uncommon for a PDF file to contain millions of numbers.
The second problem is that, in my eyes, it is very counter-intuitive to require the use of str only to get formatting on bytes. Consider the case where a large bytes object is created out of many smaller bytes objects. If I want to format a part, I have to use str instead. For example:
    content = b''.join([
        b'header',
        b'some dictionary structure',
        b'part 1 abc',
        ('part 2 %.3f' % number).encode('ascii'),
        b'trailer'])
I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion. Since converting int and float to strings generates a very small range of ASCII characters ([0-9a-fx.-=], plus the uppercase versions), what problem is introduced by allowing int and float? The original str.format() work relied on this fact in its stringlib implementation.
I agree. I would have needed bytes-formatting (with numbers) recently writing .rtf files. Georg

On Fri, 10 Jan 2014 12:56:19 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion.
If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else. Regards Antoine.

On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 12:56:19 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion.
If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else.
It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types. Eric.

On Fri, 10 Jan 2014 17:33:57 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 12:56:19 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion.
If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else.
It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types.
That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them? Regards Antoine.

On 01/10/2014 02:42 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 17:33:57 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 12:56:19 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion.
If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else.
It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types.
That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them?
Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation? -- ~Ethan~

On Fri, 10 Jan 2014 14:58:15 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/10/2014 02:42 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 17:33:57 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 12:56:19 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion.
If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else.
It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types.
That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them?
Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation?
Again, if you're representing "ASCII", you're representing text and should use a str object. Regards Antoine.

On 1/10/2014 6:02 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 14:58:15 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/10/2014 02:42 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 17:33:57 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 12:56:19 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion.
If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else.
It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types.
That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them?
Ah, I see. This is about the types that %s supports, not about support for %d and %f.
Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation?
Again, if you're representing "ASCII", you're representing text and should use a str object.
Yes, but is there existing 2.x code that uses %s for int and float (perhaps unwittingly), and do we want to "help" that code out? Or do we want to make porters first change to using %d or %f instead of %s? I'll grant you that we might be doing more harm than help by special-casing these types. I'm just asking. I think what you're getting at is that in addition to not calling __format__, we don't want to call __str__, either, for the same reason. Correct me if I'm off base, please. I'm not trying to put words in anyone's mouth. In any event, I think supporting %d and %f (and %i, %u, %x, %g, etc.) inside format strings would be useful. Eric.

On Fri, 10 Jan 2014 18:14:45 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation?
Again, if you're representing "ASCII", you're representing text and should use a str object.
Yes, but is there existing 2.x code that uses %s for int and float (perhaps unwittingly), and do we want to "help" that code out? Or do we want to make porters first change to using %d or %f instead of %s?
I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and %f on bytes objects.
I think what you're getting at is that in addition to not calling __format__, we don't want to call __str__, either, for the same reason.
Not only. We don't want to do anything that actually asks for a *textual* representation of something. %d and %f ask for a textual representation of a number, so they're right out. Regards Antoine.

To avoid implicit conversion between str and bytes, I propose adding only a limited %-format, not .format() or .format_map(). "Limited %-format" means:
- %c accepts an integer or bytes of length one.
- %r is not supported.
- %s accepts only bytes.
- %a is the only format that accepts an arbitrary object.
- All other formats behave the same as for str.
(A sketch of these rules follows below the quoted message.) On Sat, Jan 11, 2014 at 8:24 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Fri, 10 Jan 2014 18:14:45 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation?
Again, if you're representing "ASCII", you're representing text and should use a str object.
Yes, but is there existing 2.x code that uses %s for int and float (perhaps unwittingly), and do we want to "help" that code out? Or do we want to make porters first change to using %d or %f instead of %s?
I'm afraid you're misunderstanding me. The PEP doesn't allow for %d and %f on bytes objects.
I think what you're getting at is that in addition to not calling __format__, we don't want to call __str__, either, for the same reason.
Not only. We don't want to do anything that actually asks for a *textual* representation of something. %d and %f ask for a textual representation of a number, so they're right out.
Regards
Antoine.
-- INADA Naoki <songofacandy@gmail.com>
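A sketch of what those "limited %-format" rules would mean in practice (hypothetical semantics inferred from the description above, not an actual implementation at the time of this thread):

    # Hypothetical results under the proposed rules (shown as comments):
    b'%c' % 65        # -> b'A'    (accepts an integer ...)
    b'%c' % b'A'      # -> b'A'    (... or bytes of length one)
    b'%s' % b'data'   # -> b'data' (bytes only; a str argument raises TypeError)
    b'%a' % 3.14      # -> b'3.14' (the one format taking an arbitrary object)
    b'%d' % 10        # -> b'10'   (numeric codes behaving as for str)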

On 11 January 2014 08:58, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/10/2014 02:42 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 17:33:57 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
On 1/10/2014 5:29 PM, Antoine Pitrou wrote:
On Fri, 10 Jan 2014 12:56:19 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
I agree. I don't see any reason to exclude int and float. See Guido's messages http://bugs.python.org/issue3982#msg180423 and http://bugs.python.org/issue3982#msg180430 for some justification and discussion.
If you are representing int and float, you're really formatting a text message, not bytes. Basically if you allow the formatting of int and float instances, there's no reason not to allow the formatting of arbitrary objects through __str__. It doesn't make sense to special-case those two types and nothing else.
It might not for .format(), but I'm not convinced. But for %-formatting, str is already special-cased for these types.
That's not what I'm saying. str.__mod__ is able to represent all kinds of types through %s and calling __str__. It doesn't make sense for bytes.__mod__ to only support int and float. Why only them?
Because embedding the ASCII equivalent of ints and floats in byte streams is a common operation?
It's emphatically *NOT* a binary interpolation operation though - the binary representation of the integer 1 is the byte value 1, not the byte value 49. If you want the byte value 49 to appear in the stream, then you need to interpolate the *ASCII encoding* of the string "1", not the integer 1.

If you want to manipulate text representations, do it in the text domain. If you want to manipulate binary representations, do it in the binary domain. The *whole point* of the text model change in Python 3 is to force programmers to *decide* which domain they're operating in at any given point in time - while the approach of blurring the boundaries between the two can be convenient for wire protocol and file format manipulation, it is a horrendous bug magnet everywhere else.

PEP 460 is just about adding back some missing functionality in the binary domain (interpolating binary sequences together), not about bringing back the problematic text model that allows particular text representations to be interpreted as if they were also binary data.

That said, I actually think there's a valid use case for a Python 3 type that allows the bytes/text boundary to be blurred in making it easier to port certain kinds of Python 2 code to Python 3 (specifically, working with wire protocols and file formats that contain a mixture of encodings, but where all encodings are *known* to at least be ASCII compatible). It is highly unlikely that such a type will *ever* be part of the standard library, though - idiomatic Python 3 code shouldn't need it, affected Python 2 code *can* be ported without it (but may look more complicated due to the use of explicit decoding and encoding operations, rather than relying on implicit ones), and it should be entirely possible to implement it as an extension module (modulo one bug in CPython that may impact the approach, but we won't know for sure until people actually try it out).

Fortunately, after years of my suggesting the idea to almost everyone that complained about the move away from the broken POSIX text model in Python 3, Benno Rice has started experimenting with such a type based on a preliminary test case I wrote at linux.conf.au last week: https://github.com/jeamland/asciicompat/blob/master/tests/ncoghlan.py

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
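A two-line illustration of that distinction (struct is used here only to make the binary form explicit):

    import struct

    struct.pack('B', 1)       # b'\x01' -- the *binary* representation: byte value 1
    str(1).encode('ascii')    # b'1'    -- the ASCII text representation: byte value 49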

On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop <juraj.sukop@gmail.com> wrote:
As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example "100" or "1.23".
Just to be clear here -- is PDF specifically bytes+ascii? Or could there be some-other-encoding unicode in there? If so, then you really have a mess! If it is bytes+ascii, then it seems you could use a unicode object and encode/decode to latin-1. Perhaps still a bit klunkier than formatting directly into a bytes object, but workable. b'%.1f %.1f %.1f RG' % (r, g, b)
is more confusing than:
b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b)))
Let's see, I think that would be:

    u'%.1f %.1f %.1f RG' % (r, g, b)

then when you want to write it out: .encode('latin-1'). Dumping the binary data in would be a bit uglier; for the image example:

    stream ...binary image data... endstream endobj

    u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')

I think..... not too bad, though if nothing else an alias for latin-1 that made it clear it worked for this would be nice. Maybe ascii_plus_binary or something? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Fri, Jan 10, 2014 at 10:52 PM, Chris Barker <chris.barker@noaa.gov>wrote:
On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop <juraj.sukop@gmail.com>wrote:
As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example "100" or "1.23".
Just to be clear here -- is PDF specifically bytes+ascii?
Or could there be some-other-encoding unicode in there?
From the specs: "At the most fundamental level, a PDF file is a sequence of 8-bit bytes." But it is also possible to represent a PDF using printable ASCII + whitespace by using escapes and "filters". Then, there are also "text strings" which might be encoded in UTF-16.
What this all means is that the PDF objects are expressed in ASCII, "stream" objects like images and fonts may have a binary part, and I never saw those UTF-16 strings.

    u"stream\n%s\nendstream\nendobj" % binary_data.decode('latin-1')
The argument for dropping "%f" et al. has been that if something is a text, then it should be Unicode. Conversely, if it is not text, then it should not be Unicode.

On Fri, Jan 10, 2014 at 3:40 PM, Juraj Sukop <juraj.sukop@gmail.com> wrote:
What this all means is that the PDF objects are expressed in ASCII, "stream" objects like images and fonts may have a binary part, and I never saw those UTF-16 strings.
hmm -- I wonder if they are out there in the wild, though....
u"stream\n%s\nendstream\nendobj"%binary_data.decode('latin-1')
The argument for dropping "%f" et al. has been that if something is a text, then it should be Unicode. Conversely, if it is not text, then it should not be Unicode.
???? What I'm trying to demonstrate / test is that you can use unicode objects for mixed binary + ascii, if you make sure to encode/decode using latin-1 (any others?). The idea is that ascii can be seen/used as text, and other bytes are preserved, and you can ignore whatever meaning latin-1 gives them. Using unicode objects means that you can use the existing string formatting (%s); if you want to pass in binary blobs, you need to decode them as latin-1, creating a unicode object, which will get interpolated into your unicode object, and when that unicode gets encoded back to latin-1, the original bytes are preserved. I think this is confusing, as we are calling it latin-1 but not really using it that way, but it seems it should work. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
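A short sketch of the round trip being described (illustrative):

    # Every byte value survives a latin-1 decode/encode round trip:
    blob = bytes(range(256))
    assert blob.decode('latin-1').encode('latin-1') == blob

    # So a binary blob can pass through str formatting unharmed,
    # provided the final encode is also latin-1:
    out = ('stream\n%s\nendstream' % blob.decode('latin-1')).encode('latin-1')
    assert blob in out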

2014/1/10 Juraj Sukop <juraj.sukop@gmail.com>:
In the case of PDF, the embedding of an image into PDF looks like:
10 0 obj << /Type /XObject /Width 100 /Height 100 /Alternates 15 0 R /Length 2167 >> stream ...binary image data... endstream endobj
Why not build "10 0 obj ... stream" and "endstream endobj" in Unicode and then encode to ASCII? Example:

    data = b''.join((
        ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
        binary_image_data,
        ("endstream endobj").encode('ascii'),
    ))

Victor

On 1/10/2014 5:12 PM, Victor Stinner wrote:
2014/1/10 Juraj Sukop <juraj.sukop@gmail.com>:
In the case of PDF, the embedding of an image into PDF looks like:
10 0 obj << /Type /XObject /Width 100 /Height 100 /Alternates 15 0 R /Length 2167 >> stream ...binary image data... endstream endobj
Why not build "10 0 obj ... stream" and "endstream endobj" in Unicode and then encode to ASCII? Example:
    data = b''.join((
        ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
        binary_image_data,
        ("endstream endobj").encode('ascii'),
    ))
Isn't the point of the PEP to make it easier to port 2.x code to 3.5? Is there really existing code like this in 2.x? I think what we're trying to do is to make code that looks like: b'%d %d obj ... stream' % (10, 0) work in both 2.x and 3.5. But correct me if I'm wrong. I'll admit to not following 100% of these emails. Eric.

On Fri, 10 Jan 2014 17:20:32 -0500 "Eric V. Smith" <eric@trueblade.com> wrote:
Isn't the point of the PEP to make it easier to port 2.x code to 3.5? Is there really existing code like this in 2.x?
No, but so what? The point of the PEP is not to allow arbitrary Python 2 code to run without modification under Python 3. There's a reason we broke compatibility, and there's no way we're gonna undo that.
I think what we're trying to do is to make code that looks like: b'%d %d obj ... stream' % (10, 0) work in both 2.x and 3.5.
That's not what *I* am trying to do. As far as I'm concerned the aim of the PEP is to ease bytes interpolation, not to provide some kind of magical construct that will solve everyone's porting problems. Regards Antoine.

On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner <victor.stinner@gmail.com>wrote:
Why not build "10 0 obj ... stream" and "endstream endobj" in Unicode and then encode to ASCII? Example:
    data = b''.join((
        ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
        binary_image_data,
        ("endstream endobj").encode('ascii'),
    ))
The key is "encode to ASCII" which means that the result is bytes. Then, there is this "11 0 obj" which should also be bytes. But it has no "binary_image_data" - only lots of numbers waiting to be somehow converted to bytes. I already mentioned the problems with ".encode('ascii')" but it does not stop here. Numbers may appear not only inside "streams" but almost anywhere: in the header there is PDF version, an image has to have "width" and "height", at the end of PDF there is a structure containing offsets to all of the objects in file. Basically, to ".encode('ascii')" every possible number is not exactly simple or pretty.

On Sat, 11 Jan 2014 00:43:39 +0100 Juraj Sukop <juraj.sukop@gmail.com> wrote:
Basically, to ".encode('ascii')" every possible number is not exactly simple or pretty.
Well it strikes me that the PDF format itself is not exactly simple or pretty. It might be convenient that Python 2 allows you, in certain cases, to "ignore" encoding issues because the main text type is actually a bytestring, but under the Python 3 model there's no reason to allow the same shortcuts. Also, when you say you've never encountered UTF-16 text in PDFs, it sounds like those people who've never encountered any non-ASCII data in their programs. Regards Antoine.

On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou <solipsis@pitrou.net>wrote:
Also, when you say you've never encountered UTF-16 text in PDFs, it sounds like those people who've never encountered any non-ASCII data in their programs.
Let me clarify: one does not think in "writing text in Unicode"-terms in PDF. Instead, one records the sequence of "character codes" which correspond to "glyphs" or the glyph IDs directly. That's because one Unicode character may have more than one glyph and more characters can be shown as one glyph.

On Fri, Jan 10, 2014 at 9:13 PM, Juraj Sukop <juraj.sukop@gmail.com> wrote:
On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou <solipsis@pitrou.net>wrote:
Also, when you say you've never encountered UTF-16 text in PDFs, it sounds like those people who've never encountered any non-ASCII data in their programs.
Let me clarify: one does not think in "writing text in Unicode"-terms in PDF. Instead, one records the sequence of "character codes" which correspond to "glyphs" or the glyph IDs directly. That's because one Unicode character may have more than one glyph and more characters can be shown as one glyph.
AFAIK (and just for the record), there could be both Latin1 text and UTF-16 in a PDF (and other encodings too), depending on the font used:

    /Encoding /WinAnsiEncoding   (mostly latin1 "standard" fonts)
    /Encoding /Identity-H        (generally for unicode UTF-16 True Type "embedded" fonts)

For example, in PyFPDF (a PHP library ported to Python), the following code writes out text that could be encoded in two different encodings:

    s = sprintf("BT %.2f %.2f Td (%s) Tj ET", x*self.k, (self.h-y)*self.k, txt)

https://code.google.com/p/pyfpdf/source/browse/fpdf/fpdf.py#602 In Python2, txt is just a str, but in Python3 handling everything as a latin1 string obviously doesn't work for TTF in this case. Best regards Mariano Reingart http://www.sistemasagiles.com.ar http://reingart.blogspot.com

On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
AFAIK (and just for the record), there could be both Latin1 text and UTF-16 in a PDF (and other encodings too), depending on the font used: [...] In Python2, txt is just a str, but in Python3 handling everything as latin1 string obviously doesn't work for TTF in this case.
Nobody is suggesting that you use Latin-1 for *everything*. We're suggesting that you use it for blobs of binary data that represent arbitrary bytes. First you have to get your binary data in the first place, using whatever technique is necessary. Here's one way to get a blob of binary data:

    # encode four C shorts into a fixed-width struct
    struct.pack(">hhhh", 23, 42, 17, 99)

Here's another way:

    # encode a text string into UTF-16
    "My name is Steven".encode("utf-16be")

Both examples return a bytes object containing arbitrary bytes. How do you combine those arbitrary bytes with a string template while still keeping all code-points under U+0100? By decoding to Latin-1. -- Steven
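Putting the two examples together (a sketch using the values from the message; the '|' separator is illustrative):

    import struct

    packed = struct.pack('>hhhh', 23, 42, 17, 99)      # arbitrary bytes
    encoded = 'My name is Steven'.encode('utf-16be')   # more arbitrary bytes

    # Combine both blobs with a text template by decoding to latin-1,
    # then encode the result back to latin-1 to recover the exact bytes:
    record = ('%s|%s' % (packed.decode('latin-1'),
                         encoded.decode('latin-1'))).encode('latin-1')
    assert packed in record and encoded in record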

On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <steve@pearwood.info>wrote:
On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
AFAIK (and just for the record), there could be both Latin1 text and UTF-16 in a PDF (and other encodings too), depending on the font used: [...] In Python2, txt is just a str, but in Python3 handling everything as latin1 string obviously doesn't work for TTF in this case.
Nobody is suggesting that you use Latin-1 for *everything*. We're suggesting that you use it for blobs of binary data that represent arbitrary bytes. First you have to get your binary data in the first place, using whatever technique is necessary.
Just to check I understood what you are saying. Instead of writing:

    content = b'\n'.join([
        b'header',
        b'part 2 %.3f' % number,
        binary_image_data,
        utf16_string.encode('utf-16be'),
        b'trailer'])

it should now look like:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.encode('utf-16be').decode('latin-1'),
        'trailer']).encode('latin-1')

Correct?

On 12 Jan 2014 21:53, "Juraj Sukop" <juraj.sukop@gmail.com> wrote:
On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <steve@pearwood.info>
wrote:
On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
AFAIK (and just for the record), there could be both Latin1 text and
UTF-16
in a PDF (and other encodings too), depending on the font used: [...] In Python2, txt is just a str, but in Python3 handling everything as latin1 string obviously doesn't work for TTF in this case.
Nobody is suggesting that you use Latin-1 for *everything*. We're suggesting that you use it for blobs of binary data that represent arbitrary bytes. First you have to get your binary data in the first place, using whatever technique is necessary.
Just to check I understood what you are saying. Instead of writing:
content = b'\n'.join([ b'header', b'part 2 %.3f' % number, binary_image_data, utf16_string.encode('utf-16be'), b'trailer'])
it should now look like:
content = '\n'.join([ 'header', 'part 2 %.3f' % number, binary_image_data.decode('latin-1'), utf16_string.encode('utf-16be').decode('latin-1'), 'trailer']).encode('latin-1')
Why are you proposing to do the *join* in text space? Encode all the parts separately, concatenate them with b'\n'.join() (or whatever separator is appropriate). It's only the *text formatting operation* that needs to be done in text space and then explicitly encoded (and this example doesn't even need latin-1, ASCII is sufficient):

    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,
        utf16_string.encode('utf-16be'),
        b'trailer'])
Correct?
My updated version above is the reasonable way to do it in Python 3, and the one I consider clearly superior to reintroducing implicit encoding to ASCII as part of the core text model. This is why I *don't* have a problem with PEP 460 as it stands - it's just syntactic sugar for something you can already do with b''.join(), and thus not particularly controversial. It's only proposals that add any form of implicit encoding that silently switches from the text domain to the binary domain that conflict with the core Python 3 text model (although third party types remain largely free to do whatever they want). Cheers, Nick.

On Sun, Jan 12, 2014 at 2:16 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Why are you proposing to do the *join* in text space? Encode all the parts separately, concatenate them with b'\n'.join() (or whatever separator is appropriate). It's only the *text formatting operation* that needs to be done in text space and then explicitly encoded (and this example doesn't even need latin-1,ASCII is sufficient):
I apparently misunderstood what was Steven suggesting, thanks for the clarification.

On Sun, Jan 12, 2014 at 11:16:37PM +1000, Nick Coghlan wrote:
content = '\n'.join([ 'header', 'part 2 %.3f' % number, binary_image_data.decode('latin-1'), utf16_string.encode('utf-16be').decode('latin-1'), 'trailer']).encode('latin-1')
Why are you proposing to do the *join* in text space?
In defence of that, doing the join as text may be useful if you have additional text processing that you want to do after assembling the whole string, but before calling encode. Even if you intend to encode to bytes at the end, you might prefer to work in the text domain right until just before the end: - no need for b' prefixes; - indexing a string returns a 1-char string, not an int; - can use the full range of % formatting, etc. -- Steven

On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <steve@pearwood.info>wrote:
On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
AFAIK (and just for the record), there could be both Latin1 text and UTF-16 in a PDF (and other encodings too), depending on the font used: [...] In Python2, txt is just a str, but in Python3 handling everything as latin1 string obviously doesn't work for TTF in this case.
Nobody is suggesting that you use Latin-1 for *everything*. We're suggesting that you use it for blobs of binary data that represent arbitrary bytes. First you have to get your binary data in the first place, using whatever technique is necessary.
Just to check I understood what you are saying. Instead of writing:
content = b'\n'.join([ b'header', b'part 2 %.3f' % number, binary_image_data, utf16_string.encode('utf-16be'), b'trailer'])
Which doesn't work, since bytes don't support %f in Python 3.
it should now look like:
content = '\n'.join([ 'header', 'part 2 %.3f' % number, binary_image_data.decode('latin-1'), utf16_string.encode('utf-16be').decode('latin-1'), 'trailer']).encode('latin-1')
Correct?
Not quite as you show. First, "utf16_string" confuses me. What is it? If it is a Unicode string, i.e.:

    # Python 3 semantics
    type(utf16_string)  # => returns str

then the name is horribly misleading, and it is best handled like this:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string,  # Misleading name, actually Unicode string
        'trailer'])

Note that since it's text, and content is text, there is no need to encode then decode.

"UTF-16" is not another name for "Unicode". Unicode is a character set. UTF-16 is just one of a number of different encodings which map the 0x10FFFF distinct Unicode characters (actually "code points") to bytes. UTF-16 is one possible way to implement Unicode strings in memory, but not the only way. Python has used, or does use, four distinct implementations:

1) UTF-16 in "narrow builds"
2) UTF-32 in "wide builds"
3) a hybrid approach starting in Python 3.3, where strings are stored as either:
   3a) Latin-1
   3b) UCS-2
   3c) UTF-32
   depending on the content of the string.

So calling an arbitrary string "utf16_string" is misleading or wrong. On the other hand, if it is actually a bytes object which is the product of UTF-16 encoding, i.e.:

    type(utf16_string)  # => returns bytes

and those bytes were generated by "some text".encode("utf-16"), then it is already binary data and needs to be smuggled into the text string. Latin-1 is good for that:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.decode('latin-1'),
        'trailer'])

Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending:

    content.encode('utf-8')

(Don't use Latin-1, since it cannot handle the full range of text characters.) If that's not the case, then perhaps this is better suited to what you are doing:

    content = b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),
        binary_image_data,  # already bytes
        utf16_string,       # already bytes
        b'trailer'])

-- Steven
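The hybrid storage can be observed directly (a sketch for CPython 3.3+; sizes are approximate and include per-object overhead):

    import sys

    sys.getsizeof('a' * 1000)           # ~1 byte per character (Latin-1 storage)
    sys.getsizeof('\u20ac' * 1000)      # ~2 bytes per character (UCS-2 storage)
    sys.getsizeof('\U0001f600' * 1000)  # ~4 bytes per character (UCS-4 storage)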

Wait a second, this is how I understood it but what Nick said made me think otherwise... On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano <steve@pearwood.info>wrote:
On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano <steve@pearwood.info wrote:
Just to check I understood what you are saying. Instead of writing:
content = b'\n'.join([ b'header', b'part 2 %.3f' % number, binary_image_data, utf16_string.encode('utf-16be'), b'trailer'])
Which doesn't work, since bytes don't support %f in Python 3.
I know and this was an example of the ideal (for me, anyway) way of formatting bytes.
First, "utf16_string" confuses me. What is it? If it is a Unicode string, i.e.:
It is a Unicode string which happens to contain code points outside U+00FF (as with the TTF example above), so that it triggers the (at least) 2-bytes memory representation in CPython 3.3+. I agree, I chose the variable name poorly, my bad.
content = '\n'.join([ 'header', 'part 2 %.3f' % number, binary_image_data.decode('latin-1'), utf16_string, # Misleading name, actually Unicode string 'trailer'])
Which, because of that horribly-named variable, prevents the use of a simple memcpy and makes the image data occupy way more memory than when it was in simple bytes.
Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending:
Not really, I was interested to compare it to bytes formatting, hence it included the "encode()" as well.

On Mon, Jan 13, 2014 at 4:57 AM, Juraj Sukop <juraj.sukop@gmail.com> wrote:
On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano <steve@pearwood.info> wrote:
First, "utf16_string" confuses me. What is it? If it is a Unicode string, i.e.:
It is a Unicode string which happens to contain code points outside U+00FF (as with the TTF example above), so that it triggers the (at least) 2-bytes memory representation in CPython 3.3+. I agree, I chose the variable name poorly, my bad.
When I'm talking about Unicode strings based on their maximum codepoint, I usually call them something like "ASCII string", "Latin-1 string", "BMP string", and "SMP string". Still not wholly accurate, but less confusing than naming an encoding... oh wait, two of those _are_ encodings :| But you could use "narrow string" for the first two. Or "string(0..127)" for ASCII, "string(0..255)" for Latin-1, and then for consistency "string(0..65535)" and "string(0..1114111)" for the others, except that I doubt that'd be helpful :) At any rate, "BMP" as a term for "includes characters outside of Latin-1 but all on the Basic Multilingual Plane" would probably be close enough to get away with. ChrisA
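A small helper along those lines (hypothetical; the thresholds mirror the categories above):

    def string_range(s):
        # Classify a str by its highest code point.
        top = max(map(ord, s), default=0)
        if top < 0x80:
            return 'ASCII'
        if top < 0x100:
            return 'Latin-1'
        if top < 0x10000:
            return 'BMP'
        return 'SMP'

    string_range('abc')      # 'ASCII'
    string_range('caf\xe9')  # 'Latin-1'
    string_range('\u20ac')   # 'BMP'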

Steven D'Aprano writes:
then the name is horribly misleading, and it is best handled like this:
content = '\n'.join([ 'header', 'part 2 %.3f' % number, binary_image_data.decode('latin-1'), utf16_string, # Misleading name, actually Unicode string 'trailer'])
This loses bigtime, as any encoding that can handle non-latin1 in utf16_string will corrupt binary_image_data. OTOH, latin1 will raise on non-latin1 characters. utf16_string must be encoded appropriately then decoded by latin1 to be reencoded by latin1 on output.
On the other hand, if it is actually a bytes object which is the product of UTF-16 encoding, i.e.:
type(utf16_string) => returns bytes
and those bytes were generated by "some text".encode("utf-16"), then it is already binary data and needs to be smuggled into the text string. Latin-1 is good for that:
content = '\n'.join([ 'header', 'part 2 %.3f' % number, binary_image_data.decode('latin-1'), utf16_string.decode('latin-1'), 'trailer'])
Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending:
content.encode('utf-8')
(Don't use Latin-1, since it cannot handle the full range of text characters.)
This corrupts binary_image_data. Each byte > 127 will be replaced by two bytes. In the second case, you can use latin1 to encode, if it gives you what you want. This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes.
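The corruption point in a few lines (illustrative):

    blob = b'\x80\xff'            # bytes above 127
    text = blob.decode('latin-1')

    text.encode('latin-1')   # b'\x80\xff'          -- round-trips exactly
    text.encode('utf-8')     # b'\xc2\x80\xc3\xbf'  -- each byte becomes two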

On 01/12/2014 02:31 PM, Stephen J. Turnbull wrote:
This corrupts binary_image_data. Each byte > 127 will be replaced by two bytes. In the second case, you can use latin1 to encode, if it gives you what you want.
This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes.
And why I've been fighting Steven D'Aprano on it. -- ~Ethan~

Ethan Furman writes:
This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes.
And why I've been fighting Steven D'Aprano on it.
No, I think you haven't been fighting Steven d'A on "it". You're talking about parsing and generating structured binary files, he's talking about techniques for parsing and generating streams with no real structure above the byte or encoded character level. Of course you can implement the former with the latter using Python 3 "str", but it's ugly, maybe even painful if you need to encode binary blobs back to binary to process them. (More discussion in my other post, although I suspect you're not going to be terribly happy with that, either. ;-) This generally *is not* the case for the wire protocol guys. AFAICT they really do want to process things as streams of ASCII-compatible text, with the non-ASCII stuff treated as runs of uninterpreted bytes that are just passed through. So when you talk about "we", I suspect you are not the "we" everybody else is arguing with. In particular, AIUI your use case is not included in the use cases most of us -- including Steven -- are thinking about.

On 01/12/2014 04:02 PM, Stephen J. Turnbull wrote:
So when you talk about "we", I suspect you are not the "we" everybody else is arguing with. In particular, AIUI your use case is not included in the use cases most of us -- including Steven -- are thinking about.
Ah, so even in the minority I'm in the minority. :/ The "we" I am usually referring to are those of us who have to deal with the mixed ASCII/binary/encoded text files (a couple have spoken up about PDFs, and I have DBF). -- ~Ethan~

On Mon, Jan 13, 2014 at 07:31:16AM +0900, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
then the name is horribly misleading, and it is best handled like this:
content = '\n'.join([ 'header', 'part 2 %.3f' % number, binary_image_data.decode('latin-1'), utf16_string, # Misleading name, actually Unicode string 'trailer'])
This loses bigtime, as any encoding that can handle non-latin1 in utf16_string will corrupt binary_image_data. OTOH, latin1 will raise on non-latin1 characters. utf16_string must be encoded appropriately then decoded by latin1 to be reencoded by latin1 on output.
Of course you're right, but I have understood the above as being a sketch and not real code. (E.g. does "header" really mean the literal string "header", or does it stand in for something which is a header?) In real code, one would need to have some way of telling where the binary image data ends and the Unicode string begins. If I have misunderstood the situation, then my apologies for compounding the error [...]
Both examples assume that you intend to do further processing of content before sending it, and will encode just before sending:
content.encode('utf-8')
(Don't use Latin-1, since it cannot handle the full range of text characters.)
This corrupts binary_image_data. Each byte > 127 will be replaced by two bytes.
And reading it back using decode('utf-8') will replace those two bytes with a single byte, round-tripping exactly. Of course if you encode to UTF-8 and then try to read the binary data as raw bytes, you'll get corrupted data. But do people expect to do this? That's a genuine question -- again, I assumed (apparently wrongly) that the idea was to write the content out as *text* containing smuggled bytes, and read it back the same way.
In the second case, you can use latin1 to encode, if it gives you what you want.
This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes.
How would you smuggle a chunk of arbitrary bytes into a text string? Short of doing something like uuencoding it into ASCII, or equivalent. -- Steven

Steven D'Aprano writes:
Of course you're right, but I have understood the above as being a sketch and not real code. (E.g. does "header" really mean the literal string "header", or does it stand in for something which is a header?) In real code, one would need to have some way of telling where the binary image data ends and the Unicode string begins.
Sure, but I think in Ethan's case it's probably out of band. I have been assuming out of band.
This corrupts binary_image_data. Each byte > 127 will be replaced by two bytes.
And reading it back using decode('utf-8') will replace those two bytes with a single byte, round-tripping exactly.
True, but I'm assuming Ethan himself didn't choose DBF format.
Of course if you encode to UTF-8 and then try to read the binary data as raw bytes, you'll get corrupted data. But do people expect to do this?
People? Real People use Python, they wouldn't do that. :-) But the app that forced Ethan to deal with DBF might.
This kind of subtlety is precisely why MAL warned about use of latin1 to smuggle bytes.
How would you smuggle a chunk of arbitrary bytes into a text string? Short of doing something like uuencoding it into ASCII, or equivalent.
Arbitrary bytes as a chunk? I wouldn't do that, probably (see below), and it's not possible in Python 3 at present (in str ASCII codes always represent the corresponding ASCII character, they are never uninterpreted bytes). But if I know where the bytes are going to be in the str, I'd use latin1 or (encoding='ascii', errors='surrogateescape') depending on how well-controlled the processing is. If I really "own" those bytes, I might use latin1, and just "forget" all of the string-processing functions that care about character identity (eg, case manipulation). If the bytes might somehow end up leaking into the rest of the program, I'd use surrogateescape and live with the doubled space usage. But really, if it's not a wire-to-wire protocol kind of thing, I'd go ahead and create a proper model for the data, and text would be text, and chunks of arbitrary bytes would be bytes and integers would be integers....
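A sketch of the surrogateescape round trip mentioned above (illustrative bytes):

    raw = b'size=\xff\xfe'   # ASCII text mixed with stray non-ASCII bytes
    s = raw.decode('ascii', errors='surrogateescape')

    # The ASCII parts are ordinary text; the stray bytes become lone
    # surrogates and come back exactly on re-encoding:
    assert s.encode('ascii', errors='surrogateescape') == raw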

On 11Jan2014 00:43, Juraj Sukop <juraj.sukop@gmail.com> wrote:
On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner <victor.stinner@gmail.com>wrote:
What not building "10 0 obj ... stream" and "endstream endobj" in Unicode and then encode to ASCII? Example:
data = b''.join(( ("%d %d obj ... stream" % (10, 0)).encode('ascii'), binary_image_data, ("endstream endobj").encode('ascii'), ))
The key is "encode to ASCII" which means that the result is bytes. Then, there is this "11 0 obj" which should also be bytes. But it has no "binary_image_data" - only lots of numbers waiting to be somehow converted to bytes. I already mentioned the problems with ".encode('ascii')" but it does not stop here. Numbers may appear not only inside "streams" but almost anywhere: in the header there is PDF version, an image has to have "width" and "height", at the end of PDF there is a structure containing offsets to all of the objects in file. Basically, to ".encode('ascii')" every possible number is not exactly simple or pretty.
Hi Juraj, Might I suggest a helper function (outside the PEP scope) instead of arguing for support for %f et al? Thus:

    def bytify(things, encoding='ascii'):
        for thing in things:
            if isinstance(thing, bytes):
                yield thing
            else:
                yield str(thing).encode(encoding)

Then one's embedding in PDF might become, more readably:

    data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )

Of course, bytify might be augmented with whatever encoding facilities might suit your needs. Cheers, -- Cameron Simpson <cs@zip.com.au> We tend to overestimate the short-term impact of technological change and underestimate its long-term impact. - Amara's Law

On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson <cs@zip.com.au> wrote:
Hi Juraj,
Hello Cameron.
data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )
Thanks for the suggestion! The problem with "bytify" is that some items might require different formatting than other items. For example, in the "Cross-Reference Table" there are three different formats: non-padded integers ("1"), and zero-padded 10- and 5-digit integers ("0000000003", "65535").
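For instance, such an entry could still be formatted in text space and encoded once (a sketch assuming the usual 10- and 5-digit xref layout):

    offset, generation = 3, 65535
    entry = ('%010d %05d n \n' % (offset, generation)).encode('ascii')
    # entry == b'0000000003 65535 n \n'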

On 11Jan2014 13:15, Juraj Sukop <juraj.sukop@gmail.com> wrote:
On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson <cs@zip.com.au> wrote:
data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )
Thanks for the suggestion! The problem with "bytify" is that some items might require different formatting than other items. For example, in the "Cross-Reference Table" there are three different formats: non-padded integers ("1"), and zero-padded 10- and 5-digit integers ("0000000003", "65535").
Well, this is partly my point: you probably want to exert more control than is reasonable for the PEP to offer, and you're better off with a helper function of your own. In particular, aside from passing in a default char=>bytes encoding, you can provide your own format hooks.

str already provides a completish % suite, and you have no issue with encodings in that phase because it is all Unicode. So the points where you're treating PDF as text are probably best tackled as text and then encoded with a helper like bytify when you have to glom bytes and "textish" stuff together. Crude example, hacked up from yours:

    data = b''.join(bytify((
        "%d %d obj ... stream" % (10, 0),
        binary_image_data,
        "endstream endobj",
    )))

where bytify swallows your encoding decisions. Since encoding anything-not-bytes into a bytes sequence inherently involves an encoding decision, I think I'm +1 on the PEP's aim of never mixing bytes with non-bytes, keeping all the encoding decisions in the caller's hands. I quite understand not wanting to belabour the code with ".encode('ascii')", but that should be said somewhere, so best to do so yourself in as compact and ergonomic a fashion as possible. Cheers, -- Cameron Simpson <cs@zip.com.au> Serious error. All shortcuts have disappeared. Screen. Mind. Both are blank. - Haiku Error Messages http://www.salonmagazine.com/21st/chal/1998/02/10chal2.html

On Fri, Jan 10, 2014 at 06:17:02PM +0100, Juraj Sukop wrote:
As you may know, PDF operates over bytes and an integer or floating-point number is written down as-is, for example "100" or "1.23".
I'm sorry, I don't understand what you mean here. I'm honestly not trying to be difficult, but you sound confident that you understand what you are doing, yet your description doesn't make sense to me. To me, it looks like you are conflating bytes and ASCII characters, that is, assuming that characters "are" in some sense identical to their ASCII representation. Let me explain:

The integer that in English is written as 100 is represented in memory as bytes 0x0064 (assuming a big-endian C short), so when you say "an integer is written down AS-IS" (emphasis added), to me that says that the PDF file includes the bytes 0x0064. But then you go on to write the three-character string "100", which (assuming ASCII) is the bytes 0x313030. Going from the C short to the ASCII representation 0x313030 is nothing like inserting the int "as-is". To put it another way, the Python 2 '%d' format code does not just copy bytes.

I think that what you are trying to say is that a PDF file is a binary file which includes some ASCII-formatted text fields. So when writing an integer 100, rather than writing it "as is" - which would be byte 0x64 (with however many leading null bytes needed for padding) - it is converted to the ASCII representation 0x313030 first, and that's what needs to be inserted.

If you consider PDF as binary with occasional pieces of ASCII text, then working with bytes makes sense. But I wonder whether it might be better to consider PDF as mostly text with some binary bytes. Even though the bulk of the PDF will be binary, the interesting bits are text. E.g. your example:
In the case of PDF, the embedding of an image into PDF looks like:
10 0 obj << /Type /XObject /Width 100 /Height 100 /Alternates 15 0 R /Length 2167 >> stream ...binary image data... endstream endobj
Even though the binary image data is probably much, much larger in length than the text shown above, it's (probably) trivial to deal with: convert your image data into bytes, decode those bytes into Latin-1, then concatenate the Latin-1 string into the text above. Latin-1 has the nice property that every byte decodes into the character with the same code point, and vice versa. So:

    for i in range(256):
        assert bytes([i]).decode('latin-1') == chr(i)
        assert chr(i).encode('latin-1') == bytes([i])

passes. It seems to me that your problem goes away if you use Unicode text with embedded binary data, rather than binary data with embedded ASCII text. Then when writing the file to disk, of course you encode it to Latin-1, either explicitly:

    pdf = ...  # Unicode string containing the PDF contents
    with open("outfile.pdf", "wb") as f:
        f.write(pdf.encode("latin-1"))

or implicitly:

    with open("outfile.pdf", "w", encoding="latin-1") as f:
        f.write(pdf)

There may be a few wrinkles I haven't thought of, I don't claim to be an expert on PDF. But I see no reason why PDF files ought to be an exception to the rule:

* work internally with Unicode text;
* convert to and from bytes only on input and output.

Please also take note that in Python 3.3 and better, the internal representation of Unicode strings containing only code points up to 255 (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte per character. Another advantage is that using text rather than bytes means that your example: [...]
dropping the bytes-formatting of numbers makes it more complicated than it was. I would appreciate any explanation on how:
b'%.1f %.1f %.1f RG' % (r, g, b)
becomes simply

    '%.1f %.1f %.1f RG' % (r, g, b)

in Python 3. In Python 3.3 and above, it can be written as:

    u'%.1f %.1f %.1f RG' % (r, g, b)

which conveniently is exactly the same syntax you would use in Python 2. That's *much* nicer than your suggestion:
is more confusing than:
b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b)))
-- Steven

On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I'm sorry, I don't understand what you mean here. I'm honestly not trying to be difficult, but you sound confident that you understand what you are doing, but your description doesn't make sense to me. To me, it looks like you are conflating bytes and ASCII characters, that is, assuming that characters "are" in some sense identical to their ASCII representation. Let me explain:
The integer that in English is written as 100 is represented in memory as bytes 0x0064 (assuming a big-endian C short), so when you say "an integer is written down AS-IS" (emphasis added), to me that says that the PDF file includes the bytes 0x0064. But then you go on to write the three character string "100", which (assuming ASCII) is the bytes 0x313030. Going from the C short to the ASCII representation 0x313030 is nothing like inserting the int "as-is". To put it another way, the Python 2 '%d' format code does not just copy bytes.
Sorry, I should've included an example: when I said "as-is" I meant "1", "0", "0", so that would be your "0x313030".
If you consider PDF as binary with occasional pieces of ASCII text, then working with bytes makes sense. But I wonder whether it might be better to consider PDF as mostly text with some binary bytes. Even though the bulk of the PDF will be binary, the interesting bits are text. E.g. your example:
Even though the binary image data is probably much, much larger in length than the text shown above, it's (probably) trivial to deal with: convert your image data into bytes, decode those bytes into Latin-1, then concatenate the Latin-1 string into the text above.
This is similar to what Chris Barker suggested. I am also not trying to be difficult here, but please explain one thing to me. To treat bytes as if they were Latin-1 is a bad idea, and that's why "%f" got dropped in the first place, right? How is it then alright to put an image inside a Unicode string? Also, apart from the in/out conversions, do any other difficulties come to your mind?

Please also take note that in Python 3.3 and better, the internal representation of Unicode strings containing only code points up to 255 (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte per character.
I guess you meant [C]Python... In any case, thanks for the detailed reply.

On Sat, Jan 11, 2014 at 01:56:56PM +0100, Juraj Sukop wrote:
On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano <steve@pearwood.info> wrote:
If you consider PDF as binary with occasional pieces of ASCII text, then working with bytes makes sense. But I wonder whether it might be better to consider PDF as mostly text with some binary bytes. Even though the bulk of the PDF will be binary, the interesting bits are text. E.g. your example:
10 0 obj
<< /Type /XObject
   /Width 100
   /Height 100
   /Alternates 15 0 R
   /Length 2167
>>
stream
...binary image data...
endstream
endobj
Even though the binary image data is probably much, much larger in length than the text shown above, it's (probably) trivial to deal with: convert your image data into bytes, decode those bytes into Latin-1, then concatenate the Latin-1 string into the text above.
This is similar to what Chris Barker suggested. I am also not trying to be difficult here, but please explain one thing to me. To treat bytes as if they were Latin-1 is a bad idea,
Correct. Bytes are not Latin-1. Here are some bytes which represent a word I extracted from a text file on my computer: b'\x8a\x75\xa7\x65\x72\x73\x74' If you imagine that they are Latin-1, you might think that the word is a C1 control character ("VTS", or Vertical Tabulation Set) followed by "u§erst", but it is not. It is actually the German word "äußerst" ("extremely"), and the text file was generated on a 1990s vintage Macintosh using the MacRoman "extended ASCII" code page.
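The two readings are easy to check interactively (Python spells the codec 'mac_roman'):

    py> b'\x8a\x75\xa7\x65\x72\x73\x74'.decode('mac_roman')
    'äußerst'
    py> b'\x8a\x75\xa7\x65\x72\x73\x74'.decode('latin-1')
    '\x8au§erst'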
and that's why "%f" got dropped in the first place, right? How is it then alright to put an image inside a Unicode string?
The point that I am making is that many people want to add formatting operations to bytes so they can put ASCII strings inside bytes. But (as far as I can tell) they don't need to do this, because they can treat Unicode strings containing code points U+0000 through U+00FF (i.e. the same range as handled by Latin-1) as if they were bytes. This gives you:

- convenient syntax, no need to prefix strings with b;
- mostly avoiding the need to decode and encode strings, except at a few points in your code;
- the full set of string methods;
- the ability to include arbitrary byte values easily, using octal and hex escapes (\ooo and \xhh);
- error checking: when you finally encode the text to bytes before writing to a file, or sending it over a wire, any code point greater than U+00FF will give you an exception unless explicitly silenced.

No need to wait for Python 3.5 to come out, you can do this *right now*. Of course, this is a little bit "unclean": it breaks the separation of text and bytes by treating bytes *as if* they were Unicode code points, which they are not. But I believe that this is a practical technique which is not too hard to deal with. For instance, suppose I have a mixed format which consists of an ASCII tag, a number written in ASCII, a NULL separator, and some binary data:

    # Using bytes
    values = [29460, 29145, 31098, 27123]
    blob = b"".join(struct.pack(">h", n) for n in values)
    data = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob

=> gives data = b'Tag:4\x00s\x14q\xd9yzi\xf3'

That's a bit ugly, but not too ugly. I could write code like that. But if bytes had % formatting, I might write this instead:

    data = b"Tag:%d\0%s" % (len(values), blob)

This is a small improvement, but I can't use it until Python 3.5 comes out. Or I could do this right now:

    # Using text
    values = [29460, 29145, 31098, 27123]
    blob = b"".join(struct.pack(">h", n) for n in values)
    data = "Tag:%d\0%s" % (len(values), blob.decode('latin-1'))

=> gives data = 'Tag:4\x00s\x14qÙyzió'

When I'm ready to transmit this over the wire, or write it to disk, then I encode, and get:

    data.encode('latin-1')

=> b'Tag:4\x00s\x14q\xd9yzi\xf3'

which is exactly the same as I got in the first place. In this case, I'm not using Latin-1 for the semantics of bytes to characters (e.g. byte \xf3 = char ó), but for the useful property that all 256 distinct byte values are valid in Latin-1. Any other encoding with the same property will do. It is a little unfortunate that struct gives bytes rather than a str, but you can hide that with a simple helper function:

    def b2s(bs):
        return bs.decode('latin-1')

    data = "Tag:%d\0%s" % (len(values), b2s(blob))
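To see the round trip end to end, here is the same example made self-contained (only the import is new):

    import struct

    values = [29460, 29145, 31098, 27123]
    blob = b"".join(struct.pack(">h", n) for n in values)

    # Build the record in the text domain, smuggling the blob in via Latin-1.
    data = "Tag:%d\0%s" % (len(values), blob.decode('latin-1'))

    # Encoding back to Latin-1 recovers the original bytes exactly.
    assert data.encode('latin-1') == b"Tag:4\0" + blob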
Also, apart from the in/out conversions, do any other difficulties come to your mind?
No. If you accidentally introduce a non-Latin-1 code point, you'll get an exception when you finally encode. -- Steven

On 01/11/2014 07:38 AM, Steven D'Aprano wrote:
The point that I am making is that many people want to add formatting operations to bytes so they can put ASCII strings inside bytes. But (as far as I can tell) they don't need to do this, because they can treat Unicode strings containing code points U+0000 through U+00FF (i.e. the same range as handled by Latin-1) as if they were bytes.
So instead of blurring the line between bytes and text, you're blurring the line between text and bytes (with a few extra seat belts thrown in). Besides being a bit awkward, this also means that any encoded text (even the plain ASCII stuff) is now being transformed three times instead of one:

    unicode to bytes
    bytes to unicode using latin1
    unicode to bytes

Even if the cost of moving those bytes around is cheap, it's not free. When you're creating hundreds of PDFs at a time that's going to make a difference. -- ~Ethan~

On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
On 01/11/2014 07:38 AM, Steven D'Aprano wrote:
The point that I am making is that many people want to add formatting operations to bytes so they can put ASCII strings inside bytes. But (as far as I can tell) they don't need to do this, because they can treat Unicode strings containing code points U+0000 through U+00FF (i.e. the same range as handled by Latin-1) as if they were bytes.
So instead of blurring the line between bytes and text, you're blurring the line between text and bytes (with a few extra seat belts thrown in).
I'm not blurring anything. The people who designed the file format that mixes textual data and binary data did the blurring. Given that such formats exist, it is inevitable that we need to put text into bytes, or bytes into text. The situation is already blurred; we just have to decide how to handle it. There are three broad strategies:

1) Make bytes more string-like, so that we can process our data as bytes, but still do string operations on the bits that are ASCII.

2) Make strings more byte-like, so that we can process our data as strings, but do byte operations (like bit-mask operations) on the parts that are binary data.

3) Don't do either. Keep the text parts of your data as text, and the binary parts of your data as bytes. Do your text operations on text, and your byte operations on bytes.

At some point, of course, they need to be combined. We have a choice:

* Right now, we can use text as the base, and combine bytes into the text using Latin-1, and it Just Works.

* Or we can wait until (maybe) Python 3.5, when (perhaps) bytes objects will be more text-like, and then use bytes as the base, and (with luck) it Should Just Work.

There's another disadvantage with the second: treating bytes as if they were ASCII by default reinforces the same old harmful paradigm that text is ASCII, the very paradigm we're trying to get away from. That's a bad, painful idea that causes a lot of problems and buggy code, and should be resisted. On the other hand, embedding arbitrary binary data in Unicode text doesn't reinforce any common or harmful paradigms. It just requires the programmer to forget about characters and concentrate on code points, since Latin-1 maps bytes to code points in a very convenient way:

    Byte 0x00 maps to code point U+0000
    Byte 0x01 maps to code point U+0001
    Byte 0x02 maps to code point U+0002
    ...
    Byte 0xFF maps to code point U+00FF

So to embed the binary data 0xDEADBEEF in your string, you can just use '\xDE\xAD\xBE\xEF', regardless of what characters those code points happen to be.

If we are manipulating data *as if it were text*, then we ought to treat it as text, not add methods to bytes that make bytes text-like. If we are manipulating data *as if it were bytes*, doing byte-manipulation operations like bit-masking, then we ought to treat it as numeric bytes, not add numeric methods to text. Is that really a controversial opinion?
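For example, round-tripping those four bytes through a str:

    py> s = 'header: \xDE\xAD\xBE\xEF'
    py> s.encode('latin-1')
    b'header: \xde\xad\xbe\xef'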
Besides being a bit awkward, this also means that any encoded text (even the plain ASCII stuff) is now being transformed three times instead of one:
unicode to bytes
bytes to unicode using latin1
unicode to bytes
Where do you get this from? I don't follow your logic. Start with a text template:

    template = """\xDE\xAD\xBE\xEF
    Name:\0\0\0%s
    Age:\0\0\0\0%d
    Data:\0\0\0%s
    blah blah blah
    """

    data = template % ("George", 42, blob.decode('latin-1'))

Only the binary blobs need to be decoded. We don't need to encode the template to bytes, and the textual data doesn't get encoded until we're ready to send it across the wire or write it to disk. And when we do, since all the code points are in the range U+0000 to U+00FF, encoding it to Latin-1 ought to be a fast, efficient operation, possibly even just a memory copy. It's true that the individual binary data fields will need to be decoded from bytes, but unless you want Python to guess an encoding (which is the old broken Python 2 model), you're going to have to do that regardless.
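To make that concrete, the whole thing runs as-is once you give it a blob (the blob here is just an arbitrary stand-in):

    import struct

    blob = struct.pack(">3h", 1, 2, 3)  # any binary data will do

    template = """\xDE\xAD\xBE\xEF
    Name:\0\0\0%s
    Age:\0\0\0\0%d
    Data:\0\0\0%s
    blah blah blah
    """
    data = template % ("George", 42, blob.decode('latin-1'))

    # Every code point is below U+0100, so this encode cannot fail
    # and is byte-for-byte faithful.
    raw = data.encode('latin-1')
    assert blob in raw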
Even if the cost of moving those bytes around is cheap, it's not free. When you're creating hundreds of PDFs at a time that's going to make a difference.
You've profiled it? Unless you've measured it, it doesn't exist. I'm not going to debate performance penalties of code you haven't written yet. -- Steven

On 01/11/2014 10:36 AM, Steven D'Aprano wrote:
On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
unicode to bytes
bytes to unicode using latin1
unicode to bytes
Where do you get this from? I don't follow your logic. Start with a text template:
template = """\xDE\xAD\xBE\xEF
Name:\0\0\0%s
Age:\0\0\0\0%d
Data:\0\0\0%s
blah blah blah
"""
data = template % ("George", 42, blob.decode('latin-1'))
Only the binary blobs need to be decoded. We don't need to encode the template to bytes, and the textual data doesn't get encoded until we're ready to send it across the wire or write it to disk.
And what if your name field has data not representable in latin-1?

    --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
    u'\u0441\u0440\u0403'
    --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)

So really your example should be:

    data = template % ("George".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'),
                       42, blob.decode('latin-1'))

Which is a mess. -- ~Ethan~

On Sat, Jan 11, 2014 at 11:05:36AM -0800, Ethan Furman wrote:
On 01/11/2014 10:36 AM, Steven D'Aprano wrote:
On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
unicode to bytes
bytes to unicode using latin1
unicode to bytes
Where do you get this from? I don't follow your logic. Start with a text template:
template = """\xDE\xAD\xBE\xEF
Name:\0\0\0%s
Age:\0\0\0\0%d
Data:\0\0\0%s
blah blah blah
"""
data = template % ("George", 42, blob.decode('latin-1'))
Since the use-cases people have been speaking about include only ASCII (or at most, Latin-1) text and arbitrary binary bytes, my example is limited to showing only ASCII text. But it will work with any text data, so long as you have a well-defined format that lets you tell which parts are interpreted as text and which parts as binary data. If your file format is not well-defined, then you have bigger problems than dealing with text versus bytes.
Only the binary blobs need to be decoded. We don't need to encode the template to bytes, and the textual data doesn't get encoded until we're ready to send it across the wire or write it to disk.
And what if your name field has data not representable in latin-1?
--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
u'\u0441\u0440\u0403'
Where did you get those bytes from? You got them from somewhere. Who knows? Who cares? Once you have bytes, you can treat them as a blob of arbitrary bytes and write them to the record using the Latin-1 trick. If you're reading those bytes from some stream that gives you bytes, you don't have to care where they came from.

But what if you don't start with bytes? If you start with a bunch of floats, you'll probably convert them to bytes using the struct module. If you start with non-ASCII text, you have to convert them to bytes too. No difference here. You ask the user for their name, they answer "срЃ", which is given to you as a Unicode string, and you want to include it in your data record. The specifications of your file format aren't clear, so I'm going to assume that:

1) ASCII text is allowed "as-is" (that is, the name "George" will be in the final data file as b'George');

2) any other non-ASCII text will be encoded as some fixed encoding which we can choose to suit ourselves (if the encoding is fixed by the file format, then just use that);

3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up being written as byte N, for any value of N between 0 and 255).

So, to write the ASCII name "George", we can just use

    "Name:\0\0\0%s" % "George"

since we know it is already ASCII. (It's a literal, so that's obvious. But see below.) To write arbitrary binary data, we take the *bytes* and decode to Latin-1:

    blob = bunch_o_bytes()  # Completely arbitrary.
    "Data:\0\0\0%s" % blob.decode('latin-1')

Combine those two techniques to deal with non-ASCII names. First you have to get the non-ASCII name converted to *arbitrary bytes*, so any encoding that deals with the whole range of Unicode will do. Then you convert those arbitrary bytes into Latin-1. Here I'll use UTF-32, just because I can and I feel like being wasteful:

    "Name:\0\0\0%s" % "срЃ".encode("utf-32be").decode("latin-1")

UTF-8 is a better choice, because it doesn't use as much space and gives you something which looks like ASCII in a hex editor:

    name = "George" if random.random() < 0.5 else "срЃ"
    "Name:\0\0\0%s" % name.encode("utf-8").decode("latin-1")

If you don't know whether your name is pure ASCII, then you have to encode first. Otherwise how do you know what bytes to use? Aside: if this point is not *bleedingly obvious*, then you need to read Joel on Software on Unicode RIGHT NOW. http://www.joelonsoftware.com/articles/Unicode.html

If the name data happens to be pure ASCII, then encoding to UTF-8 and decoding with Latin-1 ends up being a no-op:

    py> "George".encode("utf-8").decode("latin-1")
    'George'

Of course, if I know that the name is ASCII ahead of time (I wrote it as a literal, so I think I would know...) then I can short-cut the whole process and just do this:

    "Name:\0\0\0%s" % name_which_is_guaranteed_to_be_ascii

If I screw up and insert a non-Latin-1 character, then when I eventually write it to a file, it will give me a Unicode error, exactly as it should.

I've assumed that I can pick the encoding. That's rather like assuming that, given a bunch of floats, I can pick whether to represent them as C doubles or singles or something else, whatever suits my purposes. If I'm dealing with some existing file format, it probably defines the encoding, either explicitly or implicitly. When I don't have the choice of encoding, but have to use some damned stupid legacy encoding that only includes a fraction of Unicode, then:

    name.encode("legacy encoding", errors="whatever")

will give me the bytes I need to use the Latin-1 trick on.
This whole thing can be wrapped in a tiny helper function:

    def bytify(text, encoding="utf-8", errors="ignore"):
        # pick your own appropriate encoding and error handler
        return text.encode(encoding, errors).decode('latin-1')
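which round-trips cleanly:

    py> bytify("срЃ", "cp1251")
    'ñð\x81'
    py> bytify("срЃ", "cp1251").encode('latin-1').decode('cp1251')
    'срЃ'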
--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
That is backwards to what I've shown. Look at my earlier example again:

    data = template % ("George", 42, blob.decode('latin-1'))

Bytes get DECODED to latin-1, not encoded.

Bytes -> text is *decoding*
Text -> bytes is *encoding*
So really your example should be:
data = template % ("George".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 42, blob.decode('latin-1'))
Which is a mess.
Obviously it is stupid and wasteful to do that to a literal that you know is ASCII. But if you don't know what the contents of the string are, how do you know what bytes need to be written unless you encode to bytes first? -- Steven

On 01/11/2014 06:29 PM, Steven D'Aprano wrote:
On Sat, Jan 11, 2014 at 11:05:36AM -0800, Ethan Furman wrote:
On 01/11/2014 10:36 AM, Steven D'Aprano wrote:
On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
unicode to bytes
bytes to unicode using latin1
unicode to bytes
Where do you get this from? I don't follow your logic. Start with a text template:
template = """\xDE\xAD\xBE\xEF
Name:\0\0\0%s
Age:\0\0\0\0%d
Data:\0\0\0%s
blah blah blah
"""
data = template % ("George", 42, blob.decode('latin-1'))
Since the use-cases people have been speaking about include only ASCII (or at most, Latin-1) text and arbitrary binary bytes, my example is limited to showing only ASCII text. But it will work with any text data, so long as you have a well-defined format that lets you tell which parts are interpreted as text and which parts as binary data.
Since you're talking to me, it would be nice if you addressed the same use-case I was addressing, which is mixed: ascii-encoded text, ascii-encoded numbers, ascii-encoded bools, binary-encoded numbers, and misc-encoded text. And no, your example will not work with any text; it would completely mojibake my dbf files.
Only the binary blobs need to be decoded. We don't need to encode the template to bytes, and the textual data doesn't get encoded until we're ready to send it across the wire or write it to disk.
No! When I have text, part of which gets ascii-encoded and part of which gets, say, cp1251 encoded, I cannot wait till the end!
And what if your name field has data not representable in latin-1?
--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
u'\u0441\u0440\u0403'
Where did you get those bytes from? You got them from somewhere.
For the sake of argument, pretend a user entered them in.
Who knows? Who cares? Once you have bytes, you can treat them as a blob of arbitrary bytes and write them to the record using the Latin-1 trick.
No, I can't. See above.
If you're reading those bytes from some stream that gives you bytes, you don't have to care where they came from.
You're kidding, right? If I don't know where they came from (a graphics field? a note field?) how am I going to know how to treat them?
But what if you don't start with bytes? If you start with a bunch of floats, you'll probably convert them to bytes using the struct module.
Yup, and I do.
If you start with non-ASCII text, you have to convert them to bytes too. No difference here.
Really? You just said above that "it will work with any text data" -- you can't have it both ways.
You ask the user for their name, they answer "срЃ" which is given to you as a Unicode string, and you want to include it in your data record. The specifications of your file format aren't clear, so I'm going to assume that:
1) ASCII text is allowed "as-is" (that is, the name "George" will be in the final data file as b'George');
User data is not (typically) where the ASCII data is, but some of the metadata is definitely and always ASCII. The user text data needs to be encoded using whichever codec is specified by the file, which is only occasionally ASCII.
2) any other non-ASCII text will be encoded as some fixed encoding which we can choose to suit ourselves;
Well, the user chooses it, we have to abide by their choice. (It's kept in the file metadata.)
3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up being written as byte N, for any value of N between 0 and 255).
In a couple field types, yes. Usually the binary data is numeric or date related and there is conversion going on there, too, to give me the bytes I need. [snip]
--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
That is backwards to what I've shown. Look at my earlier example again:
And you are not paying attention:

    '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
    \----------------------------------------/\----------------/
      a non-ascii compatible unicode string      to latin1 bytes

    ("срЃ".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 42, blob.decode('latin-1'))
     \-----------------------------------------------------/\-----------------/
       getting the actual bytes I need                        and back into unicode until I write them later

You did say to use a *text* template to manipulate my data, and then write it later, no? Well, this is what it would look like.
Bytes get DECODED to latin-1, not encoded.
Bytes -> text is *decoding*
Text -> bytes is *encoding*
Pretend for a moment I know that, and look at my examples again. I am demonstrating the contortions needed when my TEXTual data is not ASCII-compatible: It must be ENcoded using the appropriate codec to BYTES, then DEcoded back to unicode using latin1, all so later I can ENcode the bloomin' unicode data structure back to bytes using latin1 again. Dizzy yet? And you must know this, because it is what your bytify function does. Are you trolling? -- ~Ethan~

Changing the subject line to better describe what we're talking about. I hope it is of interest to others apart from Ethan and me -- mixed bytes and text is hard to get right. (And if I've got something wrong, I'd like to know about it.) On Sat, Jan 11, 2014 at 08:38:49PM -0800, Ethan Furman wrote:
On 01/11/2014 06:29 PM, Steven D'Aprano wrote: [...] Since you're talking to me, it would be nice if you addressed the same use-case I was addressing, which is mixed: ascii-encoded text, ascii-encoded numbers, ascii-encoded bools, binary-encoded numbers, and misc-encoded text.
I thought I had addressed it. But since your use-case is underspecified, please excuse me if I get some of it wrong.
And no, your example will not work with any text; it would completely mojibake my dbf files.
I don't think it will. Admittedly, I don't know all the ins and outs of your files, but as far as I can tell, nothing you have said so far suggests that my plan will fail. Code speaks louder than words: http://www.pearwood.info/ethan_demo.py

This code produces a string containing smuggled bytes. There is:

- a header containing raw bytes;
- metadata consisting of the name of some encoding in ASCII;
- a series of tagged fields. Each field has a name, which is always ASCII and terminated with a colon, followed by a single ASCII character and some data:

  * T for some arbitrary chunk of text, encoded in the metadata encoding, with a length byte prefix (that is, like a Pascal string);
  * F for a boolean flag, "true" or "false" in ASCII;
  * N for an integer, a C long;
  * D for an integer, in ASCII, terminated at the first non-digit;
  * B for a chunk of arbitrary bytes, with a two-byte length prefix.

And the whole thing is written out to a file, then read back in, without data corruption or mojibake. I wrote this about 1am this morning, so it may or may not be a shining example of idiomatic Python code, but it works and is readable. I understand that this won't match your actual use-case precisely, but I hope it contains the same sorts of mixed binary data and ASCII text that you're talking about. There are fixed-width fields, variable-length fields, binary fields, ASCII fields, non-ASCII text, and multiple encodings, all living in perfect harmony :-) And it runs unchanged under both Python 2.7 and 3.3.

As so often happens, what seems good in principle is less useful in practice. Once I actually started writing code, I quickly moved beyond the simple model

    template = "some text"
    data = template % ("text", 42, b'\x16foo'.decode('latin-1'))

that I thought would be easy, to a more structured approach. So I wrote reader and writer classes and abstracted away the messy bits, although in truth none of it is very messy. The worst is dealing with the 2 versus 3 differences, and even that requires only a handful of small helper functions.

I don't claim that the code I tossed together is the optimal design, or bug-free, or even that the exact same approach will work for your specific case. But it is enough to demonstrate that the basic idea is sound: you can process mixed text and bytes in a clean way, it doesn't generate mojibake, and it can operate in both 2.7 and 3.3 without even using a __future__ directive.
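To give a flavour of the approach without downloading the file, here is a rough sketch of just the writing side of such tagged fields, using the Latin-1 trick throughout; Python 3 only, and the names are illustrative, not necessarily the ones in ethan_demo.py:

    import struct

    def b2s(bs):
        # Smuggle arbitrary bytes into a str, one code point per byte.
        return bs.decode('latin-1')

    def t_field(name, text, encoding):
        # Text chunk in the metadata-specified encoding, with a
        # length byte prefix (assumes the encoded text is < 256 bytes).
        data = text.encode(encoding)
        return "%s:T%s" % (name, b2s(bytes([len(data)]) + data))

    def n_field(name, n):
        # Integer as a big-endian C long.
        return "%s:N%s" % (name, b2s(struct.pack(">l", n)))

    def b_field(name, blob):
        # Arbitrary bytes with a two-byte length prefix.
        return "%s:B%s" % (name, b2s(struct.pack(">H", len(blob)) + blob))

    record = "".join([
        t_field("Name", "срЃ", "cp1251"),
        n_field("Age", 42),
        b_field("Data", b"\x82\xE1\xC2\x00\x00\x7B\x00\xFF\xA8"),
    ])
    raw = record.encode('latin-1')  # ready to write to a file opened in "wb" mode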
Only the binary blobs need to be decoded. We don't need to encode the template to bytes, and the textual data doesn't get encoded until we're ready to send it across the wire or write it to disk.
No! When I have text, part of which gets ascii-encoded and part of which gets, say, cp1251 encoded, I cannot wait till the end!
I think we are talking about different textual data. It's a bit ambiguous, my apologies. You're talking about taking individual fields and deciding how to process them. I'm talking about doing your processing in the text domain, which means at the end of the process I have a Unicode string object rather than a bytes object. Before that str can be written to disk, it needs to be encoded.
And what if your name field has data not representable in latin-1?
--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
u'\u0441\u0440\u0403'
Where did you get those bytes from? You got them from somewhere.
For the sake of argument, pretend a user entered them in.
Who knows? Who cares? Once you have bytes, you can treat them as a blob of arbitrary bytes and write them to the record using the Latin-1 trick.
No, I can't. See above.
If you're reading those bytes from some stream that gives you bytes, you don't have to care where they came from.
You're kidding, right? If I don't know where they came from (a graphics field? a note field?) how am I going to know how to treat them?
As I understand it, you want the ability to store *arbitrary bytes* in the file, right? Here are nine arbitrary bytes: b'\x82\xE1\xC2\0\0\x7B\0\xFF\xA8' You don't need to know how I generated them, whether they are sound samples, data from a serial port, three RGB values, or some strange C struct. I need to know how to generate them, but you can treat them as an opaque blob. They're *already* bytes, you're not responsible for converting whatever the data was into bytes, because it's already done. It's just a blob of bytes as far as you're concerned. All you need to do is smuggle them into a text string.
But what if you don't start with bytes? If you start with a bunch of floats, you'll probably convert them to bytes using the struct module.
Yup, and I do.
If you start with non-ASCII text, you have to convert them to bytes too. No difference here.
Really?
Again, I fear I failed to explain myself in sufficient detail. If your non-ASCII text doesn't match the encoding specified, how else are you going to include it? See below.
You just said above that "it will work with any text data" -- you can't have it both ways.
I have been unclear, I apologise. Let me try again with an example. As the end-user, I get to specify the encoding, that's what you said. Okay, I specify ISO-8859-7, which is Greek. Now obviously if I hand you a bunch of Russian letters in a string, and you try to encode them using ISO-8859-7, you're going to get an exception. That's okay, as presumably I'm sensible enough to only include characters which exist in the encoding I choose, and if not, it's my own damn fault. But suppose I have a reason for this strange behaviour. If I pre-encode those Russian letters to bytes, using (say) UTF-16, then I can hand you the raw bytes to store as a binary blob. Later, I get the binary blob back again, and I can decode it using UTF-16, to get the original Russian text back again. So long as you don't mangle the binary blob, the process is completely reversible. That is what I am talking about.
You ask the user for their name, they answer "срЃ" which is given to you as a Unicode string, and you want to include it in your data record. The specifications of your file format aren't clear, so I'm going to assume that:
1) ASCII text is allowed "as-is" (that is, the name "George" will be in the final data file as b'George');
User data is not (typically) where the ASCII data is, but some of the metadata is definitely and always ASCII. The user text data needs to be encoded using whichever codec is specified by the file, which is only occasionally ASCII.
2) any other non-ASCII text will be encoded as some fixed encoding which we can choose to suit ourselves;
Well, the user chooses it, we have to abide by their choice. (It's kept in the file metadata.)
3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up being written as byte N, for any value of N between 0 and 255).
In a couple field types, yes. Usually the binary data is numeric or date related and there is conversion going on there, too, to give me the bytes I need.
The above all sounds reasonable. But the following does not -- I think it shows some fundamental confusion on your part.
[snip]
--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
That is backwards to what I've shown. Look at my earlier example again:
And you are not paying attention:
'\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
\----------------------------------------/\----------------/
  a non-ascii compatible unicode string      to latin1 bytes
You can't *decode* Unicode strings. Try it in Python 3, and it breaks:

    py> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'decode'

For your code to work, you can't be using Python 3; you have to be using Python 2, where "..." is already bytes, not Unicode. Since it's a byte string, there's no point in decoding it from UTF-8, then encoding it back to bytes. All you are doing is running the risk of UnicodeEncodeError:

    # Python 2.7 this time
    py> '\xd0\x94'.decode('utf-8').encode('latin-1')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0414' in position 0: ordinal not in range(256)

Latin-1 does not work with arbitrary *characters*, but it does work with arbitrary *bytes*. You're trying to take a UTF-8 encoded byte string, decode it back to arbitrary Unicode characters, then *encode* to Latin-1, which may fail. What I am doing is taking arbitrary *bytes*, then *decoding* with Latin-1 as a way of smuggling those bytes into a str.
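The asymmetry in one line each (Python 3 here):

    py> b'\xd0\x94'.decode('latin-1')  # any bytes decode
    'Ð\x94'
    py> 'Д'.encode('latin-1')          # but not any characters encode
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u0414' in position 0: ordinal not in range(256)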
("срЃ".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 42, blob.decode('latin-1'))
 \-----------------------------------------------------/\-----------------/
   getting the actual bytes I need                        and back into unicode until I write them later
In Python 3, that works, but I'm not sure if it does what you intend (I don't know what you intend). You have encode and decode the right way around this time, for Python 3 strings. In Python 2, the interpreter (wrongly) accepts "срЃ" as a byte-string literal, but the results are poorly defined. What you actually get (probably) depends on your environment. On my system, I seem to get UTF-8 encoded bytes, but that's not guaranteed.
You did say to use a *text* template to manipulate my data, and then write it later, no? Well, this is what it would look like.
If the text strings the user gives you are compatible with the encoding they specify, you don't need that. Just use:

    ("срЃ", 42, blob.decode('latin-1'))

It's the user's responsibility if they choose to specify an encoding which is more restrictive than the contents of some field. If they do that, they have to encode that field somehow, so they can treat it as a binary blob. *You* don't have to do this, and you certainly don't have to take perfectly good text and turn it into bytes then back to text just so you can insert it back into text. That would be silly.
Bytes get DECODED to latin-1, not encoded.
Bytes -> text is *decoding*
Text -> bytes is *encoding*
Pretend for a moment I know that, and look at my examples again.
Sorry to be harsh, but based on your swapping decode and encode around above in the examples above, I would have to pretend :-)
I am demonstrating the contortions needed when my TEXTual data is not ASCII-compatible: It must be ENcoded using the appropriate codec to BYTES, then DEcoded back to unicode using latin1, all so later I can ENcode the bloomin' unicode data structure back to bytes using latin1 again. Dizzy yet?
No. If I, the end user, insist on using a stupid legacy encoding, then *YES* absolutely of course I have to jump through hoops to store arbitrary Unicode characters using a legacy encoding that only supports a tiny subset of Unicode. This should not surprise you.
And you must know this, because it is what your bytify function does. Are you trolling?
No. -- Steven

On Mon, Jan 13, 2014 at 01:03:15PM +1100, Steven D'Aprano wrote:
code speaks louder than words: http://www.pearwood.info/ethan_demo.py
[...] Ethan refers to code like:

    template % ("срЃ".encode('cp1251').decode('latin-1'), 42, blob.decode('latin-1'))
You did say to use a *text* template to manipulate my data, and then write it later, no? Well, this is what it would look like.
If the text strings the user gives you are compatible with the encoding they specify, you don't need that. Just use:
("срЃ", 42, blob.decode('latin-1'))
It's the user's responsibility if they choose to specify an encoding which is more restrictive than the contents of some field. If they do that, they have to encode that field somehow, so they can treat it as a binary blob. *You* don't have to do this, and you certainly don't have to take perfectly good text and turn it into bytes then back to text just so you can insert it back into text. That would be silly.
It occurs to me that I do exactly that in my demo code :-) In my defence, it was 1am when I wrote it, and I am a little unclear about Ethan's use-case: whether the entire file is supposed to be compatible with the cp1251 encoding (the example that he gives), or just individual fields in it. If I understood the requirements better, my code would probably be able to avoid some of those encodes/decodes, or I might even decide that working in the text domain is a mistake and instead we should look to smuggle text into bytes rather than the other way around. Regardless of which way you go, I'm not seeing that mixed bytes and text should be a reason to hold off migrating from 2 to 3. Which is where this discussion started days and days ago. -- Steven

On 01/12/2014 06:21 PM, Steven D'Aprano wrote:
On Mon, Jan 13, 2014 at 01:03:15PM +1100, Steven D'Aprano wrote:
code speaks louder than words: http://www.pearwood.info/ethan_demo.py
[...]
Ethan refers to code like:
template % ("срЃ".encode('cp1251').decode('latin-1'), 42, blob.decode('latin-1'))
It occurs to me that I do exactly that in my demo code :-)
Well, at least you see the point I was trying to make, even if you don't agree. :) I apologize again for my typos that made it look like I had no idea what I was talking about. ;) -- ~Ethan~

On 01/12/2014 06:03 PM, Steven D'Aprano wrote:
The above all sounds reasonable. But the following does not -- I think it shows some fundamental confusion on your part.
My apologies. The '\xd1.....' was a bytestring, I forgot to type the b. (I know, I know, I should've copied and pasted :( ) -- ~Ethan~

On 2014-01-11 05:36, Steven D'Aprano wrote: [snip]
Latin-1 has the nice property that every byte decodes into the character with the same code point, and vice versa. So:
for i in range(256):
    assert bytes([i]).decode('latin-1') == chr(i)
    assert chr(i).encode('latin-1') == bytes([i])
passes. It seems to me that your problem goes away if you use Unicode text with embedded binary data, rather than binary data with embedded ASCII text. Then when writing the file to disk, of course you encode it to Latin-1, either explicitly:
pdf = ...  # Unicode string containing the PDF contents
with open("outfile.pdf", "wb") as f:
    f.write(pdf.encode("latin-1"))
or implicitly:
with open("outfile.pdf", "w", encoding="latin-1") as f: f.write(pdf)
[snip] The second example won't work because you're forgetting about the handling of line endings in text mode. Suppose you have some binary data bytes([10]). You convert it into a Unicode string using Latin-1, giving '\n'. You write it out to a file opened in text mode. On Windows, that string '\n' will be written to the file as b'\r\n'.

MRAB writes:
with open("outfile.pdf", "w", encoding="latin-1") as f: f.write(pdf)
[snip] The second example won't work because you're forgetting about the handling of line endings in text mode.
Not so fast! Forgot, yes (me too!), but not work? Not quite:

    with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
        f.write(pdf)

should do the trick.

On 01/11/2014 11:49 AM, Stephen J. Turnbull wrote:
MRAB writes:
with open("outfile.pdf", "w", encoding="latin-1") as f: f.write(pdf)
[snip] The second example won't work because you're forgetting about the handling of line endings in text mode.
Not so fast! Forgot, yes (me too!), but not work? Not quite:
with open("outfile.pdf", "w", encoding="latin-1", newline="") as f: f.write(pdf)
should do the trick.
Well, it's good that there is a workaround. Are we going to have a document listing all the workarounds needed to program in a bytes-oriented style using unicode? -- ~Ethan~

On Sat, 11 Jan 2014 11:54:26 -0800, Ethan Furman <ethan@stoneleaf.us> wrote:
On 01/11/2014 11:49 AM, Stephen J. Turnbull wrote:
MRAB writes:
with open("outfile.pdf", "w", encoding="latin-1") as f: f.write(pdf)
[snip] The second example won't work because you're forgetting about the handling of line endings in text mode.
Not so fast! Forgot, yes (me too!), but not work? Not quite:
with open("outfile.pdf", "w", encoding="latin-1", newline="") as f: f.write(pdf)
should do the trick.
Well, it's good that there is a workaround. Are we going to have a document listing all the workarounds needed to program in a bytes-oriented style using unicode?
That's not a work-around (if you are talking specifically about the newline=""). That's just the way the python3 IO library works. If you want to preserve the newlines in your data, but still have the text-io machinery count them for deciding when to trigger io/buffering behavior, you use newline=''. It's not the most intuitive API, so I won't be surprised if a lot of people don't know about it or get confused by it when they see it. I first learned about it in the context of csv files, another one of those legacy file protocols that are mostly-text-but-not-entirely. --David
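The csv module's documentation makes the same point: a file handed to csv.writer should be opened with newline='' so that the \r\n line terminators the writer emits are not translated a second time. For example:

    import csv

    # Open with newline='' so the writer's own line endings survive intact.
    with open("out.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "age"])
        writer.writerow(["George", 42])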

On Sat, Jan 11, 2014 at 07:22:30PM +0000, MRAB wrote:
with open("outfile.pdf", "w", encoding="latin-1") as f: f.write(pdf)
[snip] The second example won't work because you're forgetting about the handling of line endings in text mode.
So I did! Thank you for the correction. -- Steven
participants (16):
- Antoine Pitrou
- Cameron Simpson
- Chris Angelico
- Chris Barker
- Eric V. Smith
- Ethan Furman
- Georg Brandl
- INADA Naoki
- Juraj Sukop
- Mariano Reingart
- MRAB
- Nick Coghlan
- R. David Murray
- Stephen J. Turnbull
- Steven D'Aprano
- Victor Stinner