[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

Mon Jan 13 00:43:55 CET 2014

On Mon, Jan 13, 2014 at 07:31:16AM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
> 
>  > then the name is horribly misleading, and it is best handled like this:
>  > 
>  >     content = '\n'.join([
>  >         'header',
>  >         'part 2 %.3f' % number,
>  >         binary_image_data.decode('latin-1'),
>  >         utf16_string,  # Misleading name, actually Unicode string
>  >         'trailer'])
> 
> This loses bigtime, as any encoding that can handle non-latin1 in
> utf16_string will corrupt binary_image_data.  OTOH, latin1 will raise
> on non-latin1 characters.  utf16_string must be encoded appropriately
> then decoded by latin1 to be reencoded by latin1 on output.

Of course you're right, but I understood the above to be a sketch, 
not real code. (E.g. does "header" really mean the literal 
string "header", or does it stand in for something which is a header?) 
In real code, one would need to have some way of telling where the 
binary image data ends and the Unicode string begins.
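
To make the quoted objection concrete, here is a minimal sketch. The 
sample values are made up for illustration (they are not from the 
original example), and the join is cut down to the parts that matter:

```python
# Hypothetical sample data, chosen only to trigger both failure modes.
binary_image_data = bytes([0x89, 0x50, 0xFF])   # contains bytes > 127
utf16_string = 'snowman \u2603'                 # str with a non-Latin-1 character

content = '\n'.join([
    'header',
    binary_image_data.decode('latin-1'),
    utf16_string,
    'trailer'])

# Latin-1 cannot represent U+2603, so encoding the whole string raises.
try:
    content.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 accepts everything, but the smuggled bytes > 127 are no longer
# present verbatim in the output.
utf8_out = content.encode('utf-8')
image_survives = binary_image_data in utf8_out

print(latin1_ok, image_survives)
```

Neither encoding both succeeds and leaves the original image bytes 
intact in the encoded output.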

If I have misunderstood the situation, then my apologies for compounding 
the error.

[...]
>  > Both examples assume that you intend to do further processing of content 
>  > before sending it, and will encode just before sending:
>  > 
>  >     content.encode('utf-8')
>  > 
>  > (Don't use Latin-1, since it cannot handle the full range of text 
>  > characters.)
> 
> This corrupts binary_image_data.  Each byte > 127 will be replaced by
> two bytes.

And reading it back using decode('utf-8') will replace those two bytes 
with a single character, which encode('latin-1') maps back to the 
original byte, round-tripping exactly.
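
A minimal sketch of that round trip, assuming the data was smuggled in 
with decode('latin-1') as in the earlier example. The sample value for 
binary_image_data is a stand-in, not real image data:

```python
# Stand-in for binary_image_data: every possible byte value once.
binary_image_data = bytes(range(256))

# Smuggle the bytes into a str; Latin-1 maps byte n to code point n.
text = binary_image_data.decode('latin-1')

# Encode for output. Each code point above 127 becomes two bytes.
wire = text.encode('utf-8')

# Read it back as text, then undo the Latin-1 smuggling step.
recovered = wire.decode('utf-8').encode('latin-1')

print(len(binary_image_data), len(wire), recovered == binary_image_data)
```

The round trip only works if the reader reverses both steps; the 
encoded form is longer than the original data and is not the same 
sequence of bytes.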

Of course if you encode to UTF-8 and then try to read the binary data as 
raw bytes, you'll get corrupted data. But do people expect to do this? 
That's a genuine question -- again, I assumed (apparently wrongly) that 
the idea was to write the content out as *text* containing smuggled 
bytes, and read it back the same way.

> In the second case, you can use latin1 to encode, if it
> gives you what you want.
> 
> This kind of subtlety is precisely why MAL warned about use of latin1
> to smuggle bytes.

How would you smuggle a chunk of arbitrary bytes into a text string? 
Short of doing something like uuencoding it into ASCII, or equivalent.
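
For comparison, here is a sketch of the "or equivalent" route, using 
base64 from the standard library rather than uuencode. The sample data 
and the line-based layout are assumptions made for illustration only:

```python
import base64

# Stand-in for binary_image_data: every possible byte value once.
binary_image_data = bytes(range(256))

# Armor the bytes as pure ASCII text (no newlines in b64encode output).
armored = base64.b64encode(binary_image_data).decode('ascii')

# The armored data can now sit next to arbitrary Unicode text.
content = '\n'.join(['header', armored, 'snowman \u2603', 'trailer'])

# Any encoding that covers the text will do; the ASCII part is unaffected.
wire = content.encode('utf-8')

# Reading back: decode as text, pick out the armored line, undo base64.
lines = wire.decode('utf-8').split('\n')
recovered = base64.b64decode(lines[1])

print(recovered == binary_image_data)
```

The cost is size: base64 turns every three bytes into four ASCII 
characters.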

-- 
Steven