[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
steve at pearwood.info
Mon Jan 13 00:43:55 CET 2014
On Mon, Jan 13, 2014 at 07:31:16AM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
> > then the name is horribly misleading, and it is best handled like this:
> > content = '\n'.join([
> > 'header',
> > 'part 2 %.3f' % number,
> > binary_image_data.decode('latin-1'),
> > utf16_string, # Misleading name, actually Unicode string
> > 'trailer'])
> This loses bigtime, as any encoding that can handle non-latin1 in
> utf16_string will corrupt binary_image_data. OTOH, latin1 will raise
> on non-latin1 characters. utf16_string must be encoded appropriately
> then decoded by latin1 to be reencoded by latin1 on output.
Of course you're right, but I have understood the above as being a
sketch and not real code. (E.g. does "header" really mean the literal
string "header", or does it stand in for something which is a header?)
In real code, one would need to have some way of telling where the
binary image data ends and the Unicode string begins.
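To make Stephen's objection concrete, here is a minimal sketch, not real
code from either of us. The sample bytes, the snowman string and the
choice of UTF-16-LE are assumptions made purely for illustration. It
shows the naive join failing under Latin-1, and then the
encode-then-decode-as-Latin-1 step he describes.

```python
# Hypothetical sample data, chosen only for illustration.
binary_image_data = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF])
utf16_string = "snowman \u2603"  # really a Unicode string, not UTF-16 bytes

# Naive version: smuggle the bytes as Latin-1 text, leave the text alone.
naive = "\n".join([
    "header",
    binary_image_data.decode("latin-1"),
    utf16_string,
    "trailer",
])
try:
    naive.encode("latin-1")
    naive_encodes = True
except UnicodeEncodeError:
    # U+2603 has no Latin-1 representation, so the encode raises.
    naive_encodes = False

# Stephen's version: encode the text first (UTF-16-LE is an assumption
# here), then smuggle those bytes through Latin-1 as well.
text_as_bytes = utf16_string.encode("utf-16-le")
smuggled = "\n".join([
    "header",
    binary_image_data.decode("latin-1"),
    text_as_bytes.decode("latin-1"),
    "trailer",
])
output = smuggled.encode("latin-1")
```

Note that the newline separator is still ambiguous, since either payload
may itself contain the byte 0x0A. That is the "where does the image end"
problem mentioned above.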
If I have misunderstood the situation, then my apologies for compounding
> > Both examples assume that you intend to do further processing of content
> > before sending it, and will encode just before sending:
> > content.encode('utf-8')
> > (Don't use Latin-1, since it cannot handle the full range of text
> > characters.)
> This corrupts binary_image_data. Each byte > 127 will be replaced by
> two bytes.
And reading it back using decode('utf-8') will replace those two bytes
with a single character, which encode('latin-1') then turns back into
the original byte, round-tripping exactly.
Of course if you encode to UTF-8 and then try to read the binary data as
raw bytes, you'll get corrupted data. But do people expect to do this?
That's a genuine question -- again, I assumed (apparently wrongly) that
the idea was to write the content out as *text* containing smuggled
bytes, and read it back the same way.
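A short sketch of both halves of that claim, assuming the payload is
every possible byte value (an arbitrary test input, not anything from the
original example):

```python
# Arbitrary test payload: one of each byte value.
binary_image_data = bytes(range(256))

# Smuggle the bytes into text via Latin-1, then write the text as UTF-8.
text = binary_image_data.decode("latin-1")
wire = text.encode("utf-8")

# Bytes 0-127 stay one byte each; bytes 128-255 become two bytes each.
wire_length = len(wire)

# Reading the wire data as raw bytes does not give the payload back...
raw_matches = (wire == binary_image_data)

# ...but reading it back the same way it was written does.
recovered = wire.decode("utf-8").encode("latin-1")
```

So the round trip is exact only if the reader undoes both steps; a
consumer that expects the raw image bytes on the wire sees the expanded
form instead.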
> In the second case, you can use latin1 to encode, if it
> gives you what you want.
> This kind of subtlety is precisely why MAL warned about use of latin1
> to smuggle bytes.
How would you smuggle a chunk of arbitrary bytes into a text string?
Short of doing something like uuencoding it into ASCII, or equivalent.
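For comparison, the ASCII-armouring route might look like the sketch
below. It uses base64 from the standard library rather than uuencode,
which is a substitution made here for illustration; the payload and the
newline-separated layout are likewise assumptions.

```python
import base64

# Arbitrary test payload: one of each byte value.
binary_image_data = bytes(range(256))

# Armour the bytes as pure-ASCII text; b64encode emits no newlines.
armoured = base64.b64encode(binary_image_data).decode("ascii")

content = "\n".join(["header", armoured, "trailer"])

# Any ASCII-compatible text encoding is now safe for the whole message.
wire = content.encode("utf-8")

# The reader splits on the separator and reverses the armouring.
header, payload, trailer = wire.decode("utf-8").split("\n")
recovered = base64.b64decode(payload)
```

The cost is size: the armoured payload is roughly a third larger than
the raw bytes, but the separator can no longer collide with the data.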