[Python-Dev] Smuggling bytes into text (was Re: RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5)

Steven D'Aprano steve at pearwood.info
Mon Jan 13 03:21:25 CET 2014


On Mon, Jan 13, 2014 at 01:03:15PM +1100, Steven D'Aprano wrote:

> code speaks louder than words: http://www.pearwood.info/ethan_demo.py

[...]

Ethan refers to code like:

template % ("срЃ".encode('cp1251').decode('latin-1'), 42, blob.decode('latin-1'))

> > You did say to use a *text* template to manipulate my data, and then write 
> > it later, no?  Well, this is what it would look like.
> 
> If the text strings the user gives you are compatible with the 
> encoding they specify, you don't need that. Just use:
> 
> ("срЃ", 42, blob.decode('latin-1'))
> 
> It's the user's responsibility if they choose to specify an encoding 
> which is more restrictive than the contents of some field. If they do 
> that, they have to encode that field somehow, so they can treat it as a 
> binary blob. *You* don't have to do this, and you certainly don't have 
> to take perfectly good text and turn it into bytes then back to text 
> just so you can insert it back into text. That would be silly.

It occurs to me that I do exactly that in my demo code :-)

In my defence, it was 1am when I wrote it, and I am a little unclear 
about Ethan's use-case: whether the entire file is supposed to be 
compatible with the cp1251 encoding (the example that he gives), or just 
individual fields in it. If I understood the requirements better, my 
code would probably be able to avoid some of those encodes/decodes, or I 
might even decide that working in the text domain is a mistake and that 
we should instead look to smuggle text into bytes rather than the other 
way around.
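
For concreteness, here is a minimal sketch of the latin-1 round trip 
being discussed. The template, the field separator and the blob 
contents are made up for illustration and are not taken from Ethan's 
code or from the demo script:

```python
# Hypothetical three-field text template and sample data (assumptions).
template = "%s|%d|%s"
name = "срЃ"                       # text that fits in cp1251
blob = bytes([0x00, 0x81, 0xFF])   # arbitrary binary data

# Smuggle bytes into text: latin-1 maps each byte 0-255 to the code
# point with the same number, so decoding never fails and loses nothing.
record = template % (
    name.encode("cp1251").decode("latin-1"),
    42,
    blob.decode("latin-1"),
)

# Encoding the finished text with latin-1 recovers the original bytes.
raw = record.encode("latin-1")
assert raw == name.encode("cp1251") + b"|42|" + blob
```

The catch is that the final encode has to be latin-1 for the blob to 
survive, which is why the text field is pre-encoded to cp1251 here. 
Writing `record` out with cp1251 directly would fail on the smuggled 
blob characters.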

Regardless of which way you go, I'm not seeing that mixed bytes and text 
should be a reason to hold off migrating from 2 to 3. Which is where 
this discussion started days and days ago.

-- 
Steven

