[Python-Dev] Patch making the current email package (mostly) support bytes

Mon Oct 4 18:32:26 CEST 2010

On 10/2/2010 7:00 PM, R. David Murray wrote:
> The clever hack (thanks ultimately to Martin) is to accept 8bit data
> by encoding it using the ASCII codec and the surrogateescape error
> handler.

I've seen this idea pop up in a number of threads. I worry that you are
all inventing a new kind of dual-natured string that directly parallels
Python 2.x strings. That is to say,

3.x>>> b = b'\xc2\xa1'
3.x>>> s = b.decode('utf8')
3.x>>> v = b.decode('ascii', 'surrogateescape')

Here s and v should be the same "thing" in 3.x, but they are not, because
of an encoding trick. I believe this trick generates more or less the
same issues as strings did in 2.x:

2.x>>> b = '\xc2\xa1'
2.x>>> s = b.decode('utf8')
2.x>>> v = b

Any reasonable 2.x code has to guard on str/unicode, and it would seem
that in 3.x, if this idiom spreads, reasonable code will have to guard on
surrogate escapes (which actually seems like a more expensive test, since
it means scanning the string rather than checking a type). As in,

3.x>>> print(v)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in position 0: surrogates not allowed
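
To make the cost of such a guard concrete, here is a minimal sketch of
what it might look like in 3.x. The helper name has_surrogate_escapes is
hypothetical (it is not a stdlib function), and the check assumes that
only the U+DC80..U+DCFF range produced by surrogateescape matters:

```python
def has_surrogate_escapes(text):
    """Return True if text holds any code point in U+DC80..U+DCFF.

    That range is what the surrogateescape error handler emits for
    undecodable bytes 0x80..0xFF. (Hypothetical helper, for illustration.)
    """
    return any('\udc80' <= ch <= '\udcff' for ch in text)


b = b'\xc2\xa1'
s = b.decode('utf8')                      # one real character, U+00A1
v = b.decode('ascii', 'surrogateescape')  # two escaped bytes

# Same bytes in, two different str values out.
assert s != v
assert len(s) == 1 and len(v) == 2

# The guard is a scan over the whole string, not a type check.
assert not has_surrogate_escapes(s)
assert has_surrogate_escapes(v)

# v still round-trips to the original bytes, which is the point of the hack.
assert v.encode('ascii', 'surrogateescape') == b
```

Note that isinstance() cannot tell s and v apart, since both are str; the
only way to distinguish them is to inspect the contents.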

It seems like this hack is about making the 3.x unicode type more like
the 2.x string type, and I thought we decided that was a bad idea. How
will developers avoid having to ask themselves whether a given string is
a "real" string or a byte sequence masquerading as a string? Am I missing
something here?

-- 
Scott Dial
scott at scottdial.com
scodial at cs.indiana.edu