[Python-Dev] Patch making the current email package (mostly) support bytes

Mon Oct 4 19:38:35 CEST 2010

On Mon, 04 Oct 2010 12:32:26 -0400, Scott Dial <scott+python-dev at scottdial.com> wrote:
> On 10/2/2010 7:00 PM, R. David Murray wrote:
> > The clever hack (thanks ultimately to Martin) is to accept 8bit data
> > by decoding it using the ASCII codec and the surrogateescape error
> > handler.
> 
> I've seen this idea pop up in a number of threads. I worry that you are
> all inventing a new kind of dual that is a direct parallel to Python 2.x
> strings.

Yes, that is exactly my worry.

> That is to say,
> 
> 3.x>>> b = b'\xc2\xa1'
> 3.x>>> s = b.decode('utf8')
> 3.x>>> v = b.decode('ascii', 'surrogateescape')
> 
> , where s and v should be the same "thing" in 3.x but they are not due
> to an encoding trick.

Why "should" they be the same thing in 3.x?  One is an ASCII string with
some escaped bytes in an unknown encoding; the other is a valid unicode
string.  The surrogateescape trick is used only when we don't *know*
the encoding (a priori) of the bytes in question.
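
To make the distinction concrete, here is a minimal sketch using the same
two bytes as the quoted example (the names b, s and v follow the quote;
nothing in it is specific to the email package):

```python
b = b'\xc2\xa1'

# Known encoding: the two bytes decode to one real character (U+00A1).
s = b.decode('utf8')

# Unknown encoding: each non-ASCII byte becomes a lone low surrogate
# (0xDC00 + byte value), so nothing is interpreted and nothing is lost.
v = b.decode('ascii', 'surrogateescape')

print(ascii(s))   # '\xa1'
print(ascii(v))   # '\udcc2\udca1'
print(s == v)     # False
```

So v carries the raw bytes along without claiming to know what they mean,
while s is the result of actually interpreting them.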

> I believe this trick generates more-or-less the same issues as strings
> did in 2.x:
> 
> 2.x>>> b = '\xc2\xa1'
> 2.x>>> s = b.decode('utf8')
> 2.x>>> v = b

The difference is that in 2.x people could and would operate on strings as
if they knew the encoding, and get in trouble.  In 3.x you can't do that.
If you've got escaped bytes you *know* that you don't know the encoding,
and the program can't get around that except by re-encoding to bytes
and properly decoding them.
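
A rough sketch of that round trip, assuming for illustration that the
encoding later turns out to be UTF-8 (the variable names are made up):

```python
v = b'\xc2\xa1'.decode('ascii', 'surrogateescape')

# Recover the original bytes exactly; this is the only safe thing to do
# with the escaped portion while the encoding is still unknown.
raw = v.encode('ascii', 'surrogateescape')

# Once the encoding is known (assumed here to be UTF-8), decode properly.
text = raw.decode('utf-8')

print(raw)          # b'\xc2\xa1'
print(ascii(text))  # '\xa1'
```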

> Any reasonable 2.x code has to guard on str/unicode and it would seem in
> 3.x, if this idiom spreads, reasonable code will have to guard on
> surrogate escapes (which actually seems like a more expensive test). As in,
> 
> 3.x>>> print(v)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in
> position 0: surrogates not allowed

Right, I mentioned that concern in my post.

In this case at least, however, the *goal* is that the surrogates are
never seen outside the email internals.  Reflecting this, my latest
thought is that I should add a 'message_from_binary_file' helper function
and a 'feedbytes' method to feedparser, making the surrogates a 100%
internal implementation detail[*].  Only if the email package contains a
coding error would the surrogates escape and cause problems for user
code.
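
For illustration only, this is roughly what that boundary could look like
from user code: bytes go in, bytes or properly decoded text come out.
The sketch uses email.message_from_bytes as found in the standard library
from Python 3.2 on, which is not necessarily the exact API of the patch
under discussion, and the sample message is invented:

```python
import email

# An invented message with an 8bit body; the bytes are UTF-8 for U+00A1 + "hola".
raw = (b"From: a@example.com\n"
       b"Content-Type: text/plain; charset=utf-8\n"
       b"Content-Transfer-Encoding: 8bit\n"
       b"\n"
       b"\xc2\xa1hola\n")

msg = email.message_from_bytes(raw)

# decode=True returns the body as bytes, with no surrogates visible.
payload = msg.get_payload(decode=True)

# The declared charset tells us how to decode those bytes properly.
text = payload.decode(msg.get_content_charset())
```

User code here never touches an escaped string; whatever the package does
internally to carry the 8bit data stays internal.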

> It seems like this hack is about making the 3.x unicode type more like
> the 2.x string type, and I thought we decided that was a bad idea. How
> will developers not have to ask themselves whether a given string is a
> "real" string or a byte sequence masquerading as a string? Am I missing
> something here?

I think this question is something that needs to be considered any
time using surrogates is proposed.  I hope that in the email package
proposal I've addressed it.  What do you think?

--David

[*] And you are right that there is a performance concern as a result
of needing to detect surrogates at various points in the code.
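
One possible form of such a check, as a sketch: has_surrogate_escapes is
a hypothetical helper name, not something taken from the patch, and it
relies on the fact that lone surrogates cannot be encoded as UTF-8.

```python
def has_surrogate_escapes(s):
    """Return True if s contains lone surrogates (hypothetical helper)."""
    try:
        # Strict UTF-8 encoding rejects lone surrogates such as '\udcc2'.
        s.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False


escaped = b'\xc2\xa1'.decode('ascii', 'surrogateescape')
print(has_surrogate_escapes(escaped))    # True
print(has_surrogate_escapes('\xa1'))     # False
```

The check has to scan the whole string, which is where the cost comes
from.  Note that it flags any lone surrogate, not only those produced by
surrogateescape.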