[Python-Dev] email package status in 3.X

Toshio Kuratomi a.badger at gmail.com
Mon Jun 21 23:41:19 CEST 2010


On Mon, Jun 21, 2010 at 04:09:52PM -0400, P.J. Eby wrote:
> At 03:29 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
> >On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
> >> At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
> >> >What do you think of making the encoding attribute a mandatory part of
> >> >creating an ebyte object?  (ex: ``eb = ebytes(b, 'euc-jp')``).
> >>
> >> As long as the coercion rules force str+ebytes (or str % ebytes,
> >> ebytes % str, etc.) to result in another ebytes (and fail if the str
> >> can't be encoded in the ebytes' encoding), I'm personally fine with
> >> it, although I really like the idea of tacking the encoding to bytes
> >> objects in the first place.
> >>
> >I wouldn't like this.  It brings us back to the python2 problem where
> >sometimes you pass an ebyte into a function and it works and other times you
> >pass an ebyte into the function and it issues a traceback.
> 
> For stdlib functions, this isn't going to happen unless your ebytes'
> encoding is not compatible with the ascii subset of unicode, or the
> stdlib function is working with dynamic data...  in which case you
> really *do* want to fail early!
> 
The ebytes encoding will often be incompatible with the ascii subset.
It's the reason that people were so often tempted to change the
defaultencoding on python2 to utf8.
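That incompatibility is easy to demonstrate in Python 3 (a small illustration; euc-jp is just one example of a non-ascii-compatible encoding):

```python
# Bytes in a CJK encoding such as euc-jp are generally not valid ascii,
# which is why python2's implicit ascii-based coercion blew up on them
# (and why people reached for defaultencoding = utf8).
data = "日本".encode("euc-jp")    # not ascii-decodable

try:
    data.decode("ascii")
    ascii_compatible = True
except UnicodeDecodeError:
    ascii_compatible = False

print(ascii_compatible)   # False
```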

> I don't see this as a repeat of the 2.x situation; rather, it allows
> you to cause errors to happen much *earlier* than they would
> otherwise show up if you were using unicode for your encoded-bytes
> data.
> 
> For example, if your program's intent is to end up with latin-1
> output, then it would be better for an error to show up at the very
> *first* point where non-latin1 characters are mixed with your data,
> rather than only showing up at the output boundary!
> 
That depends heavily on your usage.  If you're formatting a comment on a web
page, checking at output and replacing with '?' is better than a traceback.
If you're entering key values into a database, then you likely want to know
where the non-latin1 data is entering your program, not where it's mixed
with your data or the output boundary.
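The two policies can be compared using the stdlib codec error handlers (a sketch; the comment text is made up):

```python
# Replacing unencodable characters at the output boundary instead of
# raising: latin-1 cannot represent 'λ', so errors='replace' yields '?'.
comment = "naïve λ-calculus"   # 'ï' fits latin-1, 'λ' does not

strict_fails = False
try:
    comment.encode("latin-1")          # strict mode raises
except UnicodeEncodeError:
    strict_fails = True

safe = comment.encode("latin-1", errors="replace")
print(strict_fails)  # True
print(safe)          # b'na\xefve ?-calculus'
```

Which of the two you want is exactly the usage question above: replacement is fine for display, but it silently corrupts key values.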

> However, if you promoted mixed-type operation results to unicode
> instead of ebytes, then you:
> 
> 1) can't preserve data that doesn't have a 1:1 mapping to unicode, and
> 
ebytes should be immutable like bytes and str.  So you shouldn't lose the
data if you keep a reference to it.

> 2) can't detect an error until your data reaches the output point in
> your application -- forcing you to defensively insert ebytes calls
> everywhere (vs. simply wrapping them around a handful of designated
> inputs), or else have to go right back to tracing down where the
> unusable data showed up in the first place.
> 
Usually, you don't want to know where you are combining two incompatible
strings.  Instead, you want to know where the incompatible strings are being
set in the first place.  If function(a, b) raises a traceback for certain
combinations of a and b, I need to know where a and b are being set, not
where function(a, b) is in the source code.  So you need to be making input
values ebytes() (or str in current python3) no matter what.
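That boundary conversion might look like the following sketch (read_field is a hypothetical helper, not part of any proposal):

```python
def read_field(raw, encoding):
    """Decode incoming bytes immediately; a failure here points straight
    at the input that is wrong, not at some later str + bytes mix."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)   # fails early if raw doesn't match
    return raw                        # already text

wire = "日本語".encode("euc-jp")      # pretend this arrived from outside
name = read_field(wire, "euc-jp")
print(name)   # 日本語
```

If the wrong encoding is claimed, the traceback lands in read_field, right where the value enters the program.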

> One thing that seems like a bit of a blind spot for some folks is
> that having unicode is *not* everybody's goal.  Not because we don't
> believe unicode is generally a good thing or anything like that, but
> because we have to work with systems that flat out don't *do*
> unicode, thereby making the presence of (fully-general) unicode an
> error condition that has to be stamped out!
> 
I think that way sometimes as well.  However, here I think you're in a bit
of a blind spot yourself.  I'm saying that making ebytes + str coerce to
ebytes will only yield a traceback some of the time, which is the python2
behaviour.  Having ebytes + str coerce to str will never throw a traceback
as long as our implementation checks that the bytes and encoding work
together from the start.
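A minimal sketch of what that coerce-to-str behaviour could look like (the class name and methods are illustrative only; ebytes is a hypothetical type under discussion, not an implementation proposal):

```python
# Because the constructor verifies that the bytes decode in the stated
# encoding, mixing with str later can never raise: it either fails at
# construction time or not at all.
class ebytes(bytes):
    def __new__(cls, data, encoding):
        data.decode(encoding)             # validate up front
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):        # ebytes + str coerces to str
            return self.decode(self.encoding) + other
        return ebytes(bytes(self) + bytes(other), self.encoding)

    def __radd__(self, other):            # str + ebytes also coerces to str
        return other + self.decode(self.encoding)

eb = ebytes("日本".encode("euc-jp"), "euc-jp")
print(eb + " desu")    # 日本 desu -- always succeeds
```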

Throwing an error only on some inputs is one of the main reasons that
debugging unicode vs byte issues sucks on python2.  On my box, with my
dataset, everything works.  Toss it up on PyPI and suddenly I have a user in
Japan who reports that he gets a traceback with his dataset that he can't
give to me because it's proprietary, overly large, or transient.



> IOW, if you're producing output that has to go into another system
> that doesn't take unicode, it doesn't matter how
> theoretically-correct it would be for your app to process the data in
> unicode form.  In that case, unicode is not a feature: it's a bug.
> 
This is not always true.  Say you read a webpage, chop it up so you get
a list of words, create a histogram of word lengths, and then write the
output as utf8 to a database.  Should you do all your intermediate string
operations on utf8-encoded byte strings?  No, you should do them on unicode
strings, as otherwise you need to know the details of how utf8 encodes
characters.
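A small illustration of why (the word list is made up):

```python
# len() of utf8 bytes counts bytes, not characters, so a word-length
# histogram computed on encoded bytes gives the wrong answer for any
# non-ascii word.
word = "日本語"
print(len(word))                    # 3 characters
print(len(word.encode("utf-8")))    # 9 utf8 bytes

# Over decoded text, the histogram is what you actually meant:
words = ["cat", "日本語", "naïve"]
hist = {}
for w in words:
    hist[len(w)] = hist.get(len(w), 0) + 1
print(hist)   # {3: 2, 5: 1}
```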

> And as it really *is* an error in that case, it should not pass
> silently, unless explicitly silenced.
> 
This is very true -- although the python3 stdlib does explicitly silence
errors related to unicode in some cases.

Anyhow -- IMHO, you should get a TypeError when you attempt to pass
a unicode value into a function that is meant to work with bytes.  (You can
accept an ebytes object as well since it has a known bytes representation).
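For plain bytes, python3 already rejects the mix; a bytes-oriented function can make the same check explicit at its boundary (write_record and its framing are hypothetical, purely for illustration):

```python
def write_record(payload):
    """Hypothetical bytes-only API: refuse str explicitly at the boundary
    rather than letting it slip through and fail somewhere downstream."""
    if isinstance(payload, str):
        raise TypeError("write_record() needs bytes, not str")
    return b"REC:" + bytes(payload)

print(write_record(b"\x01\x02"))   # b'REC:\x01\x02'
```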

-Toshio