On Mon, Jun 21, 2010 at 04:09:52PM -0400, P.J. Eby wrote:
At 03:29 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
What do you think of making the encoding attribute a mandatory part of creating an ebytes object? (ex: ``eb = ebytes(b, 'euc-jp')``).
As long as the coercion rules force str+ebytes (or str % ebytes, ebytes % str, etc.) to result in another ebytes (and fail if the str can't be encoded in the ebytes' encoding), I'm personally fine with it, although I really like the idea of tacking the encoding to bytes objects in the first place.
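The coercion rule described here can be sketched roughly as follows. This is only an illustration under stated assumptions: ``ebytes`` is the type being proposed in this thread, not an existing Python type, and the class body below (a ``bytes`` subclass with an ``encoding`` attribute) is a hypothetical shape for it, not an agreed design.

```python
class ebytes(bytes):
    """Hypothetical sketch of the proposed type: bytes plus a mandatory encoding."""

    def __new__(cls, data, encoding):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def _coerce(self, other):
        # str operands are encoded into *our* encoding; this is the point
        # where an unrepresentable character raises UnicodeEncodeError.
        if isinstance(other, str):
            return other.encode(self.encoding)
        if isinstance(other, bytes):
            # Mixing two ebytes with different encodings is not handled here.
            return bytes(other)
        return None

    def __add__(self, other):
        raw = self._coerce(other)
        if raw is None:
            return NotImplemented
        return ebytes(bytes(self) + raw, self.encoding)

    def __radd__(self, other):
        raw = self._coerce(other)
        if raw is None:
            return NotImplemented
        return ebytes(raw + bytes(self), self.encoding)


eb = ebytes("\u65e5\u672c".encode("euc-jp"), "euc-jp")  # two kanji in EUC-JP
result = eb + "\u8a9e"           # str is encoded to EUC-JP, result is ebytes
print(type(result).__name__, result.encoding)
```

With this rule, ``ebytes(b"abc", "latin-1") + "\u8a9e"`` raises ``UnicodeEncodeError`` at the concatenation itself, which is the early failure being argued for.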
I wouldn't like this. It brings us back to the python2 problem where sometimes you pass an ebytes object into a function and it works and other times you pass an ebytes object into the function and it issues a traceback.
For stdlib functions, this isn't going to happen unless your ebytes' encoding is not compatible with the ascii subset of unicode, or the stdlib function is working with dynamic data... in which case you really *do* want to fail early!
The ebytes encoding will often be incompatible with the ascii subset. It's the reason that people were so often tempted to change the defaultencoding on python2 to utf8.
I don't see this as a repeat of the 2.x situation; rather, it allows you to cause errors to happen much *earlier* than they would otherwise show up if you were using unicode for your encoded-bytes data.
For example, if your program's intent is to end up with latin-1 output, then it would be better for an error to show up at the very *first* point where non-latin1 characters are mixed with your data, rather than only showing up at the output boundary!
That highly depends on your usage. If you're formatting a comment on a web page, checking at output and replacing with '?' is better than a traceback. If you're entering key values into a database, then you likely want to know where the non-latin1 data is entering your program, not where it's mixed with your data or the output boundary.
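The two cases can be sketched with the standard ``str.encode`` error handlers. The function names ``render_comment`` and ``check_key`` are made up for illustration; only ``str.encode`` and its ``errors`` argument are real API.

```python
def render_comment(text, encoding="latin-1"):
    """Output boundary: degrade unencodable characters to '?' instead of failing."""
    return text.encode(encoding, errors="replace")


def check_key(text, encoding="latin-1"):
    """Input boundary: fail loudly at the place the bad value enters the program."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        raise ValueError(
            f"key {text!r} is not representable in {encoding}"
        ) from exc
    return text


print(render_comment("5 \u20ac"))  # euro sign is not in latin-1 -> b'5 ?'
check_key("caf\u00e9")             # fine: e-acute exists in latin-1
```

Calling ``check_key("5 \u20ac")`` raises ``ValueError`` right where the value is accepted, rather than later at a concatenation or at output.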
However, if you promoted mixed-type operation results to unicode instead of ebytes, then you:
1) can't preserve data that doesn't have a 1:1 mapping to unicode, and
ebytes should be immutable like bytes and str. So you shouldn't lose the data if you keep a reference to it.
2) can't detect an error until your data reaches the output point in your application -- forcing you to defensively insert ebytes calls everywhere (vs. simply wrapping them around a handful of designated inputs), or else have to go right back to tracing down where the unusable data showed up in the first place.
Usually, you don't want to know where you are combining two incompatible strings. Instead, you want to know where the incompatible strings are being set in the first place. If function(a, b) tracebacks with certain combinations of a and b, I need to know where a and b are being set, not where function(a, b) is in the source code. So you need to be making input values ebytes() (or str in current python3) no matter what.
One thing that seems like a bit of a blind spot for some folks is that having unicode is *not* everybody's goal. Not because we don't believe unicode is generally a good thing or anything like that, but because we have to work with systems that flat out don't *do* unicode, thereby making the presence of (fully-general) unicode an error condition that has to be stamped out!
I think that sometimes as well. However, here I think you're in a bit of a blind spot yourself. I'm saying that making ebytes + str coerce to ebytes will only yield a traceback some of the time, which is the python2 behaviour. Having ebytes + str coerce to str will never throw a traceback as long as our implementation checks that the bytes and encoding work together from the start. Throwing an error only on some input is one of the main reasons that debugging unicode vs byte issues sucks on python2. On my box, with my dataset, everything works. Toss it up on pypi and suddenly I have a user in Japan who reports that he gets a traceback with his dataset that he can't give to me because it's proprietary, overly large, or transient.
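A rough sketch of this alternative rule, for comparison. Again ``ebytes`` is the hypothetical type under discussion, and the implementation below is an assumption made for illustration: the bytes are validated against the encoding once, at construction, so that mixing with ``str`` can always decode.

```python
class ebytes(bytes):
    """Hypothetical sketch: validate at construction, coerce to str when mixed."""

    def __new__(cls, data, encoding):
        # Fail here, where the value is created, if the bytes are not
        # valid in the claimed encoding (raises UnicodeDecodeError).
        data.decode(encoding)
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        if isinstance(other, str):
            # Decoding cannot fail: it was checked in __new__.
            return self.decode(self.encoding) + other
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, str):
            return other + self.decode(self.encoding)
        return NotImplemented


eb = ebytes(b"caf\xe9", "latin-1")
mixed = eb + " \u8a9e"      # result is a plain str, whatever the str contains
print(type(mixed).__name__)
```

Under this rule the only place an error can occur is the ``ebytes(...)`` call itself, for example ``ebytes(b"\xff", "utf-8")``, so the outcome of ``ebytes + str`` does not depend on which characters a particular dataset happens to contain.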
IOW, if you're producing output that has to go into another system that doesn't take unicode, it doesn't matter how theoretically-correct it would be for your app to process the data in unicode form. In that case, unicode is not a feature: it's a bug.
This is not always true. Suppose you read a webpage, chop it up so you get a list of words, create a histogram of word length, and then write the output as utf8 to a database. Should you do all your intermediate string operations on utf8 encoded byte strings? No, you should do them on unicode strings, as otherwise you need to know about the details of how utf8 encodes characters.
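A small illustration of the word-length point, using only the standard library. The sample text is made up; the difference in counts comes from utf8 using more than one byte for non-ASCII characters.

```python
from collections import Counter

# Sample text: "naive" with i-diaeresis, "cafe" with e-acute, three kanji.
text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e"

# Working on unicode strings: len() counts characters.
char_hist = Counter(len(word) for word in text.split())

# Working on utf8 bytes: len() counts bytes, so the histogram is skewed.
raw = text.encode("utf-8")
byte_hist = Counter(len(word) for word in raw.split())

print(sorted(char_hist))  # [3, 4, 5]
print(sorted(byte_hist))  # [5, 6, 9]
```

Getting character counts out of the byte version would require decoding anyway, or reimplementing utf8's multi-byte rules by hand.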
And as it really *is* an error in that case, it should not pass silently, unless explicitly silenced.
This is very true -- although the python3 stdlib does explicitly silence errors related to unicode in some cases. Anyhow -- IMHO, you should get a TypeError when you attempt to pass a unicode value into a function that is meant to work with bytes. (You can accept an ebytes object as well since it has a known bytes representation). -Toshio