Re: [Python-Dev] email package status in 3.X
At 08:08 AM 6/21/2010 +1000, Nick Coghlan wrote:
Perhaps if people could identify which specific string methods are causing problems?
__getitem__(int) returns an integer rather than a bytestring, so anything that manipulates individual characters can't be given bytes and have it work. That was one of the key differences I had in mind for a bstr type, apart from designing it to coerce normal strings to bstrs in cross-type operations, and to allow O(1) "conversion" to/from bytes.

Another randomly chosen byte/string incompatibility (Python 3.1; I don't have 3.2 handy at the moment):

>>> os.path.join(b'x', 'y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: Type str doesn't support the buffer API

>>> os.path.join('x', b'y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: 'in <string>' requires string as left operand, not bytes
Ironically, it seems to me that in trying to make the type distinction more rigid, Py3K fails in this area precisely because it is not a rigidly typed language in the Java or Haskell sense: i.e., os.path.join doesn't say, "I need two stringlike objects of the *same type*", not even in its docstring. At least in Java, you would either implement a "path" type with coercions from bytes and strings, or you'd have a class with overloaded methods for handling join operations on bytes and strings, respectively, thereby avoiding this whole mess.

(Alas, this little example on the 'in' operator also shows that my bstr effort would probably fail anyway, because there's no '__rcontains__' (__lcontains__?) to allow it to override the str type's __contains__.)
On Mon, Jun 21, 2010 at 11:58 AM, P.J. Eby <pje@telecommunity.com> wrote:
At 08:08 AM 6/21/2010 +1000, Nick Coghlan wrote:
Perhaps if people could identify which specific string methods are causing problems?
__getitem__(int) returns an integer rather than a bytestring, so anything that manipulates individual characters can't be given bytes and have it work.
It can if you use length one slices rather than simple indexing. Depending on the details, such algorithms may still fail for multi-byte codecs though.
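For illustration, here is the indexing-versus-slicing difference in question (Python 3):

    data = b'abc'
    print(data[0])      # 97 -- indexing bytes yields an int
    print(data[0:1])    # b'a' -- a length-one slice stays bytes
    print('abc'[0])     # 'a' -- str indexing yields a length-one str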
That was one of the key differences I had in mind for a bstr type, apart from designing it to coerce normal strings to bstrs in cross-type operations, and to allow O(1) "conversion" to/from bytes.
Erk, that just sounds like a recipe for recreating the problems 2.x has in a new form.
Another randomly chosen byte/string incompatibility (Python 3.1; I don't have 3.2 handy at the moment):
>>> os.path.join(b'x', 'y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: Type str doesn't support the buffer API

>>> os.path.join('x', b'y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Python31\lib\ntpath.py", line 161, in join
    if b[:1] in seps:
TypeError: 'in <string>' requires string as left operand, not bytes
Ironically, it seems to me that in trying to make the type distinction more rigid, Py3K fails in this area precisely because it is not a rigidly typed language in the Java or Haskell sense: i.e., os.path.join doesn't say, "I need two stringlike objects of the *same type*", not even in its docstring.
I believe it actually needs the objects to be compatible with the type of os.sep, rather than just with each other (i.e. the type restrictions on os.path.join are the same as those on os.sep.join, even though the join algorithm itself is slightly different). This restriction should be mentioned in the Py3k docstring and docs for os.path.join - if it isn't, that would be a doc bug.
At least in Java, you would either implement a "path" type with coercions from bytes and strings, or you'd have a class with overloaded methods for handling join operations on bytes and strings, respectively, thereby avoiding this whole mess.
(Alas, this little example on the 'in' operator also shows that my bstr effort would probably fail anyway, because there's no '__rcontains__' (__lcontains__?) to allow it to override the str type's __contains__.)
OK, these examples convince me that the incompatibility problem is real. However, I don't think a bstr type can solve them even without the __rcontains__ problem - it would just recreate the pain that we already have in the 2.x world.

Something that may make sense to ease the porting process is for some of these "on the boundary" I/O related string manipulation functions (such as os.path.join) to grow "encoding" keyword-only arguments. The recommended approach would be to provide all strings, but bytes could also be accepted if an encoding was specified. (If you want to mix encodings - tough, do the decoding yourself.)

For the idea of avoiding excess copying of bytes through multiple encoding/decoding calls... isn't that meant to be handled at an architectural level (i.e. decode once on the way in, encode once on the way out)? Optimising the single-byte codec case by minimising data copying (possibly through creative use of PEP 3118) may be something that we want to look at eventually, but it strikes me as something of a premature optimisation at this point in time (i.e. the old adage "first get it working, then get it working fast").

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
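A sketch of the kind of boundary helper Nick describes here -- the name and signature are hypothetical, nothing like this exists in the stdlib:

    import os.path

    def join_with_encoding(*parts, encoding=None):
        """Hypothetical boundary API: all-str works as before; bytes
        arguments are accepted only when an encoding is supplied."""
        def as_str(part):
            if isinstance(part, bytes):
                if encoding is None:
                    raise TypeError("bytes arguments require an encoding")
                return part.decode(encoding)
            return part
        return os.path.join(*(as_str(p) for p in parts))

    print(join_with_encoding('x', 'y'))                     # 'x/y' ('x\\y' on Windows)
    print(join_with_encoding(b'x', 'y', encoding='utf-8'))  # same result, bytes accepted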
On Jun 21, 2010, at 10:20 PM, Nick Coghlan wrote:
Something that may make sense to ease the porting process is for some of these "on the boundary" I/O related string manipulation functions (such as os.path.join) to grow "encoding" keyword-only arguments. The recommended approach would be to provide all strings, but bytes could also be accepted if an encoding was specified. (If you want to mix encodings - tough, do the decoding yourself).
This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz for it.

Would it make sense to have "encoding-carrying" bytes and str types? Basically, I'm thinking of types (maybe even the current ones) that carry around a .encoding attribute so that they can be automatically encoded and decoded where necessary. This at least would simplify APIs that need to do the conversion.

By default, the .encoding attribute would be some marker to indicate "I have no idea, do it explicitly" and if you combine ebytes or estrs that have incompatible encodings, you'd either throw an exception or reset the .encoding to IAmConfuzzled. But say you had an email header like:

    =?euc-jp?b?pc+l7aG8pe+hvKXrpcmhqg==?=

And code like the following (made less crappy):

-----snip snip-----
class ebytes(bytes):
    encoding = 'ascii'

    def __str__(self):
        s = estr(self.decode(self.encoding))
        s.encoding = self.encoding
        return s

class estr(str):
    encoding = 'ascii'

s = str(b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa', 'euc-jp')
b = bytes(s, 'euc-jp')
eb = ebytes(b)
eb.encoding = 'euc-jp'
es = str(eb)
print(repr(eb), es, es.encoding)
-----snip snip-----

Running this you get:

b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa' ハローワールド! euc-jp

Would it be feasible? Dunno. Would it help ease the bytes/str confusion? Dunno. But I think it would help make APIs easier to design and use because it would cut down on the encoding-keyword function signature infection.

-Barry
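For reference, the RFC 2047 encoded word above can already be decoded with the stdlib's email.header module:

    from email.header import decode_header

    raw = '=?euc-jp?b?pc+l7aG8pe+hvKXrpcmhqg==?='
    (data, charset), = decode_header(raw)   # -> (b'\xa5\xcf...', 'euc-jp')
    print(data.decode(charset))             # ハローワールド!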
On Mon, Jun 21, 2010 at 11:43:07AM -0400, Barry Warsaw wrote:
On Jun 21, 2010, at 10:20 PM, Nick Coghlan wrote:
Something that may make sense to ease the porting process is for some of these "on the boundary" I/O related string manipulation functions (such as os.path.join) to grow "encoding" keyword-only arguments. The recommended approach would be to provide all strings, but bytes could also be accepted if an encoding was specified. (If you want to mix encodings - tough, do the decoding yourself).
This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz for it.
Would it make sense to have "encoding-carrying" bytes and str types? Basically, I'm thinking of types (maybe even the current ones) that carry around a .encoding attribute so that they can be automatically encoded and decoded where necessary. This at least would simplify APIs that need to do the conversion.
By default, the .encoding attribute would be some marker to indicate "I have no idea, do it explicitly" and if you combine ebytes or estrs that have incompatible encodings, you'd either throw an exception or reset the .encoding to IAmConfuzzled. But say you had an email header like:
=?euc-jp?b?pc+l7aG8pe+hvKXrpcmhqg==?=
And code like the following (made less crappy):
-----snip snip-----
class ebytes(bytes):
    encoding = 'ascii'

    def __str__(self):
        s = estr(self.decode(self.encoding))
        s.encoding = self.encoding
        return s

class estr(str):
    encoding = 'ascii'

s = str(b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa', 'euc-jp')
b = bytes(s, 'euc-jp')
eb = ebytes(b)
eb.encoding = 'euc-jp'
es = str(eb)
print(repr(eb), es, es.encoding)
-----snip snip-----
Running this you get:
b'\xa5\xcf\xa5\xed\xa1\xbc\xa5\xef\xa1\xbc\xa5\xeb\xa5\xc9\xa1\xaa' ハローワールド! euc-jp
Would it be feasible? Dunno. Would it help ease the bytes/str confusion? Dunno. But I think it would help make APIs easier to design and use because it would cut down on the encoding-keyword function signature infection.
I like the idea of having encoding information carried with the data. I don't think that an ebytes type that can *optionally* have an encoding attribute makes the situation less confusing, though.

To me the biggest problem with python-2.x's unicode/bytes handling was not that it threw exceptions but that it didn't always throw exceptions. You might test this in python2::

    t = u'cafe'
    function(t)

And say, ah my code works. Then a user gives it this::

    t = u'café'
    function(t)

And get a unicode error because the function only works with unicode in the ascii range.

ebytes seems to have the same pitfall where the code path exercised by your tests could work with::

    eb = ebytes(b)
    eb.encoding = 'euc-jp'
    function(eb)

but the user exercises a code path that does this and fails::

    eb = ebytes(b)
    function(eb)

What do you think of making the encoding attribute a mandatory part of creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``).

-Toshio
At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
What do you think of making the encoding attribute a mandatory part of creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``).
As long as the coercion rules force str+ebytes (or str % ebytes, ebytes % str, etc.) to result in another ebytes (and fail if the str can't be encoded in the ebytes' encoding), I'm personally fine with it, although I really like the idea of tacking the encoding to bytes objects in the first place.

OTOH, one potential problem with having the encoding on the bytes object rather than the ebytes object is that then you can't easily take bytes from a socket and then say what encoding they are, without interfering with the sockets API (or whatever other place you get the bytes from). So, on balance, making ebytes a separate type (perhaps one that's just a pointer to the bytes and a pointer to the encoding) would indeed make more sense. It having different coercion rules for interacting with strings would make more sense too in that case.

(The ideal, of course, would still be to not let bytes objects be stringlike at all, with only ebytes acting string-like. That way, you'd be forced to be explicit about your encoding when working with bytes, but all you'd need to do was make an ebytes call.)
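A toy sketch of the coercion rule described here, assuming a hypothetical ebytes(data, encoding) constructor. Note that str + ebytes can be intercepted via __radd__, unlike the __contains__ case mentioned earlier in the thread:

    class ebytes(bytes):
        def __new__(cls, data, encoding):
            self = super().__new__(cls, data)
            self.encoding = encoding
            return self

        def __add__(self, other):          # ebytes + str -> ebytes
            if isinstance(other, str):
                other = other.encode(self.encoding)   # fails if not encodable
            return ebytes(bytes(self) + other, self.encoding)

        def __radd__(self, other):         # str + ebytes -> ebytes
            if isinstance(other, str):
                return ebytes(other.encode(self.encoding) + bytes(self),
                              self.encoding)
            return NotImplemented

    eb = ebytes('héllo'.encode('latin-1'), 'latin-1')
    print(type(eb + ' world'), type('say: ' + eb))  # both results are ebytes
    # eb + '\u262e' would raise UnicodeEncodeError -- failing early, as desired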
On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
What do you think of making the encoding attribute a mandatory part of creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``).
As long as the coercion rules force str+ebytes (or str % ebytes, ebytes % str, etc.) to result in another ebytes (and fail if the str can't be encoded in the ebytes' encoding), I'm personally fine with it, although I really like the idea of tacking the encoding to bytes objects in the first place.
I wouldn't like this. It brings us back to the python2 problem where sometimes you pass an ebyte into a function and it works and other times you pass an ebyte into the function and it issues a traceback. The coercion must end up with a str and no traceback (this assumes that we've checked that the ebyte and the encoding "match" when we create the ebyte). If you want bytes out the other end, you should either have a different function or explicitly transform the output from str to bytes.

So, what's the advantage of using ebytes instead of bytes?

* It keeps together the text and encoding information when you're taking bytes in and want to give bytes back under the same encoding.

* It takes some of the boilerplate that people are supposed to do (checking that bytes are legal in a specific encoding) and writes it into the initialization of the object. That forces you to think about the issue at two points in the code: when converting into ebytes and when converting out to bytes. For data that's going to be used with both str and bytes, this is the accepted best practice. (For exceptions, the bytes type remains, which you can do conversion on when you want to.)

-Toshio
On Jun 21, 2010, at 03:29 PM, Toshio Kuratomi wrote:
I wouldn't like this. It brings us back to the python2 problem where sometimes you pass an ebyte into a function and it works and other times you pass an ebyte into the function and it issues a traceback. The coercion must end up with a str and no traceback (this assumes that we've checked that the ebyte and the encoding "match" when we create the ebyte).
Doing this at ebytes construction time does have the nice benefit of getting the exception early, and because the ebytes is immutable, you could cache the results in an attribute on the ebytes. Well, immutable if the .encoding is also immutable. If that can change, then you'd have to re-run the cached decoding whenever the attribute were set, and there would be a penalty paid each time this was done.

That, plus the socket use case, does argue for a separate ebytes type.

-Barry
At 03:29 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
What do you think of making the encoding attribute a mandatory part of creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``).
As long as the coercion rules force str+ebytes (or str % ebytes, ebytes % str, etc.) to result in another ebytes (and fail if the str can't be encoded in the ebytes' encoding), I'm personally fine with it, although I really like the idea of tacking the encoding to bytes objects in the first place.
I wouldn't like this. It brings us back to the python2 problem where sometimes you pass an ebyte into a function and it works and other times you pass an ebyte into the function and it issues a traceback.
For stdlib functions, this isn't going to happen unless your ebytes' encoding is not compatible with the ascii subset of unicode, or the stdlib function is working with dynamic data... in which case you really *do* want to fail early!

I don't see this as a repeat of the 2.x situation; rather, it allows you to cause errors to happen much *earlier* than they would otherwise show up if you were using unicode for your encoded-bytes data. For example, if your program's intent is to end up with latin-1 output, then it would be better for an error to show up at the very *first* point where non-latin1 characters are mixed with your data, rather than only showing up at the output boundary!

However, if you promoted mixed-type operation results to unicode instead of ebytes, then you:

1) can't preserve data that doesn't have a 1:1 mapping to unicode, and

2) can't detect an error until your data reaches the output point in your application -- forcing you to defensively insert ebytes calls everywhere (vs. simply wrapping them around a handful of designated inputs), or else have to go right back to tracing down where the unusable data showed up in the first place.

One thing that seems like a bit of a blind spot for some folks is that having unicode is *not* everybody's goal. Not because we don't believe unicode is generally a good thing or anything like that, but because we have to work with systems that flat out don't *do* unicode, thereby making the presence of (fully-general) unicode an error condition that has to be stamped out!

IOW, if you're producing output that has to go into another system that doesn't take unicode, it doesn't matter how theoretically-correct it would be for your app to process the data in unicode form. In that case, unicode is not a feature: it's a bug. And as it really *is* an error in that case, it should not pass silently, unless explicitly silenced.
So, what's the advantage of using ebytes instead of bytes?
* It keeps together the text and encoding information when you're taking bytes in and want to give bytes back under the same encoding.

* It takes some of the boilerplate that people are supposed to do (checking that bytes are legal in a specific encoding) and writes it into the initialization of the object. That forces you to think about the issue at two points in the code: when converting into ebytes and when converting out to bytes. For data that's going to be used with both str and bytes, this is the accepted best practice. (For exceptions, the bytes type remains, which you can do conversion on when you want to.)
Hm. For the output case, I suppose that means you might also want the text I/O wrappers to be able to be strict about ebytes' encoding.
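A rough sketch of the kind of strictness PJE suggests, shown on a toy wrapper rather than the real io.TextIOWrapper (all names hypothetical, and it assumes the ebytes type under discussion):

    import io

    class StrictEbytesWriter:
        """Accept only objects whose declared .encoding matches the stream's."""
        def __init__(self, raw, encoding):
            self.raw = raw
            self.encoding = encoding

        def write(self, data):
            if getattr(data, 'encoding', None) != self.encoding:
                raise TypeError('expected ebytes with encoding %r, got %s'
                                % (self.encoding, type(data).__name__))
            return self.raw.write(bytes(data))

    # usage: StrictEbytesWriter(io.BytesIO(), 'latin-1').write(some_ebytes)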
On Mon, Jun 21, 2010 at 04:09:52PM -0400, P.J. Eby wrote:
At 03:29 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
What do you think of making the encoding attribute a mandatory part of creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``).
As long as the coercion rules force str+ebytes (or str % ebytes, ebytes % str, etc.) to result in another ebytes (and fail if the str can't be encoded in the ebytes' encoding), I'm personally fine with it, although I really like the idea of tacking the encoding to bytes objects in the first place.
I wouldn't like this. It brings us back to the python2 problem where sometimes you pass an ebyte into a function and it works and other times you pass an ebyte into the function and it issues a traceback.
For stdlib functions, this isn't going to happen unless your ebytes' encoding is not compatible with the ascii subset of unicode, or the stdlib function is working with dynamic data... in which case you really *do* want to fail early!
The ebytes encoding will often be incompatible with the ascii subset. It's the reason that people were so often tempted to change the defaultencoding on python2 to utf8.
I don't see this as a repeat of the 2.x situation; rather, it allows you to cause errors to happen much *earlier* than they would otherwise show up if you were using unicode for your encoded-bytes data.
For example, if your program's intent is to end up with latin-1 output, then it would be better for an error to show up at the very *first* point where non-latin1 characters are mixed with your data, rather than only showing up at the output boundary!
That highly depends on your usage. If you're formatting a comment on a web page, checking at output and replacing with '?' is better than a traceback. If you're entering key values into a database, then you likely want to know where the non-latin1 data is entering your program, not where it's mixed with your data or the output boundary.
However, if you promoted mixed-type operation results to unicode instead of ebytes, then you:
1) can't preserve data that doesn't have a 1:1 mapping to unicode, and
ebytes should be immutable like bytes and str. So you shouldn't lose the data if you keep a reference to it.
2) can't detect an error until your data reaches the output point in your application -- forcing you to defensively insert ebytes calls everywhere (vs. simply wrapping them around a handful of designated inputs), or else have to go right back to tracing down where the unusable data showed up in the first place.
Usually, you don't want to know where you are combining two incompatible strings. Instead, you want to know where the incompatible strings are being set in the first place. If function(a, b) tracebacks with certain combinations of a and b I need to know where a and b are being set, not where function(a, b) is in the source code. So you need to be making input values ebytes() (or str in current python3) no matter what.
One thing that seems like a bit of a blind spot for some folks is that having unicode is *not* everybody's goal. Not because we don't believe unicode is generally a good thing or anything like that, but because we have to work with systems that flat out don't *do* unicode, thereby making the presence of (fully-general) unicode an error condition that has to be stamped out!
I think that sometimes as well. However, here I think you're in a bit of a blind spot yourself. I'm saying that making ebytes + str coerce to ebytes will only yield a traceback some of the time; which is the python2 behaviour. Having ebytes + str coerce to str will never throw a traceback as long as our implementation checks that the bytes and encoding work together from the start.

Throwing an error in code only on some input is one of the main reasons that debugging unicode vs byte issues sucks on python2. On my box, with my dataset, everything works. Toss it up on pypi and suddenly I have a user in Japan who reports that he gets a traceback with his dataset that he can't give to me because it's proprietary, overly large, or transient.
IOW, if you're producing output that has to go into another system that doesn't take unicode, it doesn't matter how theoretically-correct it would be for your app to process the data in unicode form. In that case, unicode is not a feature: it's a bug.
This is not always true. If you read a webpage, chop it up so you get a list of words, create a histogram of word length, and then write the output as utf8 to a database. Should you do all your intermediate string operations on utf8 encoded byte strings? No, you should do them on unicode strings as otherwise you need to know about the details of how utf8 encodes characters.
And as it really *is* an error in that case, it should not pass silently, unless explicitly silenced.
This is very true -- although the python3 stdlib does explicitly silence errors related to unicode in some cases. Anyhow -- IMHO, you should get a TypeError when you attempt to pass a unicode value into a function that is meant to work with bytes. (You can accept an ebytes object as well since it has a known bytes representation). -Toshio
...
IOW, if you're producing output that has to go into another system that doesn't take unicode, it doesn't matter how theoretically-correct it would be for your app to process the data in unicode form. In that case, unicode is not a feature: it's a bug.
This is not always true. If you read a webpage, chop it up so you get a list of words, create a histogram of word length, and then write the output as utf8 to a database. Should you do all your intermediate string operations on utf8 encoded byte strings? No, you should do them on unicode strings as otherwise you need to know about the details of how utf8 encodes characters.
You'd still have problems in Unicode given stuff like å =~ å even though one is u'\xe5' and the other u'a\u030a' (those will look the same depending on your Unicode system; IDLE shows them pretty much the same, T-Bird on Windows with my current font shows the second as 2 characters).

I realize this was a toy example, but it does point out that Unicode complicates the idea of 'equality' as well as the idea of 'what is a character'. And just saying "decode it to Unicode" isn't really sufficient.

John
=:->
On Mon, Jun 21, 2010 at 04:52:08PM -0500, John Arbash Meinel wrote:
...
IOW, if you're producing output that has to go into another system that doesn't take unicode, it doesn't matter how theoretically-correct it would be for your app to process the data in unicode form. In that case, unicode is not a feature: it's a bug.
This is not always true. If you read a webpage, chop it up so you get a list of words, create a histogram of word length, and then write the output as utf8 to a database. Should you do all your intermediate string operations on utf8 encoded byte strings? No, you should do them on unicode strings as otherwise you need to know about the details of how utf8 encodes characters.
You'd still have problems in Unicode given stuff like å =~ å even though one is u'\xe5' and the other u'a\u030a' (those will look the same depending on your Unicode system; IDLE shows them pretty much the same, T-Bird on Windows with my current font shows the second as 2 characters).
I realize this was a toy example, but it does point out that Unicode complicates the idea of 'equality' as well as the idea of 'what is a character'. And just saying "decode it to Unicode" isn't really sufficient.
Ah -- but if you're dealing with unicode objects you can use the unicodedata.normalize() function on them to come out with the right values. If you're using bytes, it's yet another case where you, the programmer, have to know what byte sequences represent combining characters in the particular encoding that you're dealing with. -Toshio
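For illustration, the normalization Toshio refers to:

    import unicodedata

    s1 = '\xe5'       # å as one precomposed code point
    s2 = 'a\u030a'    # 'a' followed by COMBINING RING ABOVE
    print(s1 == s2)                                  # False
    print(unicodedata.normalize('NFC', s2) == s1)    # True -- composed form
    print(unicodedata.normalize('NFD', s1) == s2)    # True -- decomposed form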
On Tue, 22 Jun 2010 06:09:52 am P.J. Eby wrote:
However, if you promoted mixed-type operation results to unicode instead of ebytes, then you:
1) can't preserve data that doesn't have a 1:1 mapping to unicode,
Sounds like exactly the sort of thing the Unicode private codepoints were invented for, as Toshio suggests. In any case, if there are use-cases for text that aren't solved by Unicode, and I'm not convinced that there are, Python doesn't need to solve them. At the very least, such a solution should start off as a third-party package to prove itself before being made a part of the standard library, let alone a built-in. -- Steven D'Aprano
On Jun 21, 2010, at 01:24 PM, P.J. Eby wrote:
OTOH, one potential problem with having the encoding on the bytes object rather than the ebytes object is that then you can't easily take bytes from a socket and then say what encoding they are, without interfering with the sockets API (or whatever other place you get the bytes from).
Unless the default was the "I don't know" marker and you were able to set it after you've done whatever kind of application-level calculation you needed to do. -Barry
At 04:04 PM 6/21/2010 -0400, Barry Warsaw wrote:
On Jun 21, 2010, at 01:24 PM, P.J. Eby wrote:
OTOH, one potential problem with having the encoding on the bytes object rather than the ebytes object is that then you can't easily take bytes from a socket and then say what encoding they are, without interfering with the sockets API (or whatever other place you get the bytes from).
Unless the default was the "I don't know" marker and you were able to set it after you've done whatever kind of application-level calculation you needed to do.
True, but making it a separate type with a required encoding gets rid of the magical "I don't know" - the "I don't know" encoding is just a plain old bytes object. (In principle, you could then drop *all* the stringlike methods from plain-old-bytes objects. If it's really text-in-bytes you want, you should use an ebytes with the encoding specified.)
On Jun 21, 2010, at 04:16 PM, P.J. Eby wrote:
At 04:04 PM 6/21/2010 -0400, Barry Warsaw wrote:
On Jun 21, 2010, at 01:24 PM, P.J. Eby wrote:
OTOH, one potential problem with having the encoding on the bytes object rather than the ebytes object is that then you can't easily take bytes from a socket and then say what encoding they are, without interfering with the sockets API (or whatever other place you get the bytes from).
Unless the default was the "I don't know" marker and you were able to set it after you've done whatever kind of application-level calculation you needed to do.
True, but making it a separate type with a required encoding gets rid of the magical "I don't know" - the "I don't know" encoding is just a plain old bytes object.
(In principle, you could then drop *all* the stringlike methods from plain-old-bytes objects. If it's really text-in-bytes you want, you should use an ebytes with the encoding specified.)
Yep, agreed! -Barry
On Tue, Jun 22, 2010 at 6:16 AM, P.J. Eby <pje@telecommunity.com> wrote:
True, but making it a separate type with a required encoding gets rid of the magical "I don't know" - the "I don't know" encoding is just a plain old bytes object.
So, to boil down the ebytes idea, it is basically a request for a second string type that holds an octet stream plus an encoding name, rather than a Unicode character stream.

Calling it "ebytes" seems to emphasise the wrong parallel in that case (you have a 'str' object with a different internal structure, not any kind of bytes object). For now I'll call it an "altstr". Then the idea can be described as:

- altstr would expose the same API as str, NOT the same API as bytes
- explicit conversion via "str" would use the altstr's __str__ method
- explicit conversion via "bytes" would use the altstr's __bytes__ method
- implicit interaction with str would convert the str to an altstr object according to the altstr's rules. This may be best handled via a coercion method on altstr, rather than str actually needing to know the details (i.e. an altstr.__coerce_str__() method). For the 'ebytes' model, this would do something like "type(self)(other.encode(self.encoding), self.encoding)". The operation would then be handled by the corresponding method on the coerced object.

A new type could then override operations such as __contains__, __mod__, format() and join(). This is still smelling an awful lot like the 2.x str type to me, but supporting a __coerce_str__ method may allow some useful experimentation in this space (as PJE suggested). There's a chance it would be abused, but it offers a greater chance of success than trying to come up with a concrete altstr type without providing a means for experimentation first.
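A minimal sketch of the experimentation such a hook would enable. The protocol is hypothetical, and since str itself can't be patched here, the coercion is demonstrated from the altstr side:

    class altstr:
        def __init__(self, data, encoding):
            self.data = bytes(data)
            self.encoding = encoding

        def __coerce_str__(self, other):
            # the 'ebytes' model: encode the str operand using our encoding
            return type(self)(other.encode(self.encoding), self.encoding)

        def __contains__(self, item):
            if isinstance(item, str):
                item = self.__coerce_str__(item)   # str joins our type first
            return item.data in self.data

    seps = altstr(b'/\\', 'ascii')
    print('/' in seps)   # True -- no TypeError, unlike the ntpath example above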
(In principle, you could then drop *all* the stringlike methods from plain-old-bytes objects. If it's really text-in-bytes you want, you should use an ebytes with the encoding specified.)
Except that a lot of those string-like methods are just plain useful, even when you *know* you're dealing with an octet stream rather than latin-1 encoded text.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Jun 22, 2010, at 08:03 AM, Nick Coghlan wrote:
On Tue, Jun 22, 2010 at 6:16 AM, P.J. Eby <pje@telecommunity.com> wrote:
True, but making it a separate type with a required encoding gets rid of the magical "I don't know" - the "I don't know" encoding is just a plain old bytes object.
So, to boil down the ebytes idea, it is basically a request for a second string type that holds an octet stream plus an encoding name, rather than a Unicode character stream. Calling it "ebytes" seems to emphasise the wrong parallel in that case (you have a 'str' object with a different internal structure, not any kind of bytes object). For now I'll call it an "altstr". Then the idea can be described as
Actually no. We're introducing a second bytes type that holds an octet stream plus an encoding name. See the toy implementation I included in a previous message.

As opposed to, say, a bytes object that represented an image, which would make almost no sense to decode to unicode, this ebytes type would help bridge the gap between a pure bytes object and a pure unicode object. It would know how to accurately convert to unicode (i.e. __str__()) because it would know the encoding of the bytes. Obviously, it could convert to a pure bytes object. Because it can be accurately stringified, it can have most if not all of the str API.

-Barry
On Tue, 22 Jun 2010 08:03:58 am Nick Coghlan wrote:
On Tue, Jun 22, 2010 at 6:16 AM, P.J. Eby <pje@telecommunity.com> wrote:
True, but making it a separate type with a required encoding gets rid of the magical "I don't know" - the "I don't know" encoding is just a plain old bytes object.
So, to boil down the ebytes idea, it is basically a request for a second string type that holds an octet stream plus an encoding name, rather than a Unicode character stream.
Do any other languages have any equivalent to this ebytes type? If not, how do they deal with this issue? [...]
This is still smelling an awful lot like the 2.x str type to me
Yes. Virtually the only difference I can see is that it lets the user set a per-object default encoding to use when coercing strings to and from bytes. If this is not the case, can somebody please explain what I'm missing? -- Steven D'Aprano
Steven D'Aprano:
Do any other languages have any equivalent to this ebtyes type?
The String type in Ruby 1.9 is a byte string with an encoding attribute. Most online Ruby documentation is for 1.8, but the 1.9 API can be examined here:

http://ruby-doc.org/ruby-1.9/index.html

Here's something more explanatory:

http://blog.grayproductions.net/articles/ruby_19s_string

My view is that this actually makes things much more complex by making encoding combination an n*n problem (where n is the number of encodings) rather than an n-sized problem when you have a single core string type.

Neil
On Jun 21, 2010, at 12:34 PM, Toshio Kuratomi wrote:
I like the idea of having encoding information carried with the data. I don't think that an ebytes type that can *optionally* have an encoding attribute makes the situation less confusing, though.
Agreed. I think the attribute should always be there, but there probably needs to be a magic value (perhaps None) that indicates an unknown, manual, garbage, error, broken encoding. Examples: you read bytes off a socket and don't know what the encoding is; you concatenate two ebytes that have incompatible encodings.
To me the biggest problem with python-2.x's unicode/bytes handling was not that it threw exceptions but that it didn't always throw exceptions. You might test this in python2::

    t = u'cafe'
    function(t)

And say, ah my code works. Then a user gives it this::

    t = u'café'
    function(t)
And get a unicode error because the function only works with unicode in the ascii range.
That's an excellent point.
ebytes seems to have the same pitfall where the code path exercised by your tests could work with::

    eb = ebytes(b)
    eb.encoding = 'euc-jp'
    function(eb)

but the user exercises a code path that does this and fails::

    eb = ebytes(b)
    function(eb)
What do you think of making the encoding attribute a mandatory part of creating an ebyte object? (ex: ``eb = ebytes(b, 'euc-jp')``).
If ebytes is a separate type, then definitely +1. If 'ebytes is bytes', then I'd probably want to default the second argument to the magical "I don't know" marker.

-Barry
Barry Warsaw wrote:
On Jun 21, 2010, at 12:34 PM, Toshio Kuratomi wrote:
I like the idea of having encoding information carried with the data. I don't think that an ebytes type that can *optionally* have an encoding attribute makes the situation less confusing, though.
Agreed. I think the attribute should always be there, but there probably needs to be a magic value (perhaps None) that indicates and unknown, manual, garbage, error, broken encoding.
Examples: you read bytes off a socket and don't know what the encoding is; you concatenate two ebytes that have incompatible encodings.
Such extra information tends to be lost whenever you pass the bytes data through a C level API or some other function that doesn't know about the special nature of those objects, treating them just like any bytes object. It may sound nice in theory, but in practice it doesn't work out.

Besides, if you do know the encoding, you can easily carry the data around in a Unicode str object. The problem lies elsewhere: what to do with a piece of text for which you don't know the encoding, and how to combine that piece of text with other pieces of text for which you do know the encoding. There are a few options at hand:

* you keep working on the bytes data and only convert things to Unicode when needed and where the encoding is known

* you decode the bytes data for which you don't have the encoding information into some special Unicode form (eg. using the surrogateescape error handler) and hope that when the time comes to encode the Unicode data back into bytes, the codec supports reversing the conversion

* you manage the data as a list of Unicode str and bytes objects and don't even try to be clever about encodings of text with unknown encoding

Which of these options fits best depends a lot on the use case.
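A short illustration of the surrogateescape option (available since Python 3.1, per PEP 383):

    raw = b'caf\xff'                               # not valid UTF-8
    text = raw.decode('utf-8', 'surrogateescape')  # -> 'caf\udcff'
    print(text.encode('utf-8', 'surrogateescape') == raw)  # True -- lossless round trip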
To me the biggest problem with python-2.x's unicode/bytes handling was not that it threw exceptions but that it didn't always throw exceptions. You might test this in python2:: t = u'cafe' function(t)
And say, ah my code works. Then a user gives it this:: t = u'café' function(t)
And get a unicode error because the function only works with unicode in the ascii range.
That's an excellent point.
Here's a little known fact: by changing the Python2 default encoding to 'undefined' (yes, that's a real codec!), you can disable all automatic string coercion in Python2.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Jun 21 2010)
On Jun 21, 2010, at 4:29 PM, M.-A. Lemburg wrote:
Here's a little known fact: by changing the Python2 default encoding to 'undefined' (yes, that's a real codec !), you can disable all automatic string coercion in Python2.
I tried that once: half the stdlib stops working if you do (for example, the re module), so it's not particularly useful for checking if your own code is unicode-safe. James
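For reference, a sketch of the trick under discussion. This is Python 2 only; sys.setdefaultencoding() is deleted by site.py at startup, so this normally goes in a sitecustomize.py (and, per James's caveat, expect stdlib breakage):

    # Python 2, e.g. in sitecustomize.py:
    import sys
    sys.setdefaultencoding('undefined')   # the 'undefined' codec errors on any use

    u'a' + 'b'   # now raises UnicodeError instead of silently coercing via ascii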
At 11:43 AM 6/21/2010 -0400, Barry Warsaw wrote:
On Jun 21, 2010, at 10:20 PM, Nick Coghlan wrote:
Something that may make sense to ease the porting process is for some of these "on the boundary" I/O related string manipulation functions (such as os.path.join) to grow "encoding" keyword-only arguments. The recommended approach would be to provide all strings, but bytes could also be accepted if an encoding was specified. (If you want to mix encodings - tough, do the decoding yourself).
This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz for it.
Would it make sense to have "encoding-carrying" bytes and str types?
It's not a stupid idea, and could potentially work. It also might have a better chance of being able to actually be *implemented* in 3.x than my idea.
Basically, I'm thinking of types (maybe even the current ones) that carry around a .encoding attribute so that they can be automatically encoded and decoded where necessary. This at least would simplify APIs that need to do the conversion.
I'm not really sure how much use the encoding is on a unicode object - what would it actually mean? Hm. I suppose it would effectively mean "this string can be represented in this encoding" -- which is useful, in that you could fail operations when combining with bytes of a different encoding.

Hm... no, in that case you should just encode the string to the bytes' encoding, and let that throw an error if it fails. So, really, there's no reason for a string to know its encoding. All you need is the bytes type to have an encoding attribute, and when doing mixed-type operations between bytes and strings, coerce to *bytes of the same encoding*.

However, if .encoding is None, then coercion would follow the same rules as now -- i.e., convert the bytes to unicode, assuming an ascii encoding. (This would be different from setting an encoding of 'ascii', because in that case, it means you want cross-type operations to result in ascii bytes, rather than a unicode string, and to fail if the unicode part can't be encoded appropriately. The 'None' setting is effectively a nod to compatibility with prior 3.x versions, since I assume we can't just throw out the old coercion behavior.)

Then, a few more changes to the bytes type would round out the implementation:

* Allow .decode() to not specify an encoding, unless .encoding is None

* Add back in the missing string methods (e.g. .encode(), since you can transparently upgrade to a string)

* Smart __str__, as shown in your proposal.
Would it be feasible? Dunno.
Probably, although it might mean adding back in special cases that were previously taken out, and a few new ones.
Would it help ease the bytes/str confusion? Dunno.
Not sure what confusion you mean -- Web-SIG and I at least are not confused about the difference between bytes and str, or we wouldn't be having an issue. ;-) Or maybe you mean the stdlib's API confusion? In which case, yes, definitely!
But I think it would help make APIs easier to design and use because it would cut down on the encoding-keyword function signature infection.
Not only that, but I believe it would also retroactively make the stdlib's implementation of those APIs "correct" again, and give us One Obvious Way to work with bytes of a known encoding, while constraining any unicode that gets combined with those bytes to be validly encodable.

It also gives you an idempotent constructor for bytes of a specified encoding, that can take either a bytes of unspecified encoding, a bytes of the correct encoding, or a string that can be encoded as such.

In short, +1. (I wish it were possible to go back and make bytes non-strings and have only this ebytes or bstr or whatever type have string methods, but I'm pretty sure that ship has already sailed.)
On Jun 21, 2010, at 01:17 PM, P.J. Eby wrote:
I'm not really sure how much use the encoding is on a unicode object - what would it actually mean?
Hm. I suppose it would effectively mean "this string can be represented in this encoding" -- which is useful, in that you could fail operations when combining with bytes of a different encoding.
That's basically what I was thinking.
Hm... no, in that case you should just encode the string to the bytes' encoding, and let that throw an error if it fails. So, really, there's no reason for a string to know its encoding. All you need is the bytes type to have an encoding attribute, and when doing mixed-type operations between bytes and strings, coerce to *bytes of the same encoding*.
If ebytes were a separate type, and it did the encoding check at constructor time, and the results of the decoding were cached, then I think you would not need the equivalent of an estr type. If you had a string and knew what it could be encoded to, then you could just coerce it to an ebytes and use the cached decoded value wherever you needed it. E.g.:

>>> mystring = 'some unicode string'
>>> myencoding = 'iso-9999-foo'
>>> myebytes = ebytes(mystring, myencoding)
>>> myebytes.encoding == myencoding
True
>>> myebytes.string == mystring
True

So ebytes() could accept a str or bytes as its first argument.

>>> mybytes = b'some encoded string'
>>> myebytes = ebytes(mybytes, myencoding)
>>> mybytes == myebytes
True
>>> myebytes.encoding == myencoding
True

In the first example ebytes() encodes mystring to set the internal bytes representation. In the second example, ebytes() decodes the bytes to get the .string attribute value. In both cases, an exception is raised if the encoding/decoding fails.
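A minimal implementation that would make the session above behave as shown -- a sketch only, and it assumes a real codec name rather than the placeholder 'iso-9999-foo':

    class ebytes(bytes):
        def __new__(cls, data, encoding):
            if isinstance(data, str):
                raw = data.encode(encoding)     # fails early if not encodable
            else:
                raw = bytes(data)
            self = super().__new__(cls, raw)
            self.encoding = encoding
            self.string = raw.decode(encoding)  # fails early; cached thereafter
            return self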
However, if .encoding is None, then coercion would follow the same rules as now -- i.e., convert the bytes to unicode, assuming an ascii encoding. (This would be different than setting an encoding of 'ascii', because in that case, it means you want cross-type operations to result in ascii bytes, rather than a unicode string, and to fail if the unicode part can't be encoded appropriately. The 'None' setting is effectively a nod to compatibility with prior 3.x versions, since I assume we can't just throw out the old coercion behavior.)
Then, a few more changes to the bytes type would round out the implementation:
* Allow .decode() to not specify an encoding, unless .encoding is None
* Add back in the missing string methods (e.g. .encode()), since you can transparently upgrade to a string)
* Smart __str__, as shown in your proposal.
If my example above isn't nonsense, then __str__() would just return the .string attribute.
In short, +1. (I wish it were possible to go back and make bytes non-strings and have only this ebytes or bstr or whatever type have string methods, but I'm pretty sure that ship has already sailed.)
Maybe it's PEP time? No, I'm not volunteering. ;) -Barry
On 6/21/2010 11:43 AM, Barry Warsaw wrote:
This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz for it.
Would it make sense to have "encoding-carrying" bytes and str types?
On 2009-11-5 I posted 'Add encoding attribute to bytes' to python-ideas. It was shot down at the time. Terry Jan Reedy
At 01:36 PM 6/21/2010 -0400, Terry Reedy wrote:
On 6/21/2010 11:43 AM, Barry Warsaw wrote:
This is probably a stupid idea, and if so I'll plead Monday morning mindfuzz for it.
Would it make sense to have "encoding-carrying" bytes and str types?
On 2009-11-5 I posted 'Add encoding attribute to bytes' to python-ideas. It was shot down at the time.
AFAICT, that's mainly for lack of apparent use cases, and also for confusion. Here, the use case (restoring the polymorphy of stdlib APIs) is pretty clear.

However, if we had the string equivalent of a coercion protocol (that core strings and bytes would co-operate with), then it would enable people to write their own versions of either your idea or Barry's idea (or other things altogether), and still get the stdlib to play along.

Personally, I think ebytes() would do the trick and it'd be nice to see it in stdlib, but gaining a string coercion protocol instead might not be a bad tradeoff. ;-)
Barry Warsaw writes:
Would it make sense to have "encoding-carrying" bytes and str types?
Why limit that to bytes and str? Why not have all objects carry their serializer/deserializer around with them?

I think the answer is "no", though, because (1) it would constitute an attractive nuisance (the default would be abused, it would work fine in Kansas, and all hell would break loose in Kagoshima, simply delaying the pain and/or passing it on to third parties), and (2) you really want this under control of higher level objects that have access to some knowledge of the environment, rather than the lowest level.
At 03:08 AM 6/22/2010 +0900, Stephen J. Turnbull wrote:
Barry Warsaw writes:
Would it make sense to have "encoding-carrying" bytes and str types?
I think the answer is "no", though, because (1) it would constitute an attractive nuisance (the default would be abused, it would work fine in Kansas, and all hell would break loose in Kagoshima, simply delaying the pain and/or passing it on to third parties),
You have the proposal exactly backwards, actually. In Kagoshima, you'd pass an ebytes with your encoding to a stdlib API, and *get back an ebytes with the right encoding*, rather than an (incorrect and useless) unicode object which has lost data you need.
Why limit that to bytes and str? Why not have all objects carry their serializer/deserializer around with them?
Because it's not a serialization or deserialization. Your conceptual framework here implies that unicode objects are the real thing, and that bytes are "just" a way of transporting unicode around. But this is not the case at all, for use cases where "no, really, you *have to* work with bytes-encoded text streams". The mere release of Python 3.x will not cause all the world's applications, libraries, and protocols to suddenly work with unicode, where they did not before.

Being explicit about the encoding of the bytes you're flinging around is actually an *increase* in specificity, explicitness, robustness, and error-checking ability over the status quo for either 2.x *or* 3.x... *and* it improves these qualities for essentially *all* string-handling code, without requiring that code to be rewritten to do so. It's like getting to use the time machine, really.
and (2) you really want this under control of higher level objects that have access to some knowledge of the environment, rather than the lowest level.
This proposal actually has such a higher-level object: an ebytes. And it passes that information *through* the lowest level, in such a way as to permit the stringlike operations to be fully polymorphic, without the information being lost inside somebody else's API.
P.J. Eby writes:
In Kagoshima, you'd use pass in an ebytes with your encoding to a stdlib API, and *get back an ebytes with the right encoding*, rather than an (incorrect and useless) unicode object which has lost data you need.
How does the stdlib do that? Unless it guesses which encoding for Japanese is being used? And even if this ebytes uses Shift JIS, what makes that the "right" encoding for anything?

On the other hand, I know when *I* need some encoding, and when I figure it out I will store it in an appropriate place in my program. The problem is that for some programs it is not unlikely that I will see all of Shift JIS, EUC-JP, ISO-2022-JP, UTF-8, and UTF-16, and on a very bad day, RFC 2047, GB 2312, and Big5, too, used to encode Japanese.

It's not totally unlikely for a browser to send URLs to a server expecting UTF-8 to recover a message/rfc822 object containing ISO-2022-JP in the mail header and EUC-JP in the body. So I need to know which encoding was used by the server that sent the reply, but the ebytes can't tell me that if it fishes an URL in EUC-JP out of the message body. I need to convert that URL to UTF-8, or most servers will 404.
But this is not the case at all, for use cases where "no, really, you *have to* work with bytes-encoded text streams". The mere release of Python 3.x will not cause all the world's applications, libraries, and protocols to suddenly work with unicode, where they did not before.
Sure. That's what .encode() and .decode() are for. The problem is what to do when you don't know what to put in the parentheses, and I can't think of a use case offhand where ebytes(stuff,'garbage') does better than PEP 383-enabled str for:
Being explicit about the encoding of the bytes you're flinging around is actually an *increase* in specificity, explicitness, robustness, and error-checking ability over the status quo for either 2.x *or* 3.x... *and* it improves these qualities for essentially *all* string-handling code, without requiring that code to be rewritten to do so.
A well-spoken piece. But, you see, most of those encodings are *only* interesting so that you can transcode characters to the encoding of interest. What's the e.o.i.? That is easily found in the context or has an obvious default, if you're lucky, or otherwise a hard problem that ebytes does nothing to help solve as far as I can see.

Cf. Robert Collins' post <AANLkTinQ_d_vaHBw5IKUYY9qgjqOfFy4XCzC0DYztr9n@mail.gmail.com>, where he makes it quite explicit that a bytes interface is all about punting in the face of missing encoding information.
and (2) you really want this under control of higher level objects that have access to some knowledge of the environment, rather than the lowest level.
This proposal actually has such a higher-level object: an ebytes.
I don't see how that can be true. An ebytes is a very low-level object that has no idea whether its encoding is interesting (eg, the one that an RFC or a server specifies), or a technical detail of use only until the ebytes is decoded, then can be thrown away.

I just don't see, in the case where there is a real encoding in the ebytes, what harm is done by decoding the ebytes to str. If context indicates that the encoding is an interesting one (eg, it should be the default for encoding on output), then you want to save that in an appropriate place that preserves not just the encoding itself, but the context that gives it its importance.
On Jun 22, 2010, at 03:08 AM, Stephen J. Turnbull wrote:
Barry Warsaw writes:
Would it make sense to have "encoding-carrying" bytes and str types?
Why limit that to bytes and str? Why not have all objects carry their serializer/deserializer around with them?
Only because the .encoding attribute isn't really a serializer/deserializer. That's still bytes() and str() or the equivalent. This is just a hint to a specific serializer for parameters to that action.
I think the answer is "no", though, because (1) it would constitute an attractive nuisance (the default would be abused, it would work fine in Kansas, and all hell would break loose in Kagoshima, simply delaying the pain and/or passing it on to third parties), and (2) you really want this under control of higher level objects that have access to some knowledge of the environment, rather than the lowest level.
I'm still not sure ebytes solves the problem, but it avoids one I'm most concerned about seeing proposed. I really really do not want to add encoding=blah arguments to boatloads of function signatures. -Barry
Barry Warsaw writes:
I'm still not sure ebytes solves the problem,
I don't see how it can. If you have an encoding to stuff into ebytes, you could just convert to Unicode and guarantee that all internal string operations will succeed. If you use ebytes instead, every string operation has to be wrapped in "try ... except EBytesError", to no gain that I can see.

If you don't have an encoding, then you just have bytes, which strictly speaking shouldn't be operated on (in the sense of slicing, dicing, or stir-frying) at all if you're in an environment where they are a carrier for formatted information such as non-ASCII characters or PNG images.
but it avoids one I'm most concerned about seeing proposed. I really really do not want to add encoding=blah arguments to boatloads of function signatures.
Agreed. But ebytes isn't a solution to that; it's a regression to one of the hardest problems in Python 2.

OTOH, it seems to me that there's only one boatload to worry about. That's the boatload containing protocol-less APIs, ie, Unix OS data (names in the filesystem, content of environment variables). Other platforms (Windows, Mac) are standardizing on protocols for these things and enforcing them in the OS, and free Unices are going to the convention that everything is non-normalized UTF-8. What other boats are you worried about?
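For that one boatload, PEP 383 plus the os.fsencode()/os.fsdecode() helpers (added in Python 3.2) are essentially the resolution that emerged:

    import os

    # Undecodable filename bytes survive the bytes -> str -> bytes round trip
    # (str form shown for a UTF-8 locale):
    name = os.fsdecode(b'caf\xe9')           # -> 'caf\udce9'
    print(os.fsencode(name) == b'caf\xe9')   # True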
participants (11)

- Barry Warsaw
- James Y Knight
- John Arbash Meinel
- M.-A. Lemburg
- Neil Hodgson
- Nick Coghlan
- P.J. Eby
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Toshio Kuratomi