
On Wed, May 18, 2011 at 3:13 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Wed, May 18, 2011 at 8:27 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
On the one hand we have the 'bytes are ASCII data' type interface, and on the other we have the 'bytes are a list of integers between 0 and 255' interface.
No. Bytes are a list of integers between 0 and 255. End of story. Using them to represent text as well was precisely the problem with 2.x 8-bit strings, since the boundaries got blurred.
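To make the two interfaces under discussion concrete, here is a minimal Python 3 sketch. The sample value is made up for illustration and is not taken from the thread.

```python
data = b"GET /index.html"  # hypothetical sample payload

# Integer-sequence view: indexing and iteration yield ints in range(256).
first = data[0]            # the integer code for "G"
as_ints = list(data[:3])   # the first three byte values as ints

# Slicing, unlike indexing, yields a bytes object again.
first_slice = data[:1]

# Text-like view: ASCII-aware methods work directly on bytes.
upper = data.upper()
starts = data.startswith(b"GET")

# Values outside 0..255 are rejected when constructing bytes.
try:
    bytes([256])
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
```

Both views apply to the same object, which is the overlap the two positions above disagree about.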
However, as a matter of practicality, many byte-oriented protocols use ASCII to make elements of the protocol readable by humans. The "text-like" elements of the bytes and bytearray types are a concession to the existence of those protocols. However, that doesn't make them text - they're still binary data streams. If you want to treat them as text, convert them to "str" objects first (e.g. that's what urllib.parse.urlparse does internally in order to operate on bytes and bytearray instances).
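A rough sketch of the "convert to str first" approach described above, assuming ASCII-only input. The URL is a made-up example; the second half only shows that urlparse accepts bytes and hands bytes back, without making any claim about how it gets there.

```python
from urllib.parse import urlparse

raw = b"http://example.com/path?q=1"  # hypothetical input

# Explicit route: decode to str, operate on text, encode back if needed.
text = raw.decode("ascii")
parts = urlparse(text)
host_bytes = parts.netloc.encode("ascii")

# urlparse also accepts bytes directly and returns bytes components.
bparts = urlparse(raw)
```

In the explicit route the caller chooses the encoding; in the bytes route the components come back as bytes with no str visible to the caller.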
This is not a useful argument - it's an implementation choice in Python 3, and urlparse converting bytes to 'str' to operate on them is at best a kludge - you're forcing 5 times the storage (the original bytes + 4 bytes-per-byte when it's decoded into unicode) to work on something which is defined as a BNF *that uses ASCII*. The Python 2 confusion was deplorable, but that doesn't make the Python 3 situation better: it's different, but it is still very awkward for people to write correct, fast code in. It's probably too late to change, but please don't try to argue that it's correct: the continued confusion of folk running into this is evidence that confusion *is happening*. Treat that as evidence and think about how to fix it going forward.

-Rob
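The "5 times the storage" figure can be reproduced with a back-of-the-envelope sketch, assuming a fixed 4 bytes per code point for the decoded string (modeled here with UTF-32) and ignoring per-object overhead. That is an assumption about the string representation: interpreters that store ASCII-only strings more compactly will use less. The payload is hypothetical.

```python
data = b"GET /index.html HTTP/1.1"  # hypothetical ASCII payload

original = len(data)

# Assumption: 4 bytes per code point, modeled with UTF-32 (little-endian, no BOM).
decoded = len(data.decode("ascii").encode("utf-32-le"))

# Original bytes plus the decoded copy, relative to the original alone.
ratio = (original + decoded) / original
```

Under that assumption `decoded` is exactly four times `original`, so holding both copies costs five times the original size.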