On Sat, May 28, 2011 at 10:55 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
> Nick Coghlan wrote:
>> The pedagogic cost of making it even harder than it already is to convince people that bytes are not text would also need to be considered.
>
> I think that boat was missed some time ago. If there were ever a serious intention to teach people that bytes are not text by limiting the feature set of bytes, it would have been better served by not giving bytes *any* features that assumed a particular encoding.
>
> As it is, bytes has quite a lot of features that implicitly treat it as ASCII-encoded text: the literal and repr() forms, capitalize(), expandtabs(), lower(), splitlines(), swapcase(), title(), upper(), and all the is*() methods.
>
> Accepting all of that, and then saying "Oh, no, we couldn't possibly provide a format() method, because bytes are not text" seems a tad inconsistent.
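[For concreteness, here is a quick illustration of the methods Greg lists; all of them exist on Python 3 bytes objects and assume ASCII semantics:]

    # All of these are real methods on Python 3 bytes objects; each one
    # treats the underlying data as ASCII-encoded text.
    data = b"hello, World\tx"

    print(data.upper())                  # b'HELLO, WORLD\tX'
    print(data.capitalize())             # b'Hello, world\tx'
    print(data.title())                  # b'Hello, World\tX'
    print(data.swapcase())               # b'HELLO, wORLD\tX'
    print(data.expandtabs(4))            # tab stops only make sense for text
    print(b"line1\nline2".splitlines())  # [b'line1', b'line2']
    print(b"abc".isalpha(), b"123".isdigit())  # True True
    print(b"caf\xc3\xa9")                # repr shows ASCII bytes as characters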
Originally we didn't have all of that - more and more of it crept back in at the behest of several binary protocol folks (including me, if I recall correctly). The urllib.parse experience has convinced me that giving in to that pressure was a mistake. We went for a premature optimisation, and screwed up the bytes API as a result.

Yes, there is a potential performance issue with the decode/process/encode model, but simply keeping a bunch of string methods in the bytes API was the wrong answer (and something that isn't actually all that useful in practice, for the reasons brought up in this and other recent threads).

Perhaps it is time to resurrect the idea of an explicit 'ascii' type? Add a'' literals, support the full string API as well as the bytes API, and deprecate all string APIs on bytes and bytearray objects.

The other thing I have learned in trying to deal with some of these issues is that ASCII-encoded text really *is* special, compared to all other encodings, due to its widespread use in a multitude of networking protocols and other formats.

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia
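[A minimal sketch of the decode/process/encode model mentioned above, assuming the input is known to be ASCII, as it is for many wire protocols; the header-normalising helper is purely illustrative and not part of any stdlib API:]

    raw = b"Content-Type: TEXT/HTML; charset=UTF-8\r\n"

    def normalise_header(line: bytes) -> bytes:
        # decode: bytes -> str, using the encoding the protocol mandates
        text = line.decode("ascii")
        # process: use the full str API, including format()
        name, _, value = text.partition(":")
        result = "{}: {}".format(name.strip().title(), value.strip().lower())
        # encode: str -> bytes again for the wire
        return result.encode("ascii") + b"\r\n"

    print(normalise_header(raw))  # b'Content-Type: text/html; charset=utf-8\r\n'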