
On Sat, May 28, 2011 at 12:23 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
Greg Ewing wrote:
How would ascii behave when mixed with unicode strings? Should it automatically coerce to unicode, or should an explicit decode() be required?
And what happens when a char > 127 hits the ascii stream?
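
For reference, here is a minimal sketch of how the existing Python 3 bytes type answers both questions today. It shows current bytes/str behaviour only, not the proposed ascii type; the variable names are purely illustrative.

```python
text = "caf\u00e9"        # a str containing one non-ASCII character
data = b"abc"             # plain 7-bit bytes

# Mixing bytes and str: Python 3 refuses to coerce implicitly.
try:
    data + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit decode() is required before the two can be combined.
combined = data.decode("ascii") + text

# A byte above 127 cannot be decoded as ASCII at all.
try:
    b"caf\xe9".decode("ascii")
    high_ok = True
except UnicodeDecodeError:
    high_ok = False

print(mixed_ok, combined, high_ok)
```

Any new ascii type would have to either keep these two failure modes or pick a different answer for each.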
These are the kinds of questions that make it clear that the answer here is far from being as simple as merely adding more string methods to the existing bytes type. The underlying data model is simply *wrong* for working with bytes as if they were text.

For a previous, more flexible, incarnation of this idea, Barry's post is the earliest record I found of a byte-sequence-oriented type that carries its encoding metadata along with it:
http://mail.python.org/pipermail/python-dev/2010-June/100777.html

However, supporting multi-byte codes (and other stateful codecs like ShiftJIS) poses problems for slicing operations (just as it does for us already in Unicode slicing). Hence the possibility of strictly limiting this to 7-bit ASCII - the main problem with most bytes-as-text suggestions is that they only work for some subset of the codecs available in the standard library, and it generally isn't clear which codecs will work and which ones won't.

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
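
To make the slicing problem concrete, here is a small sketch using UTF-8 as a stand-in for any multi-byte encoding (the choice of UTF-8 and the sample word are assumptions made for illustration, not part of the original proposal). A byte-level slice can land in the middle of a character, whereas every slice of 7-bit ASCII data is still valid ASCII.

```python
word = "na\u00efve"                 # "naive" with a diaeresis on the i
utf8 = word.encode("utf-8")         # the accented letter takes two bytes

# Slicing by byte offset can cut a multi-byte character in half.
try:
    utf8[:3].decode("utf-8")        # b"na\xc3" ends mid-character
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# With 7-bit ASCII, one byte is one character, so any slice decodes.
ascii_slice = b"naive"[:3].decode("ascii")

print(len(utf8), split_ok, ascii_slice)
```

A type that carried arbitrary encoding metadata would need to detect or forbid slices of the first kind; restricting it to ASCII sidesteps the issue.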