[Python-ideas] Adding 'bytes' as alias for 'latin_1' codec.

Nick Coghlan ncoghlan at gmail.com
Sat May 28 12:47:48 CEST 2011


On Sat, May 28, 2011 at 12:23 PM, Ethan Furman <ethan at stoneleaf.us> wrote:
> Greg Ewing wrote:
>> How would ascii behave when mixed with unicode strings? Should it
>> automatically coerce to unicode, or should an explicit decode()
>> be required?
>
> And what happens when a char > 127 hits the ascii stream?

These are the kinds of questions that make it clear that the answer
here is far from being as simple as merely adding more string methods
to the existing bytes type. The underlying data model is simply
*wrong* for working with bytes as if they were text.
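
As a rough illustration of that mismatch, here is a minimal sketch
using only the existing Python 3 str and bytes types (not the
hypothetical ascii type being discussed). The variable names are
made up for the example.

```python
# Sketch: how Python 3's bytes differs from str at the data-model level.
data = b"abc"
text = "abc"

# Indexing bytes yields an integer, while indexing str yields a
# length-1 string. Only slicing bytes gives back a bytes object.
first_byte = data[0]        # an int
first_char = text[0]        # a str
first_slice = data[0:1]     # a bytes object

# Mixing the two types is never coerced implicitly; an explicit
# decode() (or encode()) is required.
try:
    text + data
    mixing_error = None
except TypeError as exc:
    mixing_error = type(exc).__name__

joined = text + data.decode("ascii")
```

The explicit decode() in the last line is the step that any
auto-coercing design would have to perform implicitly, which is
exactly where the questions above come from.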

For a previous, more flexible, incarnation of this idea, Barry's post
is the earliest record I found of a byte-sequence-oriented type that
carries its encoding metadata along with it:
http://mail.python.org/pipermail/python-dev/2010-June/100777.html

However, supporting multi-byte codecs such as ShiftJIS (and stateful
codecs more generally) poses problems for slicing operations (just as
it already does for us in Unicode slicing).
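
To make the slicing problem concrete, the following is a small sketch
with today's bytes type. The helper decodes_cleanly() is invented for
this example and is not part of any proposal in this thread.

```python
# Sketch: byte-level slicing can cut a multi-byte character in half.
def decodes_cleanly(raw: bytes, encoding: str) -> bool:
    """Return True if raw is a complete, valid sequence in encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

# Three kanji, written with escapes so the source stays pure ASCII.
text = "\u65e5\u672c\u8a9e"

sjis = text.encode("shift_jis")   # two bytes per character here
utf8 = text.encode("utf-8")       # three bytes per character here

sjis_whole_ok = decodes_cleanly(sjis, "shift_jis")
# Slicing at an odd byte offset leaves a dangling lead byte.
sjis_cut_ok = decodes_cleanly(sjis[:3], "shift_jis")
# The same thing happens with UTF-8 when the cut is mid-character.
utf8_cut_ok = decodes_cleanly(utf8[:4], "utf-8")
```

A type that carried its encoding along would have to either reject
such slices or translate character offsets into byte offsets itself.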

Hence the possibility of strictly limiting this to 7-bit ASCII - the
main problem with most bytes-as-text suggestions is that they only
work for some subset of the codecs available in the standard library,
and it generally isn't clear which codecs will work and which ones
won't.
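
For reference, this is roughly what a strict 7-bit restriction looks
like with the current types. It is only a sketch: is_7bit_ascii() is
a hypothetical helper, and the sample byte strings are made up.

```python
# Sketch: what "strictly 7-bit ASCII" means for existing bytes objects.
def is_7bit_ascii(raw: bytes) -> bool:
    """Return True if every byte in raw is below 128."""
    return all(byte < 128 for byte in raw)

plain = b"GET / HTTP/1.1"
high = b"caf\xe9"            # contains one byte above 127

plain_ok = is_7bit_ascii(plain)
high_ok = is_7bit_ascii(high)

# The ascii codec refuses the high byte outright...
try:
    high.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ...whereas latin_1 maps every byte 0-255 to the code point with the
# same number, so it never fails and always round-trips.
as_latin = high.decode("latin_1")
round_trip = as_latin.encode("latin_1")
```

The ascii codec gives a clear failure for bytes above 127, while
latin_1 accepts anything, which is why it keeps coming up as the
"treat bytes as text" escape hatch in the first place.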

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


