[Python-Dev] str object going in Py3K

Wed Feb 15 13:19:02 CET 2006

Adam Olsen wrote:
> On 2/14/06, Just van Rossum <just at letterror.com> wrote:
>   
>> +1 for two functions.
>>
>> My choice would be open() for binary and opentext() for text. I don't
>> find that backwards at all: the text function is going to be more
>> different from the current open() function than the binary function
>> would be since in many ways the str type is closer to bytes than to
>> unicode.
>>
>> Maybe it's even better to use opentext() AND openbinary(), and deprecate
>> plain open(). We could even introduce them at the same time as bytes()
>> (and leave the open() deprecation for 3.0).
>>     
>
> Thus providing us with a transition period, even with warnings on use
> of the old function.
>   
[snip..]

I personally like the move towards all-unicode strings: basically, any
text whose encoding you don't know is 'random binary data'.
This works fine, so long as you are in control of the text source.
*However*, it leaves the following problem:

The current situation (treating byte sequences as text and assuming they
are text encoded in an ASCII superset) *works* (albeit with many
breakages), simply because this assumption is usually correct.

Forcing the programmer to be aware of encodings also pushes the same
requirement onto the user (who is often the source of the text in question).

Currently you can read a text file and process it, making sure that any
changes/requirements only use ASCII characters. It therefore doesn't
matter which 8-bit ASCII-superset encoding is used in the original. If
you force the programmer to specify the encoding in order to read the
file, they would have to pass that requirement on to their user. Their
user is even less likely to be encoding-aware than the programmer.
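
As a rough illustration of the point above, here is a minimal sketch
written against present-day Python 3 syntax. The helper name
replace_ascii_token and the sample text are hypothetical, made up for
this example. It edits an ASCII-only token in raw bytes and never needs
to know which ASCII-superset encoding the rest of the file uses:

```python
def replace_ascii_token(data: bytes, old: str, new: str) -> bytes:
    """Replace an ASCII-only token in raw bytes without decoding them."""
    # str.encode("ascii") raises UnicodeEncodeError if a token is not
    # pure ASCII, so non-ASCII edits are rejected up front.
    return data.replace(old.encode("ascii"), new.encode("ascii"))


# The same logical text stored in three different ASCII-superset encodings.
text = "name = café\n"
for encoding in ("latin-1", "cp437", "utf-8"):
    raw = text.encode(encoding)
    edited = replace_ascii_token(raw, "name", "title")
    # Decoding is done here only to check the result; the edit itself
    # never used the encoding.
    assert edited.decode(encoding) == "title = café\n"
```

The non-ASCII bytes pass through untouched. This only holds for
encodings that are ASCII supersets; it would not work for UTF-16, for
example.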

What this means is that for simple programs, where the programmer
doesn't want to have to worry about encoding or can't force the user to
be aware of it, they will read in the file as bytes. Modules will quickly
and inevitably be created implementing all the 'string methods' for bytes.
New programmers will gravitate to these and the old mess will continue,
but with a more awkward hybrid than before. (String manipulations of
byte sequences will no longer be a core part of the language, and so will
be harder to use.)
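
To make the two routes concrete, here is a small sketch. It assumes
present-day Python 3's open() rather than the opentext()/openbinary()
names proposed in this thread, and the sample file contents are
hypothetical. Reading as text with a wrong encoding guess fails, while
reading as bytes lets ASCII-only parsing go ahead:

```python
import os
import tempfile

# Hypothetical sample: a Latin-1 file whose encoding the program is not told.
raw = "id: 1\nname: Zoë\n".encode("latin-1")
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(raw)
    path = f.name

try:
    # Text route with a wrong guess: decoding fails on the non-ASCII byte.
    try:
        with open(path, encoding="ascii") as fh:
            fh.read()
        decoded_ok = True
    except UnicodeDecodeError:
        decoded_ok = False

    # Bytes route: no encoding needed, and ASCII-only parsing still works.
    with open(path, "rb") as fh:
        fields = dict(line.split(b": ", 1) for line in fh.read().splitlines())
finally:
    os.remove(path)
```

Here the bytes route keeps the name value as undecoded bytes, which is
exactly the kind of 'string methods on bytes' usage described above.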

Not sure what we can do to obviate this, of course... but is this change
actually going to improve the situation or make it worse?

All the best,

Michael Foord