<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Adam Olsen wrote:

<blockquote

 cite="midaac2c7cb0602142014t553463c5g3f98922700aa734b@mail.gmail.com"

 type="cite">

  <pre wrap="">On 2/14/06, Just van Rossum <a class="moz-txt-link-rfc2396E" href="mailto:just@letterror.com">&lt;just@letterror.com&gt;</a> wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">+1 for two functions.

My choice would be open() for binary and opentext() for text. I don't

find that backwards at all: the text function is going to be more

different from the current open() function than the binary function

would be since in many ways the str type is closer to bytes than to

unicode.

Maybe it's even better to use opentext() AND openbinary(), and deprecate

plain open(). We could even introduce them at the same time as bytes()

(and leave the open() deprecation for 3.0).

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Thus providing us with a transition period, even with warnings on use

of the old function.

  </pre>

</blockquote>

[snip..]<br>

<br>

I personally like the move towards all-unicode strings: any text whose

encoding you don't know is basically 'random binary data'. This works

fine, so long as you are in control of the text source. *However*, it

leaves the following problem:<br>

<br>

The current situation (treating byte-sequences as text and assuming

they are text in some ascii-superset encoding) *works* (albeit with

many breakages), simply because this assumption is usually correct.<br>

<br>

Forcing the programmer to be aware of encodings also pushes the same

requirement onto the user (who is often the source of the text in

question).<br>

<br>

Currently you can read a text file and process it, so long as any

changes you make use only ascii characters. It therefore doesn't

matter which 8-bit ascii-superset encoding is used in the original.

If you force the programmer to specify the encoding in order to read

the file, they have to pass that requirement on to their user, who is

even less likely to be encoding-aware than the programmer.<br>

<br>
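
A minimal sketch of that ascii-only approach, written for present-day

Python 3 and assuming a single-byte ascii-superset encoding. The helper

name replace_ascii_token and the sample data are hypothetical and are

not from this thread:<br>

<pre wrap="">
```python
def replace_ascii_token(data, old, new):
    """Replace an ASCII-only token in raw bytes without decoding them."""
    # Both tokens must be pure ASCII, so encoding them cannot fail for
    # valid input and bytes above 0x7F in the data are never interpreted.
    old_bytes = old.encode("ascii")
    new_bytes = new.encode("ascii")
    return data.replace(old_bytes, new_bytes)


# The same line of text stored in two different 8-bit ASCII-superset
# encodings; the accented letter is a different byte in each one.
text = "name = caf\u00e9"
latin1_data = text.encode("iso-8859-1")
cp437_data = text.encode("cp437")

# The program edits both without knowing (or asking) which encoding it has.
latin1_result = replace_ascii_token(latin1_data, "name", "label")
cp437_result = replace_ascii_token(cp437_data, "name", "label")

# Decoding here is only to display the result; the edit itself never decoded.
print(latin1_result.decode("iso-8859-1"))
print(cp437_result.decode("cp437"))
```
</pre>

Both inputs come out with the same edited line, even though the accented

character is stored as a different byte in each, and neither the

programmer nor the user had to name an encoding.<br>

<br>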

What this means is that for simple programs, where the programmer

doesn't want to have to worry about encoding or can't force the user

to be aware of it, they will read in the file as bytes. Modules will

quickly and inevitably be created implementing all the 'string methods'

for bytes. New programmers will gravitate to these and the old mess

will continue, but with a more awkward hybrid than before. (String

manipulations of byte sequences will no longer be a core part of the

language - and so will be harder to use.)<br>

<br>

Not sure what we can do to obviate this, of course... but is this change

actually going to improve the situation or make it worse?<br>

<br>

All the best,<br>

<br>

Michael Foord<br>

</body>

</html>