[Python-Dev] PEP 263 - default encoding

Guido van Rossum guido@python.org
Fri, 15 Mar 2002 22:58:46 -0500


>     >> a. Does this really make sense for UTF-16?  It looks to me like
>     >> a great way to induce bugs of the form "write a unicode literal
>     >> containing 0x0A, then translate it to raw form by stripping the
>     >> u prefix."
> 
>     Guido> Of course not. I don't expect anyone to put UTF-16 in their
>     Guido> source encoding cookie.
> 
> Mr. Suzuki's friends.  People who use UTF-16 strings in other
> applications (eg Java), but otherwise are happy with English.

I don't understand the mechanics, unless they encode the entire file
in UTF-16.  And then Python can't parse it, because it assumes ASCII.

I think even Mr. Suzuki isn't thinking of using UTF-16 in his Unicode
literals.  He currently sets UTF-16 as the default encoding for data
that he presumably reads from a file.

>     Guido> But should we bother making a list of encodings that
>     Guido> shouldn't be used?
> 
> I would say yes.  People will find reasons to inflict harm on
> themselves if you don't.

Any file that's encoded in an encoding (such as UTF-16) that's not an
ASCII superset is unparseable for Python -- Python would never even
get to the point of recognizing the comment with the encoding cookie.
So I doubt that this will be a problem.  It's like the danger of
hitting yourself in the head with a 16-ton weight -- in order to swing
it, you'd first have to lift it...
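
To make that concrete, here is a minimal sketch, written in Python 3
syntax rather than the Python 2 syntax used elsewhere in this message.
The find_cookie helper and its pattern are hypothetical, simplified
stand-ins for whatever the parser would really do; the point is only
that an ASCII-oriented scan cannot see a cookie in a UTF-16 file.

```python
import re

# Hypothetical, simplified cookie pattern (an assumption for illustration).
COOKIE_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def find_cookie(raw_bytes):
    """Return the declared encoding name, or None if no cookie is visible."""
    match = COOKIE_RE.search(raw_bytes)
    return match.group(1).decode("ascii") if match else None

cookie_line = "# -*- coding: koi8-r -*-\n"

# KOI8-R is an ASCII superset, so the comment survives as plain ASCII bytes.
as_koi8r = cookie_line.encode("koi8-r")

# UTF-16 adds a BOM and a NUL byte next to every ASCII character,
# so the ASCII pattern never matches.
as_utf16 = cookie_line.encode("utf-16")

print(find_cookie(as_koi8r))   # the cookie is found
print(find_cookie(as_utf16))   # nothing is found
```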

The other interpretation (that they would use UTF-16 inside u"" and
ASCII elsewhere) is just as insane, since no person implementing a
text editor with any form of encoding support would be insane enough
to support such a mixed-mode encoding.

>     >> b. No editor is likely to implement correct display to
>     >> distinguish between u"" and just "".
> 
>     Guido> That's fine.  Given phase 2, the editor should display the
>     Guido> entire file using the encoding given in the cookie, despite
>     Guido> that phase 1 only applies the encoding to u"" literals.
>     Guido> The rest of the file is supposed to be ASCII, and if it
>     Guido> isn't, that's the user's problem.
> 
> Huh?  I thought that people were regularly putting arbitrary text into
> ordinary strings, and that the whole purpose of this PEP was to extend
> that practice to Unicode.
> 
> Are you going to deprecate the practice of putting KOI8-R into
> ordinary strings?  This means that Cyrillic users have to stop doing
> that, change the string to Unicode, and apply codecs on IO.  They
> aren't going to bother in phase 1, will have a rude surprise in phase
> 2.  That's human nature, of course, but I don't see how it serves
> Python to risk it.

I wasn't clear on what you meant (see below).

I think this will actually work.  Suppose someone uses KOI8-R.
Presumably they have an editor that reads, writes and displays
KOI8-R, and whatever displays Python's stdout will by default also
assume KOI8-R.

Thus, if their program contains

    k = "...some KOI8-R string..."
    print k

it will print what they want.  If they write this:

    u = unicode(k, "koi8-r")

it will also do what they want.  Currently, if they write

    u = u"...some KOI8-R string..."

it won't work, but with the PEP, in phase 1, it will do the right
thing as long as they add a KOI8-R cookie to the file.  The treatment
of the 8-bit string assigned to k will not change in phase 1.

But the treatment of k under phase 2 will be, um, interesting, and I'm
not sure what it should do!!!  Since in phase 2 the entire file will
be decoded from KOI8-R to Unicode before it's parsed, maybe the best
thing would be to encode 8-bit string literals back using KOI8-R (in
general, the encoding given in the encoding cookie).
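
A rough sketch of that round trip, in Python 3 syntax (where 8-bit
strings are spelled as bytes).  COOKIE_ENCODING and the quote-splitting
"parser" are assumptions made up for illustration, not a proposed
implementation; the sketch only checks that decoding the whole file
and then re-encoding the literal gives back the original bytes.

```python
COOKIE_ENCODING = "koi8-r"  # assumed value of the file's encoding cookie

# The 8-bit literal as it sits on disk: raw KOI8-R bytes between the quotes.
k_bytes = "\u041f\u0440\u0438\u0432\u0435\u0442".encode(COOKIE_ENCODING)
source_on_disk = b'k = "' + k_bytes + b'"\n'

# Phase 2, step 1: decode the entire file to Unicode before parsing.
source_text = source_on_disk.decode(COOKIE_ENCODING)

# Crude stand-in for the parser: take the text between the two quotes.
literal_text = source_text.split('"')[1]

# Phase 2, step 2 (the suggestion above): encode the 8-bit literal
# back using the encoding named in the cookie.
round_tripped = literal_text.encode(COOKIE_ENCODING)

print(round_tripped == k_bytes)
```

If this holds in general, k would carry the same bytes under phase 2
as it does today, so "print k" would keep working on a KOI8-R terminal.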

    *** MAL, can you think about this? ***

>     >> e. This causes problems for UTF-8 transition, since people will
>     >> want to put arbitrary byte strings in a raw string.
> 
>     Guido> I'm not sure I understand.  What do you call a raw string?
>     Guido> Do you mean an r"" literal?  Why would people want to use
>     Guido> that for arbitrary binary data?  Arbitrary binary data
>     Guido> should *always* be encoded using \xDD hex or \OOO octal
>     Guido> escapes.
> 
> raw -> non-Unicode here.  Incorrect usage, my apologies.  "Arbitrary"
> was the wrong word too, I mean non-UTF-8.  Eg, iso-8859-1 0xFF.  I
> would have no problem with requiring people to use escapes to write
> non-English strings.  But the whole point of this PEP is to allow
> people to write those in their native encodings (for Unicode strings).
> People are going to continue to squirt implicitly coded octet-strings
> at their terminals (which just happen to have an appropriate font
> installed<wink>) and expect it to work.

How about the solution I suggested above?  Basically, the encoding
used for 8-bit string literals had better match the encoding cookie
used for the source file, otherwise all bets are off.  But this should
match common usage -- all people have to do is add the encoding cookie
to their file.
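
Here is a small sketch of what "all bets are off" could look like when
the bytes in the file and the cookie disagree.  It is in Python 3
syntax, and decode_with_cookie is a hypothetical helper invented for
this example, not part of the PEP.

```python
def decode_with_cookie(raw_bytes, cookie_encoding):
    """Decode source bytes the way the cookie claims; None if that fails."""
    try:
        return raw_bytes.decode(cookie_encoding)
    except UnicodeDecodeError:
        return None

word = "\u043c\u0438\u0440"          # a short Cyrillic word
koi8_bytes = word.encode("koi8-r")   # what the user's editor wrote to disk

# Cookie matches the bytes: the intended text comes back.
matching = decode_with_cookie(koi8_bytes, "koi8-r")

# Cookie says UTF-8, bytes are KOI8-R: decoding fails outright.
wrong_cookie = decode_with_cookie(koi8_bytes, "utf-8")

# Cookie says ISO-8859-1: decoding "succeeds" but yields the wrong text.
mojibake = decode_with_cookie(koi8_bytes, "iso-8859-1")

print(matching == word, wrong_cookie is None, mojibake == word)
```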

> AFAICT this interpretation of the PEP saves no pain, simply postpones
> it.  Worse, people who don't understand it fully are going to believe
> it sanctions arbitrary encodings in string literals.

IMO, only one arbitrary encoding will be used per user -- his/her
favorite, default, and that's what they'll put in their encoding
cookie once we train them properly.

>                                                       I don't see how
> you can avoid widespread misunderstanding of that sort unless you have
> the parser refuse to execute the program---it may actually increase
> the pain when phase 2 starts.
> 
>     Guido> Sounds like a YAGNI to me.
> 
> Could be.  I'm sorry I can't be less fuzzy about the specific points.
> But then, that's the whole problem, really---we're trying to serve
> natural language usage which is inherently fuzzy.
> 
> I see lots of potential problems in interpretation of this PEP by the
> people it's intended to serve: those who are attached to some native
> encoding.  Better to raise each now, and have the scorn it deserves
> heaped high, than to say "we coulda guessed this would happen" later.
> 
> If you think it's getting too abstract to be useful, I'll be quiet
> until I've got something more concrete.  I'm hoping the discussion
> seems useful despite the fuzz.

Same here.  If you still think it's necessary, maybe you can try to
express exactly when you would want a program to be declared illegal
because of expected problems in phase 2?

--Guido van Rossum (home page: http://www.python.org/~guido/)