[Python-Dev] PEP 263 - default encoding
Guido van Rossum
guido@python.org
Fri, 15 Mar 2002 22:58:46 -0500
> >> a. Does this really make sense for UTF-16? It looks to me like
> >> a great way to induce bugs of the form "write a unicode literal
> >> containing 0x0A, then translate it to raw form by stripping the
> >> u prefix."
>
> Guido> Of course not. I don't expect anyone to put UTF-16 in their
> Guido> source encoding cookie.
>
> Mr. Suzuki's friends. People who use UTF-16 strings in other
> applications (eg Java), but otherwise are happy with English.
I don't understand the mechanics, unless they encode the entire file
in UTF-16. And then Python can't parse it, because it assumes ASCII.
I think even Mr. Suzuki isn't thinking of using UTF-16 in his Unicode
literals. He currently sets UTF-16 as the default encoding for data
that he presumably reads from a file.
> Guido> But should we bother making a list of encodings that
> Guido> shouldn't be used?
>
> I would say yes. People will find reasons to inflict harm on
> themselves if you don't.
Any file that's encoded in an encoding (such as UTF-16) that's not an
ASCII superset is unparseable for Python -- Python would never even
get to the point of recognizing the comment with the encoding cookie.
So I doubt that this will be a problem. It's like the danger of
hitting yourself in the head with a 16-ton weight -- in order to swing
it, you'd first have to lift it...
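To make the point concrete, here's a hedged sketch (in today's Python 3 terms, not the Python 2 of this thread) of why a non-ASCII-superset encoding defeats the cookie mechanism: a parser scanning raw bytes for an ASCII '#' comment finds one in any ASCII superset, but a UTF-16 file starts with a BOM and NUL-padded code units, so the comment is never recognized.

```python
# Sketch (Python 3 syntax; assumption: little-endian BOM from the
# "utf-16" codec).  The same cookie line, encoded two ways.
cookie = "# -*- coding: utf-16 -*-\n"

as_koi8 = cookie.encode("koi8-r")   # ASCII superset: '#' is still byte 0x23
as_utf16 = cookie.encode("utf-16")  # starts with a BOM, NULs between chars

assert as_koi8[:1] == b"#"          # the cookie is discoverable
assert as_utf16[:1] != b"#"         # the parser never even sees a comment
```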
The other interpretation (that they would use UTF-16 inside u"" and
ASCII elsewhere) is just as insane, since no person implementing a
text editor with any form of encoding support would be crazy enough
to support such a mixed-mode encoding.
> >> b. No editor is likely to implement correct display to
> >> distinguish between u"" and just "".
>
> Guido> That's fine. Given phase 2, the editor should display the
> Guido> entire file using the encoding given in the cookie, despite
> Guido> that phase 1 only applies the encoding to u"" literals.
> Guido> The rest of the file is supposed to be ASCII, and if it
> Guido> isn't, that's the user's problem.
>
> Huh? I thought that people were regularly putting arbitrary text into
> ordinary strings, and that the whole purpose of this PEP was to extend
> that practice to Unicode.
>
> Are you going to deprecate the practice of putting KOI8-R into
> ordinary strings? This means that Cyrillic users have to stop doing
> that, change the string to Unicode, and apply codecs on IO. They
> aren't going to bother in phase 1, will have a rude surprise in phase
> 2. That's human nature, of course, but I don't see how it serves
> Python to risk it.
I wasn't clear on what you meant (see below).
I think this will actually work. Suppose someone uses KOI8-R.
Presumably they have an editor that reads, writes and displays
KOI8-R, and their default interpretation of Python's stdout will also
assume KOI8-R.
Thus, if their program contains
k = "...some KOI8-R string..."
print k
it will print what they want. If they write this:
u = unicode(k, "koi8-r")
it will also do what they want. Currently, if they write
u = u"...some KOI8-R string..."
it won't work, but with the PEP, in phase 1, it will do the right
thing as long as they add a KOI8-R cookie to the file. The treatment
of the 8-bit string assigned to k will not change in phase 1.
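For concreteness, the decode that unicode(k, "koi8-r") performed can be sketched in today's terms (Python 3, where the 8-bit string becomes bytes and str is Unicode; the Cyrillic sample text is my own illustration, not from the thread):

```python
# Sketch in Python 3 terms of the Python 2 pattern
#     k = "...some KOI8-R string..."; u = unicode(k, "koi8-r")
# i.e. an explicit bytes -> str decode with the KOI8-R codec.
k = "привет".encode("koi8-r")   # the 8-bit string as it sat in the file
u = k.decode("koi8-r")          # what unicode(k, "koi8-r") produced

assert u == "привет"
assert u.encode("koi8-r") == k  # the round-trip is lossless
```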
But the treatment of k under phase 2 will be, um, interesting, and I'm
not sure what it should do!!! Since in phase 2 the entire file will
be decoded from KOI8-R to Unicode before it's parsed, maybe the best
thing would be to encode 8-bit string literals back using KOI8-R (in
general, the encoding given in the encoding cookie).
*** MAL, can you think about this? ***
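A rough sketch of that phase-2 idea (Python 3 syntax; the toy source line and the naive literal extraction are illustrative assumptions, not a real parser): the whole file is decoded with the cookie's encoding, and an 8-bit string literal would get its bytes back by encoding the parsed text with that same codec.

```python
# Sketch of the proposed phase-2 behavior.  Assumptions: a one-line toy
# source file and a naive split() in place of real parsing.
cookie_encoding = "koi8-r"
source_bytes = 'k = "привет"\n'.encode(cookie_encoding)  # file on disk

decoded = source_bytes.decode(cookie_encoding)  # whole file -> Unicode
literal_text = decoded.split('"')[1]            # the parsed literal text

# Encode the 8-bit literal back using the cookie's encoding, as suggested:
literal_bytes = literal_text.encode(cookie_encoding)
assert literal_bytes == "привет".encode(cookie_encoding)
```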
> >> e. This causes problems for UTF-8 transition, since people will
> >> want to put arbitrary byte strings in a raw string.
>
> Guido> I'm not sure I understand. What do you call a raw string?
> Guido> Do you mean an r"" literal? Why would people want to use
> Guido> that for arbitrary binary data? Arbitrary binary data
> Guido> should *always* be encoded using \xDD hex or \OOO octal
> Guido> escapes.
>
> raw -> non-Unicode here. Incorrect usage, my apologies. "Arbitrary"
> was the wrong word too, I mean non-UTF-8. Eg, iso-8859-1 0xFF. I
> would have no problem with requiring people to use escapes to write
> non-English strings. But the whole point of this PEP is to allow
> people to write those in their native encodings (for Unicode strings).
> People are going to continue to squirt implicitly coded octet-strings
> at their terminals (which just happen to have an appropriate font
> installed<wink>) and expect it to work.
How about the solution I suggested above? Basically, the encoding
used for 8-bit string literals better match the encoding cookie used
for the source file, otherwise all bets are off. But this should
match common usage -- all people have to do is add the encoding cookie
to their file.
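As a sketch of how a tool would check that the cookie is there, PEP 263 recognizes the declaration on the first or second line with a regex of roughly the form coding[:=]\s*([-\w.]+); the helper below is my own illustrative wrapper around that idea, not code from the PEP.

```python
import re

# Cookie pattern in the spirit of PEP 263 (the helper itself is a sketch).
COOKIE = re.compile(r"coding[:=]\s*([-\w.]+)")

def source_encoding(lines, default="ascii"):
    """Return the encoding declared in the first two lines, else default."""
    for line in lines[:2]:
        if line.lstrip().startswith("#"):
            m = COOKIE.search(line)
            if m:
                return m.group(1)
    return default

print(source_encoding(["# -*- coding: koi8-r -*-", "k = '...'"]))  # koi8-r
```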
> AFAICT this interpretation of the PEP saves no pain, simply postpones
> it. Worse, people who don't understand it fully are going to believe
> it sanctions arbitrary encodings in string literals.
IMO, only one arbitrary encoding will be used per user -- his/her
favorite default -- and that's what they'll put in their encoding
cookie once we train them properly.
> I don't see how
> you can avoid widespread misunderstanding of that sort unless you have
> the parser refuse to execute the program---it may actually increase
> the pain when phase 2 starts.
>
> Guido> Sounds like a YAGNI to me.
>
> Could be. I'm sorry I can't be less fuzzy about the specific points.
> But then, that's the whole problem, really---we're trying to serve
> natural language usage which is inherently fuzzy.
>
> I see lots of potential problems in interpretation of this PEP by the
> people it's intended to serve: those who are attached to some native
> encoding. Better to raise each now, and have the scorn it deserves
> heaped high, than to say "we coulda guessed this would happen" later.
>
> If you think it's getting too abstract to be useful, I'll be quiet
> until I've got something more concrete. I'm hoping that the discussion
> seems useful despite the fuzz.
Same here. If you still think it's necessary, maybe you can try to
express exactly when you would want a program to be declared illegal
because of expected problems in phase 2?
--Guido van Rossum (home page: http://www.python.org/~guido/)