Further changes to source encodings (Was: PEP 263 status check)

Sat Aug 7 01:03:51 EDT 2004

John Roth wrote:
> I don't believe I ever said that PEP 263 said there was
> a difference. If I gave you that impression, I will
> apologize if you can show me where I did it.

In <10h5hgvpafm8a64 at news.supernews.com>, titled
"PEP 263 status check", you wrote

[quote]
My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error?
[end quote]

So I assumed that you had been talking all along about how
this is implemented and how you expected it to be implemented,
and I assumed we agreed that the implementation should
match the specification in PEP 263.

> As far as I'm concerned, what PEP 263 says is utterly
> irrelevant to the point I'm trying to make.

Then I don't know what the point is you are trying to
make. It appears that you are now saying that Python
does not work the way it should work. IOW, you are
proposing that it be changed, right? This sounds like
another PEP.

> The only connection PEP 263 has to the entire thread
> (at least from my view) is that I wanted to check on
> whether phase 2, as described in the PEP, was
> scheduled for 2.4. I was under the impression it was
> and was puzzled by not seeing it. You said it wouldn't
> be in 2.4. Question answered, no further issue on
> that point (but see below for an additional puzzlement.)

Ok. A change of subject might have helped.

> 8-bit strings have a builtin assumption that one
> byte equals one character. 

Not at all. Some 8-bit strings don't denote characters
at all, and some 8-bit strings, at least in some regions
of the world, deliberately use multi-byte character
encodings. In particular, UTF-8 is such an encoding.
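
To illustrate the point, here is a minimal sketch. It assumes Python 3,
where the bytes type plays the role of the 8-bit string discussed here;
the variable names are only for illustration. With a multi-byte
encoding such as UTF-8, the number of bytes and the number of
characters differ:

```python
# Sketch for Python 3: bytes stands in for the 2.x 8-bit string type.
text = "Grüß Gott"
data = text.encode("utf-8")   # ü and ß each take two bytes in UTF-8

char_count = len(text)        # number of characters in the text
byte_count = len(data)        # number of bytes in the encoded string
print(char_count, byte_count)
```

So "one byte equals one character" does not hold for this data, even
though it is a perfectly ordinary byte string.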

> It's a basic assumption
> in the string module, the string methods and all through
> just about everything, and it's something that most
> programmers expect, and IMO have every right
> to expect.

Not at all. Most string methods don't assume anything
about characters. Instead, they assume that the building
block of a byte string is a "byte", and operate on those.
Only some methods of the string objects assume that the
bytes denote characters; they typically assume that the
current locale provides the definition of the character
set.
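
A small sketch of what "operating on bytes" means, again assuming
Python 3 bytes as a stand-in for the 8-bit string type (names are
illustrative). Note one difference from the behaviour described above:
in Python 3 the few character-aware bytes methods assume ASCII rather
than the current locale.

```python
# Sketch for Python 3: byte-oriented string methods on UTF-8 data.
data = "Grüß Gott".encode("utf-8")

parts = data.split(b" ")   # splits on the byte 0x20, knows nothing of characters
pos = data.find(b"Gott")   # a byte offset, not a character offset
upper = data.upper()       # only ASCII letters change; other bytes pass through
print(parts, pos, upper)
```

The offset returned by find() counts bytes, so it is larger than the
character position of "Gott" in the original text.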

> Now, people violate this assumption all the time,
> for a number of reasons, including binary data and
> encoded data (including utf-8 encodings)
> but they do so deliberately, knowing what they're
> doing. These particular exceptions don't negate the
> rule.

Not at all. These usages are deliberate, equally legitimate
applications of the string type. In Python, the string
type really is meant for binary data (unlike, say, C,
which has issues with NUL bytes).
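
As a sketch of the difference from C (Python 3 bytes standing in for
the 8-bit string; the names are made up for the example), a NUL byte is
just another element and does not terminate the string:

```python
# Sketch for Python 3: NUL bytes are ordinary data in a byte string.
blob = b"abc\x00def"

length = len(blob)            # the NUL byte is counted like any other byte
nul_at = blob.index(b"\x00")  # position of the embedded NUL
tail = blob[nul_at + 1:]      # data after the NUL is still there
print(length, nul_at, tail)
```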

> The problem I have is that if you use utf-8 as the
> source encoding, you can suddenly drop multi-byte
> characters into an 8-bit string ***BY ACCIDENT***.

Ok.

> (I don't know what happens with far Eastern multi-byte
> encodings.)

The same issues as UTF-8, plus some additional worse issues.

> Now, my suggested solution of this problem was
> to require that 8-bit string literals in source that was
> encoded with UTF-8 be restricted to the 7-bit
> ascii subset.

Ok. I disagree that this is desirable; if you really
want to see that happen, you should write a PEP.

> The second possibility begs the question of what
> encoding to use, which is why I don't seriously
> propose it (although if I understand Hallvard's
> position correctly, that's essentially his proposal.)

No. He proposes your third alternative (ban non-ASCII
characters in byte string literals), not just for UTF-8,
but for all encodings. Not for all files, though, but
only for selected files.

>>If
>>there is no encoding declaration whatsoever, Python will
>>assume that the source is us-ascii.
[...]
> The last sentence puzzles me. In 2.3, absent a declaration
> (and absent a parameter on the interpreter) Python assumes
> that the source is Latin-1, and phase 2 was to change
> this to the 7-bit ascii subset (US-Ascii). That was the
> original question at the start of this thread. I had assumed
> that change was to go into 2.4, your reply made it seem
> that it would go into 2.5 (maybe.) This statement makes
> it seem that it is the current state in 2.3.

With "will assume", I actually meant future tense. Not
being a native speaker, I'm uncertain how to distinguish
this from the conditional form that you apparently understood.

> Specifically, what would the Python 2.2 interpreter
> have done if I handed it a program encoded in utf-8?
> Was that a legitimate encoding? 

Yes, the Python interpreter would have processed it.

print "Grüß Gott"

would have sent the greeting to the terminal.

> I don't know whether
> it was or not. Clearly it wouldn't have been possible
> before the unicode support in 2.0.

Why do you think so? The above print statement has worked
since Python 1.0 or so. Before PEP 263, Python was unaware
of source encodings, and would literally copy the bytes
from the source code file into the string object - whether
they were latin-1, UTF-8, or some other encoding. The
only requirement was that the encoding needs to be an
ASCII superset, so that Python properly detects the end
of the string.
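
The ASCII-superset requirement can be sketched as follows. This is not
the real tokenizer; naive_string_end is a hypothetical helper that, like
an encoding-unaware parser, simply looks for the next quote byte. The
sketch assumes Python 3.

```python
# Sketch: why an encoding-unaware parser needs an ASCII superset.
QUOTE = 0x22  # ASCII code of the double quote character


def naive_string_end(raw: bytes) -> int:
    """Return the index of the first quote byte after the opening quote.

    Hypothetical helper for illustration only.
    """
    assert raw[0] == QUOTE
    return raw.index(QUOTE, 1)


literal = '"Grüß Gott"'
utf8 = literal.encode("utf-8")
latin1 = literal.encode("latin-1")
end_utf8 = naive_string_end(utf8)      # finds the real closing quote
end_latin1 = naive_string_end(latin1)  # finds the real closing quote

# UTF-16 is not an ASCII superset: U+2222 encodes to the bytes 0x22 0x22,
# so the scan stops inside the character instead of at the closing quote.
utf16 = '"\u2222"'.encode("utf-16-le")
end_utf16 = naive_string_end(utf16)
print(end_utf8, end_latin1, end_utf16)
```

For UTF-8 and Latin-1 the scan ends on the last byte of the literal; for
the UTF-16 data it stops early, which is the kind of misdetection the
requirement rules out.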

Regards,
Martin