[Python-Dev] PEP 263 considered faulty (for some Japanese)

Paul Prescod paul@prescod.net
Mon, 18 Mar 2002 07:53:10 -0800


"Stephen J. Turnbull" wrote:
> 
>...
> 
> I don't see any need for a deviation of the implementation from the
> spec.  Just slurp in the whole file in the specified encoding. 

That's phase 2. It's harder to implement so it won't be in Python 2.3.
They are trying to get away with changing the *output* of the
lexer/parser rather than the *input* because the lexer/parser code
probably predates Unicode and certainly predates Guido's thinking about
internationalization issues. We're moving in baby steps.
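For the record, the "phase 2" behaviour being discussed amounts to: sniff the PEP 263 coding cookie from the first two lines, then decode the entire file before tokenizing. A minimal sketch in modern Python (the function name `read_source` is mine, not anything in the interpreter; the `ascii` default reflects what PEP 263 specifies for files without a declaration):

```python
import re

# PEP 263 specifies this pattern for the coding declaration, e.g.
# "# -*- coding: latin-1 -*-" on the first or second line.
COOKIE_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def read_source(path):
    """Decode a whole Python source file per its PEP 263 cookie."""
    with open(path, "rb") as f:
        raw = f.read()
    encoding = "ascii"  # PEP 263 default when no cookie is present
    for line in raw.splitlines()[:2]:
        m = COOKIE_RE.search(line)
        if m:
            encoding = m.group(1).decode("ascii")
            break
    return raw.decode(encoding)
```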

> ... Then
> cast the Unicode characters in ordinary literal strings down to
> bytesize (my preference, probably with errors on Latin-1<0.5 wink>) or
> reencode them (Guido's and your suggestion).  People who don't like
> the results in their non-Unicode literal strings (probably few) should
> use hex escapes.  Sure, you'll have to rewrite the parser in terms of
> UTF-16.  But I thought that was where you were going anyway.

Sure, but a partial implementation now is better than a perfect
implementation at some unspecified time in the future.

> If not, it should be nearly trivial to rewrite the parser in terms of
> UTF-8 (since it is a superset of ASCII and non-ASCII is currently only
> allowed in comments or guarded by a (Unicode)? string literal AFAIK).
> The main issue would be anything that involves counting characters
> (not bytes!), I think.  

That would be an issue. Plus it would be the first place that the Python
interpreter used UTF-8 as an internal representation. So it, too, would
be a half-step, and one that might require more rework later.
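The character-versus-byte counting problem is easy to demonstrate: in
UTF-8, any non-ASCII character occupies more than one byte, so a parser
that counts bytes would get column offsets wrong. A quick illustration:

```python
# One accented character, two UTF-8 bytes: byte offsets and character
# offsets diverge as soon as non-ASCII text appears.
text = "na\u00efve"          # 'naive' with a diaeresis: 5 characters
data = text.encode("utf-8")  # the diaeresis character takes 2 bytes
assert len(text) == 5
assert len(data) == 6
```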

 Paul Prescod