[I18n-sig] PEP 263 and Japanese native encodings

Martin v. Loewis martin@v.loewis.de
06 Mar 2002 19:03:07 +0100


Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp> writes:

> I think
> these encodings are considered "ASCII compatible" in the sense
> you mention in the following paragraph in the "Concepts" section:
> 
>   Only ASCII compatible encodings are allowed as source code
>   encoding to assure that Python language elements other than
>   literals and comments remain readable by ASCII processing tools
>   and to avoid problems with wide characters encodings such as
>   UTF-16.

My original definition of "ASCII compatible" would have been

  "An encoding X is ASCII compatible iff a text that consists only of
   ASCII characters is byte-for-byte identical when encoded with X,
   compared to the same text encoded in ASCII"

Under this definition, iso-2022-jp would be ASCII compatible, but it
still is not acceptable under the implementation that I have in mind
for the patch.
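
As a rough illustration of this first definition, the sketch below
checks the byte-for-byte property with the codecs shipped in current
Python, which postdate this message. The helper name
is_ascii_preserving and the choice of sample text are assumptions made
for the example only.

```python
# Sample text: printable ASCII plus the usual line-ending/tab controls.
ASCII_SAMPLE = "".join(chr(i) for i in range(0x20, 0x7F)) + "\r\n\t"

def is_ascii_preserving(encoding):
    """True if pure-ASCII text encodes to the same bytes as in ASCII."""
    return ASCII_SAMPLE.encode(encoding) == ASCII_SAMPLE.encode("ascii")

for name in ("euc_jp", "shift_jis", "iso2022_jp", "utf_16"):
    print(name, is_ascii_preserving(name))
```

With these codecs, iso2022_jp passes this test just like euc_jp, while
utf_16 does not, which is why this definition alone is not enough for
the lexer.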

>   An ASCII compatible encoding (character set) is a superset of
>   the ASCII encoding (character set) in which octets from 0x00
>   to 0x7f are only used to represent ASCII characters and not
>   used in a series of bytes that represent a multibyte character
>   (such as Kanji and Hiragana).

Indeed, this is the definition which the reference implementation of
the PEP currently relies on.

> This definition is too restrictive IMHO, but anyway the term
> "ASCII compatible" is somewhat obscure and needs clarification
> since there are at least two interpretations.  

It would be possible to loosen this definition somewhat, defining
"ASCII string compatible" as follows:

  An ASCII string compatible encoding (character set) is a superset of
  the ASCII encoding (character set) in which octets from set AS are
  only used to represent ASCII characters and not used in a series of
  bytes that represent a multibyte character (such as Kanji and
  Hiragana). The set AS is defined as

  AS = [\r\n\\'"] (carriage return, linefeed, backslash, single/double quote)

The rationale here is that, under the PEP, non-ASCII text may only
appear in comments and strings. The lexer needs the ASCII-compatible
property to determine the end-of-line and end-of-string markers,
at least in the phase-1 implementation.
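
The "ASCII string compatible" test can be sketched for individual
characters as follows. This is only an illustration using the codecs in
current Python, not the reference implementation; the function name
as_bytes_in_encoded_char and the two sample characters are assumptions
chosen for the example.

```python
# AS as defined above: CR, LF, backslash, single quote, double quote.
AS = frozenset(b"\r\n\\'\"")

def as_bytes_in_encoded_char(char, encoding):
    """Return the sorted AS byte values found in the encoded form of char."""
    return sorted(set(char.encode(encoding)) & AS)

HIRAGANA_A = "\u3042"   # HIRAGANA LETTER A
KATAKANA_SO = "\u30bd"  # KATAKANA LETTER SO

for encoding in ("euc_jp", "shift_jis", "iso2022_jp"):
    for char in (HIRAGANA_A, KATAKANA_SO):
        hits = as_bytes_in_encoded_char(char, encoding)
        print(encoding, hex(ord(char)), [hex(b) for b in hits])
```

An empty list means the lexer would never see a byte from AS inside
that character. With these codecs, euc_jp yields empty lists for both
samples, whereas KATAKANA LETTER SO in shift_jis contains the byte 0x5C
and HIRAGANA LETTER A in iso2022_jp contains the byte 0x22.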

> o Are three Japanese native encodings EUC-JP, Shift_JIS and
>   ISO-2022-JP "ASCII compatible"?

EUC-JP certainly is; ISO-2022-JP probably isn't. I cannot see the
problem with Shift_JIS; I thought it uses only non-ASCII bytes for the
double-byte characters (and that this is precisely what the "shift" in
Shift_JIS refers to); see

http://www.io.com/~kazushi/encoding/sjis.html

If you are referring to the common interpretation that Shift_JIS uses
JIS X 0201-1976 for the first 128 bytes, I think we can take a relaxed
position here:

1. The only differences between JIS X 0201 and ISO 646 IRV (aka ASCII)
   are \x5C (YEN SIGN vs. REVERSE SOLIDUS) and \x7E (OVERLINE vs.
   TILDE).
2. \x7E is not in AS.
3. Backslash could cause a problem, if people insist on putting the Yen
   sign into a string literal. Even though this isn't strictly supported
   under PEP 263, people would get away with that most of the time.
4. I understand that Microsoft's interpretation of Shift_JIS actually
   is that \x5C *does* represent REVERSE SOLIDUS, and that only the
   fonts display something else.
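
For the backslash position discussed in the list above, a quick check
of how a decoder treats the single byte 0x5C might look like this. It
assumes that the cp932 codec in current Python reflects Microsoft's
mapping; decode_5c is a name invented for this example.

```python
def decode_5c(encoding):
    """Decode the single byte 0x5C and return the resulting character."""
    return b"\x5c".decode(encoding)

for encoding in ("ascii", "cp932"):
    print(encoding, "U+%04X" % ord(decode_5c(encoding)))
```

Both decoders return U+005C REVERSE SOLIDUS here, so any Yen sign seen
on screen would come from the font rather than from the decoded text.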

Regards,
Martin