[Python-Dev] PEP 263 considered faulty (for some Japanese

Stephen J. Turnbull stephen@xemacs.org
21 Mar 2002 21:10:21 +0900


>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes:

    Martin> If the PEP would say

    Martin> [A Python program is a sequence of Unicode characters]

    Martin> it would not change a bit, in my view. Why do you perceive
    Martin> a difference?

Because the PEP currently specifies an implementation in which it is
not, in fact, true that Python handles Unicode character data
throughout.  The PEP says:

       4. compile it, creating Unicode objects from the given Unicode data
          and creating string objects from the Unicode literal data
          by first reencoding the Unicode data into 8-bit string data
          using the given file encoding

       5. variable names and other identifiers will be reencoded into
          8-bit strings using the file encoding to assure backward
          compatibility with the existing implementation

Thus Python deals internally with at least two encodings (Unicode and
external) and possibly three (since identifiers are known to be ASCII,
differences in handling strings and identifiers might arise).
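The two-encodings point can be sketched in code.  This is a
hypothetical simulation of steps 4 and 5 in modern Python, not the
PEP's actual implementation; the file encoding and the literal values
are invented for illustration:

```python
# Hypothetical simulation of PEP 263 compilation steps 4 and 5;
# the encoding and literals below are invented for illustration.

FILE_ENCODING = "iso-8859-1"   # as if declared by a coding cookie

# Bytes of a source file on disk, in the declared encoding.
raw_source = b'u = u"caf\xe9"\ns = "caf\xe9"\n'

# First the whole file is decoded into Unicode with the file encoding.
text = raw_source.decode(FILE_ENCODING)
assert "caf\xe9" in text       # the literal is now characters

# Step 4: a Unicode literal keeps the decoded characters ...
unicode_literal = "caf\xe9"

# ... while an ordinary string literal is re-encoded back into
# 8-bit data using the file encoding, so it holds bytes again.
plain_literal = unicode_literal.encode(FILE_ENCODING)
assert plain_literal == b"caf\xe9"
```

So the same source-level literal ends up as characters in one case and
as externally encoded octets in the other, which is the split the
paragraph above objects to.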

Re: hooks.

    Martin> which I understood to deliberately not deal at all with
    Martin> encodings - doing so would be the user's task.

True, the hooks idea would be a general user-specified preprocessor
invoked on code.  Leaving the codecs to the user is not, however,
intended to leave that user helpless and without guidance.  Of course
a library of useful codecs would (presumably) be provided, including a
meta-codec to implement coding cookie recognition.  If the library
were half-decent, users wouldn't think of doing anything but accepting
the implicit policy of the codec library.
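Such a meta-codec is not hard to imagine.  Here is a toy sketch of the
coding-cookie recognition step, built around the cookie regex given in
PEP 263; the function name is invented, and the real tokenizer also
handles BOMs and restricts where the cookie may appear:

```python
import re

# Cookie-matching regex from PEP 263.
COOKIE_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def detect_encoding(source: bytes) -> str:
    """Return the declared source encoding, or 'ascii' if none.

    A toy meta-codec in the spirit of the cookie recognition
    discussed above, not the actual implementation.
    """
    # The cookie must appear in a comment on line 1 or 2.
    for line in source.splitlines()[:2]:
        if line.strip().startswith(b"#"):
            m = COOKIE_RE.search(line.decode("ascii", "replace"))
            if m:
                return m.group(1)
    return "ascii"
```

A user-supplied preprocessor could then decode the rest of the file
with whatever `detect_encoding` returns, which is exactly the implicit
policy referred to above.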

    Martin> XML is completely different in this respect.

Yes.  It should be clear that I'd be happy to adopt the XML approach.

The PEP itself and the discussion also make it clear that that's not
acceptable, because ordinary literals don't contain (Unicode)
characters.  Using a character-based notation (e.g., hex escapes) for
_all_ literal octets is unacceptable from the point of view of
backward compatibility and of convenience in a language one of whose
applications is scripting.  XML doesn't have that problem.

    Martin> I can find equivalents of all this in PEP 263. For
    Martin> example, it is a fatal error (in phase 2) if a Python
    Martin> source file contains no encoding declaration and its
    Martin> content is not legal ASCII.

I can see where the analogous language has been inserted.  However,
that doesn't mean the PEP provides the same kind of coherent semantics
that the XML spec does.  In fact, it does not.

    Martin> "parsed" in the context of XML means that the entity has
    Martin> markup, and thus follows the production extParsedEnt (for
    Martin> example). The production rules always refer to characters,
    Martin> which are obtained from converting the input file into
    Martin> Unicode, according to the declared encoding.

In other words, quite different from PEP 263, which specifies three
kinds of objects where XML has characters: characters (Unicode),
identifier constituents (ASCII), and ordinary string constituents
(bytes in an external encoding).


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
              Don't ask how you can "do" free software business;
              ask what your business can "do for" free software.