[Python-3000] PEP: Supporting Non-ASCII Identifiers

Tue Jun 5 07:21:37 CEST 2007

On 6/4/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 6/4/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > However, what would that mean wrt. non-Unicode source encodings.
>
> > Say you have a Latin-1-encoded source code. Is that in NFC or not?

The path of least surprise for legacy encodings might be for
the codecs to produce whatever is closest to the original encoding
if possible. I.e. what was one code point would remain one code
point, and if that's not possible then normalize. I don't know if
this is any different from always normalizing (it certainly is
the same for Latin-1).
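
The Latin-1 claim is easy to check. Here is a minimal sketch, assuming
Python 3 and the standard unicodedata module (the variable names are
mine): it decodes every possible Latin-1 byte and looks for characters
that NFC would change.

```python
import unicodedata

# Decode all 256 possible Latin-1 bytes into one string.
latin1_text = bytes(range(256)).decode("latin-1")

# Collect any character that NFC normalization would alter.
changed = [ch for ch in latin1_text
           if unicodedata.normalize("NFC", ch) != ch]

# Also check the decoded text as a whole.
already_nfc = unicodedata.normalize("NFC", latin1_text) == latin1_text

print(changed, already_nfc)
```

If the list comes back empty, "keep one code point as one code point"
and "always normalize" give the same result for this encoding.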

Always normalizing would have the advantage of simplicity (no matter
what the encoding, the result is the same), and I think that is
the real path of least surprise if you sum over all surprises.

> FWIW, I would prefer "the parser will normalize" to "the parser will
> reject unnormalized", to support even the dumbest of editors.

Me too: a simple open-save in a dumb editor wouldn't change the
semantics of the code, and if a user makes edits expecting for some
reason that normalization is not done, the first trial run will
immediately disabuse them of this notion. The behavior is simple to
infer and reliable (at least for "always normalize").
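
To make the dumb-editor case concrete, here is a small sketch (Python 3,
standard unicodedata module; the identifier "café" and the helper nfc
are just illustrations of mine). An editor might save the same name in
either spelling, and only normalization makes the two compare equal.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' stored as one code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

def nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

# As raw strings the two spellings are different names ...
differ_raw = composed != decomposed
# ... but after normalization they are the same one.
same_after_nfc = nfc(composed) == nfc(decomposed)

print(differ_raw, same_after_nfc)
```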

FWIW, I looked at what Java and XML 1.1 do, and they *don't* normalize
for some reason. Java doesn't even normalize identifiers AFAICS; it's
not even mentioned at
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html
and Java processes escapes very early (those should certainly not
be normalized, as escapes are the Word of Programmer and meddling
with them will incur holy wrath).
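
Here is a rough sketch of why the ordering matters, assuming the source
text is normalized *before* escapes are processed. It is Python 3, and
evaluate_literal is a hypothetical helper of mine that uses
ast.literal_eval as a stand-in for the real parser. An escape written
out in ASCII passes through NFC untouched, while a raw combining
sequence in the file gets composed.

```python
import ast
import unicodedata

def evaluate_literal(source_text):
    """Normalize source text to NFC, then evaluate it as a literal.

    Hypothetical stand-in for "normalize the file, then parse it".
    """
    normalized = unicodedata.normalize("NFC", source_text)
    return ast.literal_eval(normalized)

# The programmer typed a backslash escape: the source is pure ASCII,
# so NFC leaves it alone and the string keeps both code points.
from_escape = evaluate_literal('"e\\u0301"')

# The file contains a raw 'e' plus combining accent: NFC composes it.
from_raw = evaluate_literal('"e\u0301"')

print(len(from_escape), len(from_raw))
```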

XML 1.1 says this:

:XML processors MUST NOT transform the input to be in fully normalized
:form. XML applications that create XML 1.1 output from either XML 1.1
:or XML 1.0 input SHOULD ensure that the output is fully normalized;
:it is not necessary for internal processing forms to be fully
:normalized.
:
:The purpose of this section is to strongly encourage XML processors
:to ensure that the creators of XML documents have properly normalized
:them, so that XML applications can make tests such as identity
:comparisons of strings without having to worry about the different
:possible "spellings" of strings which Unicode allows.
:
:When entities are in a non-Unicode encoding, if the processor
:transcodes them to Unicode, it SHOULD use a normalizing transcoder.

I do not know why they've done this, but XML 1.0 does not mention
normalization at all, so perhaps they felt normalization would be
too big a change. Some random comments I read mentioned that XML 1.1
is supposed to be independent of changes to Unicode, whereas
normalization may change for new code points in new versions; others
said that the unavailability of normalizers to implementors would be
a reason. Verification is specified in XML 1.1, though:

:However, a document is still well-formed even if it is not fully
:normalized. XML processors SHOULD provide a user option to verify
:that the document being processed is in fully normalized form, and
:report to the application whether it is or not. The option to not
:verify SHOULD be chosen only when the input text is certified, as
:defined by B Definitions for Character Normalization.

Note that all this applies after character entity (=escape)
replacement, and applies also to what passes for "identifiers"
in XML documents.

I still think simply always normalizing the whole source code file
to NFC before any processing would be the right thing to do :-)
I'm not sure about processing of text files in Python code; it's
certainly easy to do the normalization yourself. Still, normalization
is probably what's wanted in most cases where line separators are
normalized.
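
For text files, doing it yourself could look something like the sketch
below. It is written for Python 3; read_text_nfc is a made-up helper
name, not an existing API. It pairs line-separator translation with NFC
normalization in one read.

```python
import io
import unicodedata

def read_text_nfc(binary_stream, encoding="utf-8"):
    """Decode a byte stream, translate line separators, return NFC text."""
    # newline=None turns on universal-newline translation when reading.
    reader = io.TextIOWrapper(binary_stream, encoding=encoding, newline=None)
    return unicodedata.normalize("NFC", reader.read())

# Two decomposed words with CRLF line endings, as a file might contain.
sample = io.BytesIO("cafe\u0301\r\nnai\u0308ve\r\n".encode("utf-8"))
result = read_text_nfc(sample)

print(ascii(result))
```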