[Python-3000] PEP: Supporting Non-ASCII Identifiers

Tue Jun 5 20:48:40 CEST 2007

On 6/5/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Jim Jewett schrieb:
> > On 6/5/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> > Always normalizing would have the advantage of simplicity (no
> >> > matter what the encoding, the result is the same), and I think
> >> > that is the real path of least surprise if you sum over all
> >> > surprises.

> >> I'd like to repeat that this is out of scope of this PEP, though.
> >> This PEP doesn't, and shouldn't, specify how string literals get
> >> from source to execution.

> > I see that as a gray area.

> Please read the PEP title again. What is unclear about
> "Supporting Non-ASCII Identifiers"?

That strings can also be used as identifiers.

> > Unicode does say pretty clearly that (at least) canonical equivalents
> > must be treated the same.

> Chapter and verse, please?

I am pretty sure this list is not exhaustive, but it may be helpful:

The Identifiers Annex http://www.unicode.org/reports/tr31/

"""
UAX31-C2. An implementation claiming conformance to Level 1 of this
specification shall describe which of the following it observes:

R1 Default Identifiers
R2 Alternative Identifiers
R3 Pattern_White_Space and Pattern_Syntax Characters
R4 Normalized Identifiers
R5 Case-Insensitive Identifiers
"""

I interpret this as "If we normalize the Identifiers, then we must
observe R4."  R4 lets us exclude individual characters from
normalization, but it says that two IDs with the same Normalization
Form are equivalent, unless they include specifically excluded
characters.

"""
R4 Normalized Identifiers

To meet this requirement, an implementation shall specify the
Normalization Form and shall provide a precise list of any characters
that are excluded from normalization. If the Normalization Form is
NFKC, the implementation shall apply the modifications in Section 5.1,
NFKC Modifications, given by the properties XID_Start and
XID_Continue. Except for identifiers containing excluded characters,
any two identifiers that have the same Normalization Form shall be
treated as equivalent by the implementation.
"""
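
To make my reading of R4 concrete, here is a minimal sketch using the
standard unicodedata module.  The helper name same_identifier is made up
for illustration, and the default of NFKC is only an assumption; R4 just
requires that whichever form is chosen be stated.

```python
import unicodedata


def same_identifier(a, b, form="NFKC"):
    """Hypothetical helper: compare two spellings under one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "caf\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
ligature = "\ufb01le"      # U+FB01 LATIN SMALL LIGATURE FI, then "le"

# The raw code point sequences differ ...
print(composed == decomposed)                         # False
# ... but they share a Normalization Form, so R4 treats them as one name.
print(same_identifier(composed, decomposed))          # True
# Compatibility equivalents only merge under the K forms.
print(same_identifier(ligature, "file"))              # True
print(same_identifier(ligature, "file", form="NFC"))  # False
```

The last two lines are why the choice of form matters: NFC and NFKC
disagree about which spellings count as the same identifier.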

Additional Support:

The Normalization Annex http://www.unicode.org/reports/tr15/ near the
end of section 1 (but before 1.1)

"""
Normalization Forms KC and KD must not be blindly applied to arbitrary text.
""" ... """
They can be applied more freely to domains with restricted character
sets, such as in Section 13, Programming Language Identifiers.
"""
(section 13 then forwards back to UAX31)

TR 15, section 19, numbered paragraph 3
"""
Higher-level processes that transform or compare strings, or that
perform other higher-level functions, must respect canonical
equivalence or problems will result.
"""

Looking at the main standard, I am falling back to Unicode 4.0, because
it is online at http://www.unicode.org/versions/Unicode4.0.0/

2.2 Equivalent Sequences
""" ...
If an application or user attempts to distinguish non-identical
sequences which are nonetheless considered to be equivalent sequences,
as shown in the examples in Figure 2-6, it would not be guaranteed
that other applications or users would recognize the same
distinctions.  To prevent introducing interoperability problems
between applications, such distinctions must be avoided wherever
possible.
"""
which is echoed in chapter 3 (conformance)
"""
C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.
...
Ideally, an implementation would always interpret two
canonical-equivalent character sequences identically. There are
practical circumstances under which implementations may reasonably
distinguish them.
"""
"""
C10 When a process purports not to modify the interpretation of a
valid coded character representation, it shall make no change to that
coded character representation other than the possible replacement of
character sequences by their canonical-equivalent sequences or the
deletion of noncharacter code points.
...
All processes and higher-level protocols are required to abide by C10
as a minimum.  However, higher-level protocols may define additional
equivalences that do not constitute modifications under that protocol.
For example, a higher-level protocol may allow a sequence of spaces to
be replaced by a single space.
"""
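
As a rough illustration of the kind of distinction C9 warns about, the
sketch below uses a plain dict as a stand-in for a namespace.  The names
namespace and normalized are mine, and NFC is only an assumed choice of
form; any of the canonical forms would collapse these three keys.

```python
import unicodedata

# Three canonically equivalent spellings of the same letter.
namespace = {}
namespace["\u00c5"] = 1   # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
namespace["A\u030a"] = 2  # 'A' followed by U+030A COMBINING RING ABOVE
namespace["\u212b"] = 3   # U+212B ANGSTROM SIGN

# A dict compares code points, so it keeps all three apart.
print(len(namespace))     # 3

# Normalizing the keys first leaves a single entry.
normalized = {}
for key, value in namespace.items():
    normalized[unicodedata.normalize("NFC", key)] = value
print(len(normalized))    # 1
```

Which value survives is an accident of insertion order, which is exactly
the sort of surprise I would rather not leave to each caller.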

> > In theory, this could be done only to identifiers, but then it needs
> > to be done inline for getattr.

> Why that? The caller of getattr would need to apply normalization in
> case the input isn't known to be normalized?

OK, I suppose that might work, if documented, but ... it seems like
another piece of boilerplate; when it isn't there, it won't be because
the input is known to be normalized so often as because the author
didn't think about normalization.
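
The boilerplate I have in mind looks roughly like this.  It is only a
sketch: normalized_getattr and Config are hypothetical names, and the NFC
default is an assumption, since the form would have to match whatever the
compiler applies to identifiers.

```python
import unicodedata


def normalized_getattr(obj, name, form="NFC"):
    """Hypothetical wrapper: normalize the name before the attribute lookup."""
    return getattr(obj, unicodedata.normalize(form, name))


class Config:
    pass


cfg = Config()
setattr(cfg, "caf\u00e9", "espresso")  # stored under the composed spelling

decomposed = "cafe\u0301"  # same name, typed with a combining accent

# Plain getattr/hasattr compare the raw strings, so the lookup misses.
print(hasattr(cfg, decomposed))             # False
# The caller has to remember to normalize first.
print(normalized_getattr(cfg, decomposed))  # espresso
```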

-jJ