[Python-3000] PEP: Supporting Non-ASCII Identifiers

Sun Jun 3 20:30:01 CEST 2007

On 6/3/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> Sure - but how can Python tell whether a non-normalized string was
> intentionally put into the source, or as a side effect of the editor
> modifying it?

It can't, but does it really need to? It could always assume the latter.

> In most cases, it won't matter. If it does, it should be explicit
> in the code, e.g. by putting an n() function around the string
> literal.

That is almost true, but not quite. Consider these two hypothetical
files written by naive newbies:

data.py:

favorite_colors = {'Martin Löwis': 'blue'}

code.py:

import data

print data.favorite_colors['Martin Löwis']

Now if these are written by two different people using different
editors, one might be normalized in a different way than the other,
and the code would look all right but mysteriously fail to work.

Even more mysteriously, when the files are opened and saved
(possibly even automatically) by one of the people without any
changes, the code would then start to work. And magically break again
when the other person edits one of the files.
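
To make the failure concrete, here is a minimal sketch of the same
situation in a single script. It is written for Python 3 (an
assumption; the files above use Python 2 syntax), and lookup() is a
hypothetical helper, not something Python provides:

```python
import unicodedata

# The same visible name as two different code point sequences.
name_nfc = unicodedata.normalize('NFC', 'Martin L\u00f6wis')  # precomposed o-umlaut
name_nfd = unicodedata.normalize('NFD', 'Martin L\u00f6wis')  # 'o' + U+0308

# data.py as saved by an editor that writes NFC.
favorite_colors = {name_nfc: 'blue'}

# code.py as saved by an editor that writes NFD: the key is not found.
print(name_nfc == name_nfd)          # False
print(name_nfd in favorite_colors)   # False


def lookup(mapping, key):
    """Hypothetical helper: normalize the key to NFC before the lookup."""
    return mapping[unicodedata.normalize('NFC', key)]


print(lookup(favorite_colors, name_nfd))   # blue
```

Both strings render identically in most editors, which is exactly why
the breakage looks mysterious.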

The most important thing about normalization is that it should be
consistent for all internal strings. Similarly, when reading in a text
file, you really should normalize it first if you're going to
handle it as *text* rather than as binary data.

The most common normalization is NFC, because it causes the least
amount of surprise. E.g. in NFC, "Löwis"[2] results in "w", whereas
in NFD it results in u'\u0308' (COMBINING DIAERESIS), which most
naive users won't expect.
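
The indexing difference can be checked directly. This is a small
sketch for Python 3 (an assumption), using only the standard
unicodedata module:

```python
import unicodedata

nfc = unicodedata.normalize('NFC', 'L\u00f6wis')
nfd = unicodedata.normalize('NFD', 'L\u00f6wis')

# NFC keeps the o-umlaut as one code point, so index 2 is the 'w'.
print(len(nfc), repr(nfc[2]))               # 5 'w'

# NFD splits it into 'o' + U+0308, so index 2 is the combining mark.
print(len(nfd), unicodedata.name(nfd[2]))   # 6 COMBINING DIAERESIS
```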

> Also, there is still room for subtle issues, e.g. when concatenating
> two normalized strings will produce a string that isn't normalized.

Sure:

>>> from unicodedata import normalize as n
>>> a=n('NFD', u'ö'); n('NFC', a[0])+n('NFC', a[1:]) == n('NFC', a)
False

But a partial solution is better than no solution.

> Also, in many cases, strings come from IO, not from source, so if
> it is important that they are in NFC, you need to normalize anyway.

Indeed, and it would be best if this happened automatically, like
handling of line endings. It doesn't need to always work, just
most of the time.
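
As a rough sketch of what "automatic" could look like on the input
side, the reader below normalizes every line as it comes in. It
assumes Python 3, and read_lines_nfc() is a hypothetical name, not an
existing API; io.StringIO merely stands in for a real text file:

```python
import io
import unicodedata


def read_lines_nfc(stream):
    """Hypothetical helper: yield each line of a text stream in NFC."""
    for line in stream:
        yield unicodedata.normalize('NFC', line)


# Simulated input containing a decomposed o-umlaut ('o' + U+0308).
raw = io.StringIO('Martin Lo\u0308wis\n')
lines = list(read_lines_nfc(raw))

print(lines[0] == 'Martin L\u00f6wis\n')   # True
```

Code that needs the original, non-normalized data would still have to
read the file as binary.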

I haven't read the description of Python's syntax, but this happens
with Python 2.5:

test.py:

a = """
"""
print repr(a)

Output: '\n'

The line ending in the file is '\r\n', and Python normalizes it when
reading in the source code, even though normalizing '\r\n' matters
even less than doing NFC normalization.
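
The same experiment can be reproduced without an editor that writes
'\r\n'. The sketch below assumes Python 3, where compile() also
accepts Windows line endings in a source string; 'test.py' is only
used as a label for the compiled code:

```python
# Source text as a Windows editor would save it: the newline inside the
# triple-quoted string is the two characters '\r\n'.
source = 'a = """\r\n"""\r\n'

namespace = {}
exec(compile(source, 'test.py', 'exec'), namespace)

# The string literal ends up holding a single '\n'.
print(repr(namespace['a']))   # '\n'
```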