[Python-3000] String comparison

Rauli Ruohonen rauli.ruohonen at gmail.com
Wed Jun 6 20:18:53 CEST 2007


On 6/6/07, Guido van Rossum <guido at python.org> wrote:
> Why should the lexer apply normalization to literals behind my back?

The lexer shouldn't, but NFC-normalizing the source before the lexer
sees it would be slightly more robust and standards-compliant. This is
because the Unicode standard technically allows an editor, or any other
program, to apply any normalization or other canonically equivalent
replacement it sees fit, and other programs aren't supposed to care.
The standard even says that such differences should render
indistinguishably. In practice, though, nearly everyone uses NFC.
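To illustrate the point (a minimal sketch, not from the original post): two
canonically equivalent spellings of "é" display identically but are different
code-point sequences, so they compare unequal until normalized.

```python
import unicodedata

# "é" written two canonically equivalent ways: one precomposed code
# point (the NFC form) versus "e" plus a combining acute accent (NFD).
nfc = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"   # "e" + U+0301 COMBINING ACUTE ACCENT

# As code-point sequences they are unequal, even though a conforming
# renderer displays them identically.
assert nfc != nfd
assert len(nfc) == 1 and len(nfd) == 2

# NFC normalization composes the pair, making the strings compare equal.
assert unicodedata.normalize("NFC", nfd) == nfc
```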

> There's a simpler solution. The unicode (or str, in Py3k) data type
> represents a sequence of code points, not a sequence of characters.
> This has always been the case, and will continue to be the case.

This is how Java and ICU (http://www.icu-project.org/) do it, too.
The latter is a library specifically designed for processing Unicode
text. Both Java and ICU are even mentioned in the Unicode FAQ.

> Clearly we will have a normalization routine so the lexer can
> normalize identifiers, so if you need normalized data it is
> as simple as writing 'XXX'.normalize() (or whatever the spelling
> should be).

The routine currently lives at unicodedata.normalize.
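A short usage sketch (my example, not from the post): the module-level
function takes the normalization form name first, so normalizing source text
before lexing, as discussed above, would look like this.

```python
import unicodedata

# Source text containing an identifier written in decomposed (NFD) form.
source = "e\u0301tat = 1"

# Normalize the whole text to NFC before handing it to the lexer.
normalized = unicodedata.normalize("NFC", source)

assert normalized == "\u00e9tat = 1"
# The combining mark was composed into a single code point.
assert len(normalized) == len(source) - 1
```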
