Just van Rossum wrote:
After quickly browsing through the unicode.org URLs I posted earlier, I reach the following (possibly wrong) conclusions:
here's another good paper that covers this, the universe, and everything:

    Character Model for the World Wide Web
    http://www.w3.org/TR/charmod

among many other things, it argues that normalization should be done at the source, and that it should be sufficient to do binary matching to tell if two strings are identical.

...

another very interesting thing from that paper is where they identify four layers of character support:

    Layer 1: Physical representation. This is necessary for APIs that expose a physical representation of string data. /.../ To avoid problems with duplicates, it is assumed that the data is normalized /.../

    Layer 2: Indexing based on abstract codepoints. /.../ This is the highest layer of abstraction that ensures interoperability with very low implementation effort. To avoid problems with duplicates, it is assumed that the data is normalized /.../

    Layer 3: Combining sequences, user-relevant. /.../ While we think that an exact definition of this layer should be possible, such a definition does not currently exist.

    Layer 4: Depending on language and operation. This layer is least suited for interoperability, but is necessary for certain operations, e.g. sorting.

until now, this discussion has focussed on the boundary between layer 1 and 2. that as many python strings as possible should be on the second layer has always been obvious to me ("a very low implementation effort" is exactly my style ;-), leaving the rest to the app.

...while Guido and MAL have argued that we should stay on level 1 (apparently because "we've already implemented it" is less effort than "let's change a little bit"). no wonder they never understand what I'm talking about...

it's also interesting to see that MAL's using layer 3 and 4 issues as an argument to keep Python's string support at layer 1. in contrast, the W3 paper thinks that normalization is a non-issue also on the layer 1 level. go figure.

...

btw, how about adopting this paper as the "Character Model for Python"? yes, I'm serious.

</F>

PS. here's my take on Just's normalization points:
- there is a script and language independent canonical form (but automatic normalization is indeed a bad idea)

- ideally, unicode comparisons should follow the rules from http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic for 1.6, if at all...)
note that the W3 paper recommends early normalization, and binary comparison (assuming the same internal representation of the unicode character codes, of course).
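fwiw, in today's Python the W3 recipe looks something like this (unicodedata.normalize is a much later addition, so this is a sketch of the idea, not of any 1.6 API):

    # early normalization: bring both strings to a canonical form once,
    # at the source; after that, a plain binary comparison is enough.
    import unicodedata

    a = "caf\u00e9"     # precomposed: U+00E9, e with acute
    b = "cafe\u0301"    # decomposed: "e" + U+0301 combining acute accent

    print(a == b)       # False -- binary comparison on the raw codepoints
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))   # True -- equal after normalization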
- this would indeed mean that it's possible for u == v even though type(u) is type(v) and len(u) != len(v). However, I don't see how this would collapse /F's world, as the two strings are at most semantically equivalent. Their physical difference is real, and still follows the a-string-is-a-sequence-of-characters rule (!).
yes, but on layer 3 instead of layer 2.
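to make that concrete (again with today's unicodedata module, illustration only): two canonically equivalent strings can have different lengths at the codepoint level, so treating them as equal is a layer 3 notion, not a layer 2 one:

    import unicodedata

    u = "\u00f6"      # one codepoint: precomposed o with diaeresis
    v = "o\u0308"     # two codepoints: "o" + combining diaeresis

    print(len(u), len(v))    # 1 2
    print(u == v)            # False: codepoint-for-codepoint comparison (layer 2)
    print(unicodedata.normalize("NFD", u) ==
          unicodedata.normalize("NFD", v))   # True: canonical equivalence (layer 3)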
- there may be additional customized language-specific sorting rules. I currently don't see how to implement that without some global variable.
layer 4.
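the standard locale module illustrates both the point and the problem: the language-specific collation rules exist, but you select them through process-wide state -- exactly the global variable Just is worried about. sketch in modern Python, and it assumes a Swedish locale is installed on the box:

    import locale

    words = ["zebra", "\u00e5r", "apple"]    # "år" sorts after "z" in Swedish

    # the collation rules are picked via global, per-process state:
    locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
    print(sorted(words, key=locale.strxfrm))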
- the sorting rules are very complicated, and should be implemented by calculating "sort keys". If I understood it correctly, these can take up to 4 bytes per character in their most compact form. Still, for this to be somewhat speed-efficient, they need to be cached...
layer 4.
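a cache along these lines would do; locale.strxfrm stands in for a real TR10 sort key here (modern Python, sketch only):

    import locale

    _key_cache = {}

    def sort_key(s):
        # compute the expensive sort key once per distinct string
        key = _key_cache.get(s)
        if key is None:
            key = _key_cache[s] = locale.strxfrm(s)
        return key

    def usort(strings):
        return sorted(strings, key=sort_key)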
- u.find() may need an alternative API, which returns a (begin, end) tuple, since the match may not have the same length as the search string... (This is tricky, since you need the begin and end indices in the non-canonical form...)
layer 3.
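a deliberately naive sketch of what such an API could return (modern Python, made-up name, quadratic and slow -- illustration only): it compares candidate slices of the original string against the needle under NFC, and hands back the (begin, end) indices into the non-canonical form:

    import unicodedata

    def _nfc(s):
        return unicodedata.normalize("NFC", s)

    def ufind(haystack, needle):
        # return (begin, end) such that haystack[begin:end] is canonically
        # equivalent to needle, or None if there is no match.
        target = _nfc(needle)
        for begin in range(len(haystack)):
            for end in range(begin + 1, len(haystack) + 1):
                if _nfc(haystack[begin:end]) == target:
                    return begin, end
        return None

    print(ufind("cafe\u0301 au lait", "caf\u00e9"))   # (0, 5): the match is 5 codepoints long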