[Python-Dev] Unicode comparisons & normalization

Wed, 3 May 2000 11:02:09 +0200

Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:

here's another good paper that covers this, the universe, and everything:

    Character Model for the World Wide Web
    http://www.w3.org/TR/charmod

among many other things, it argues that normalization should be done at
the source, and that it should be sufficient to do binary matching to
tell if two strings are identical.
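
a minimal sketch of what "normalize at the source, then compare binary" could look like. it assumes a modern Python where unicodedata.normalize() is available (nothing like that is on the table for 1.6), and the helper name normalize_at_source is made up for this illustration:

```python
import unicodedata

def normalize_at_source(text, form="NFC"):
    """Normalize once, at the point where text enters the program.

    Hypothetical helper: the name and the choice of NFC as the
    default form are assumptions made for this sketch.
    """
    return unicodedata.normalize(form, text)

# two spellings of the same word
precomposed = "caf\u00e9"    # e-acute as a single code point
decomposed = "cafe\u0301"    # plain e followed by a combining acute accent

# binary comparison of unnormalized data sees two different strings
raw_equal = precomposed == decomposed

# once both sides were normalized at the source, binary comparison is enough
clean_equal = normalize_at_source(precomposed) == normalize_at_source(decomposed)
```

the point is that the comparison itself stays a plain ==; all the work happens once, where the data comes in.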

...

another very interesting thing from that paper is where they identify
four layers of character support:

    Layer 1: Physical representation. This is necessary for
    APIs that expose a physical representation of string data.
    /.../ To avoid problems with duplicates, it is assumed that
    the data is normalized /.../

    Layer 2: Indexing based on abstract codepoints. /.../ This
    is the highest layer of abstraction that ensures interoperability
    with very low implementation effort. To avoid problems with
    duplicates, it is assumed that the data is normalized /.../

    Layer 3: Combining sequences, user-relevant. /.../ While we
    think that an exact definition of this layer should be possible,
    such a definition does not currently exist.

    Layer 4: Depending on language and operation. This layer is
    least suited for interoperability, but is necessary for certain
    operations, e.g. sorting.

until now, this discussion has focussed on the boundary between
layer 1 and 2.
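
to make the layer 1 / layer 2 boundary concrete, here's a small sketch. it assumes a Python where str is indexed by code point and bytes holds the physical representation, which is an assumption about a later Python and not what 1.6 does:

```python
# "naive" with a diaeresis, a space, and a euro sign: seven code points
text = "na\u00efve \u20ac"

# layer 1: physical representation. sizes and offsets depend on the encoding.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# layer 2: indexing by abstract code point, independent of any encoding.
code_point_count = len(text)
third_character = text[2]     # the i-diaeresis, whatever the encoding

# the same index on layer 1 lands in the middle of a multi-byte sequence
third_utf8_byte = utf8[2]     # first byte of the two-byte UTF-8 sequence
```

on layer 1 the "length" of the string is 10 or 14 depending on the encoding; on layer 2 it is 7 either way.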

that as many Python strings as possible should be on the second
layer has always been obvious to me ("a very low implementation
effort" is exactly my style ;-), leaving the rest to the app.

...while Guido and MAL have argued that we should stay on layer 1
(apparently because "we've already implemented it" is less effort
than "let's change a little bit")

no wonder they never understand what I'm talking about...

it's also interesting to see that MAL's using layer 3 and 4 issues as an
argument to keep Python's string support at layer 1.  in contrast, the
W3 paper thinks that normalization is a non-issue also on the layer 1
level.  go figure.

...

btw, how about adopting this paper as the "Character Model for Python"?

yes, I'm serious.

</F>

PS. here's my take on Just's normalization points:

> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

note that the W3 paper recommends early normalization, and binary
comparison (assuming the same internal representation of the
unicode character codes, of course).

> - this would indeed mean that it's possible for u == v even though type(u)
> is type(v) and len(u) != len(v). However, I don't see how this would
> collapse /F's world, as the two strings are at most semantically
> equivalent. Their physical difference is real, and still follows the
> a-string-is-a-sequence-of-characters rule (!).

yes, but on layer 3 instead of layer 2.
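
a rough sketch of the difference: on layer 2 the two strings have different lengths, on layer 3 they are both one user-relevant character. the helper combining_sequences is hypothetical and only glues combining marks onto the preceding base character; it is not a complete definition of layer 3 (the paper itself says no exact definition exists yet), and it assumes unicodedata.combining() is available:

```python
import unicodedata

def combining_sequences(text):
    """Split text into base characters plus their trailing combining marks.

    Hypothetical helper for illustration only: a simplification,
    not a full segmentation algorithm.
    """
    sequences = []
    for ch in text:
        if sequences and unicodedata.combining(ch):
            # nonzero combining class: attach to the previous sequence
            sequences[-1] += ch
        else:
            sequences.append(ch)
    return sequences

u = "\u00e9"      # one code point
v = "e\u0301"     # two code points, same user-relevant character

layer2_lengths = (len(u), len(v))
layer3_lengths = (len(combining_sequences(u)), len(combining_sequences(v)))
```

so "a string is a sequence of characters" holds on both layers, but what counts as a character is different.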

> - there may be additional customized language-specific sorting rules. I
> currently don't see how to implement that without some global variable.
layer 4.

> - the sorting rules are very complicated, and should be implemented by
> calculating "sort keys". If I understood it correctly, these can take up to
> 4 bytes per character in its most compact form. Still, for it to be
> somewhat speed-efficient, they need to be cached...

layer 4.
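
to show the shape of the "cached sort key" idea (and why it belongs in the app, not in the string type), here's a toy sketch. it is emphatically not the TR10 algorithm: the key just drops combining marks and case, with the original string as a tie-breaker. toy_sort_key is a made-up name, and functools.lru_cache and str.casefold() are assumed to be available:

```python
import unicodedata
from functools import lru_cache

@lru_cache(maxsize=None)
def toy_sort_key(text):
    """Return a cached, comparable key for text.

    Toy stand-in for a real collation key: strips combining marks
    and case differences. Not the TR10 collation algorithm.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # the original string breaks ties so the ordering stays deterministic
    return (base.casefold(), text)

words = ["zebra", "\u00c9clair", "apple", "eclair"]

by_code_point = sorted(words)               # plain layer 2 ordering
by_key = sorted(words, key=toy_sort_key)    # key-based ordering
```

plain code point order puts the accented word after "zebra"; the key-based sort puts it next to "eclair". a real implementation would pick the key rules per language, which is exactly what makes this layer 4.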

> - u.find() may need an alternative API, which returns a (begin, end) tuple,
> since the match may not have the same length as the search string... (This
> is tricky, since you need the begin and end indices in the non-canonical
> form...)

layer 3.
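
for what it's worth, a brute-force sketch of such a (begin, end) search. find_span is a hypothetical function, not a proposed API; it assumes unicodedata.normalize() and unicodedata.combining() exist, and it simply tries every slice of the unnormalized haystack, so it is quadratic and only meant to show why the indices differ:

```python
import unicodedata

def find_span(haystack, needle, form="NFC"):
    """Find needle in haystack under canonical equivalence.

    Returns (begin, end) indices into the *original* haystack,
    or None if there is no match. Hypothetical sketch, O(n**2).
    """
    target = unicodedata.normalize(form, needle)
    if not target:
        return (0, 0)
    n = len(haystack)
    for begin in range(n):
        # never start a match on a combining mark
        if unicodedata.combining(haystack[begin]):
            continue
        for end in range(begin + 1, n + 1):
            # never end a match just before a dangling combining mark
            if end < n and unicodedata.combining(haystack[end]):
                continue
            if unicodedata.normalize(form, haystack[begin:end]) == target:
                return (begin, end)
    return None

haystack = "cafe\u0301 au lait"   # decomposed e-acute: 5 code points for the word
needle = "caf\u00e9"              # precomposed: 4 code points

span = find_span(haystack, needle)
```

the match covers five code points in the haystack although the search string has four, which is why a single start index isn't enough.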