[Python-Dev] Unicode comparisons & normalization
Fredrik Lundh <effbot@telia.com>
Wed, 3 May 2000 11:02:09 +0200
Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:
here's another good paper that covers this, the universe, and everything:
Character Model for the World Wide Web
http://www.w3.org/TR/charmod
among many other things, it argues that normalization should be done at
the source, and that it should be sufficient to do binary matching to tell
if two strings are identical.
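a minimal sketch of "normalize at the source, then compare binary", assuming a Python whose unicodedata module provides normalize(). the helper name normalize_at_source and the choice of NFC are my own assumptions for illustration, not something the paper or this thread prescribes:

```python
import unicodedata

def normalize_at_source(text, form="NFC"):
    """Hypothetical helper: normalize once, where data enters the program."""
    return unicodedata.normalize(form, text)

composed = "caf\u00e9"       # 'e with acute' as a single code point
decomposed = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# Without normalization, binary comparison sees two different strings.
raw_equal = composed == decomposed

# With early normalization, a plain == is sufficient.
a = normalize_at_source(composed)
b = normalize_at_source(decomposed)
normalized_equal = a == b
```

the point is that the comparison itself stays a dumb code-point-by-code-point match; all the cleverness happens once, at the boundary.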
...
another very interesting thing from that paper is where they identify four
layers of character support:
Layer 1: Physical representation. This is necessary for
APIs that expose a physical representation of string data.
/.../ To avoid problems with duplicates, it is assumed that
the data is normalized /.../
Layer 2: Indexing based on abstract codepoints. /.../ This
is the highest layer of abstraction that ensures interopera-
bility with very low implementation effort. To avoid problems
with duplicates, it is assumed that the data is normalized /.../
Layer 3: Combining sequences, user-relevant. /.../ While we
think that an exact definition of this layer should be possible,
such a definition does not currently exist.
Layer 4: Depending on language and operation. This layer is
least suited for interoperability, but is necessary for certain
operations, e.g. sorting.
until now, this discussion has focussed on the boundary between
layer 1 and 2.
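to make the layer 1 / layer 2 boundary concrete, here is a small sketch using a Python where str is a sequence of code points. the sample string and the two encodings are arbitrary choices of mine, picked only for illustration:

```python
text = "na\u00efve"                 # five abstract code points

# Layer 1: physical representation, which depends on the encoding chosen.
utf8 = text.encode("utf-8")         # the i-with-diaeresis takes two bytes here
utf16 = text.encode("utf-16-le")    # two bytes per code point for this text

# Layer 2: indexing by abstract code point, independent of the encoding.
code_points = len(text)
third = text[2]
```

on layer 1 the "length" of the same text is 6 or 10 depending on the encoding; on layer 2 it is 5 either way, and text[2] is always the same character.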
that as many python strings as possible should be on the second
layer has always been obvious to me ("a very low implementation
effort" is exactly my style ;-), with the rest left to the app.
...while Guido and MAL have argued that we should stay on layer 1
(apparently because "we've already implemented it" is less effort
than "let's change a little bit")
no wonder they never understand what I'm talking about...
it's also interesting to see that MAL's using layer 3 and 4 issues as an
argument to keep Python's string support at layer 1. in contrast, the
W3 paper thinks that normalization is a non-issue also on the layer 1
level. go figure.
...
btw, how about adopting this paper as the "Character Model for Python"?
yes, I'm serious.
</F>
PS. here's my take on Just's normalization points:
> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)
note that the W3 paper recommends early normalization, and binary
comparison (assuming the same internal representation of the
unicode character codes, of course).
> - this would indeed mean that it's possible for u == v even though type(u)
> is type(v) and len(u) != len(v). However, I don't see how this would
> collapse /F's world, as the two strings are at most semantically
> equivalent. Their physical difference is real, and still follows the
> a-string-is-a-sequence-of-characters rule (!).
yes, but on layer 3 instead of layer 2.
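a sketch of what such a layer 3 comparison could look like, kept separate from the plain layer 2 ==. canonically_equal is a hypothetical helper of mine, and it assumes unicodedata.normalize() is available; nothing here is an existing string method:

```python
import unicodedata

def canonically_equal(u, v):
    """Hypothetical layer 3 comparison: equal after canonical decomposition."""
    return unicodedata.normalize("NFD", u) == unicodedata.normalize("NFD", v)

u = "\u00c5"      # LATIN CAPITAL LETTER A WITH RING ABOVE, one code point
v = "A\u030a"     # 'A' followed by COMBINING RING ABOVE, two code points

layer2_equal = u == v                     # code point by code point
layer3_equal = canonically_equal(u, v)    # combining sequences considered
lengths = (len(u), len(v))
```

the two strings differ in length and fail the layer 2 test, yet the layer 3 helper calls them equal, which is exactly the situation Just describes.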
> - there may be additional customized language-specific sorting rules. I
> currently don't see how to implement that without some global variable.
layer 4.
> - the sorting rules are very complicated, and should be implemented by
> calculating "sort keys". If I understood it correctly, these can take up to
> 4 bytes per character in its most compact form. Still, for it to be
> somewhat speed-efficient, they need to be cached...
layer 4.
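to illustrate the "compute a sort key, then cache it" idea, here is a rough sketch. note loudly that rough_sort_key is a hypothetical stand-in of mine and NOT the tr10 algorithm: it just strips combining marks and case for a first-level comparison and uses the decomposed string as a tie-breaker. it assumes unicodedata.normalize() and functools.lru_cache are available:

```python
import unicodedata
from functools import lru_cache

@lru_cache(maxsize=None)          # cache keys so each string is analysed once
def rough_sort_key(text):
    """Hypothetical two-level key; a crude stand-in for a real collation key."""
    decomposed = unicodedata.normalize("NFD", text)
    # First level: base letters only, ignoring combining marks and case.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Second level: the decomposed string itself breaks ties.
    return (base.casefold(), decomposed)

words = ["zebra", "\u00e9clair", "Eclair", "apple"]
by_code_point = sorted(words)                    # plain layer 2 ordering
by_key = sorted(words, key=rough_sort_key)       # key-based ordering
```

plain code point order puts the accented word after "zebra"; the key-based order files it next to "Eclair". a real implementation would need language-specific tailoring on top, which is why this belongs on layer 4.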
> - u.find() may need an alternative API, which returns a (begin, end) tuple,
> since the match may not have the same length as the search string... (This
> is tricky, since you need the begin and end indices in the non-canonical
> form...)
layer 3.
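a brute-force sketch of the (begin, end) API Just mentions. find_equivalent is a hypothetical function name, the quadratic scan is chosen for clarity rather than speed, and it assumes unicodedata.normalize() and unicodedata.combining() are available. indices refer to the original, non-canonical haystack:

```python
import unicodedata

def find_equivalent(haystack, needle):
    """Hypothetical layer 3 find: return (begin, end) of the first slice of
    haystack that is canonically equivalent to needle, or None."""
    target = unicodedata.normalize("NFD", needle)
    n = len(haystack)
    for begin in range(n):
        # Do not start in the middle of a combining sequence.
        if unicodedata.combining(haystack[begin]):
            continue
        for end in range(begin + 1, n + 1):
            # Do not stop in the middle of a combining sequence.
            if end < n and unicodedata.combining(haystack[end]):
                continue
            candidate = unicodedata.normalize("NFD", haystack[begin:end])
            if candidate == target:
                return (begin, end)
            # Decomposed length only grows with the slice, so give up early.
            if len(candidate) > len(target):
                break
    return None

# The haystack is decomposed, the needle is composed (four code points).
match = find_equivalent("un cafe\u0301 noir", "caf\u00e9")
```

here the needle is four code points long but the matched slice spans five, so a single index as returned by a plain find() would not tell the caller where the match ends.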