Unicode comparisons & normalization
After quickly browsing through the unicode.org URLs I posted earlier, I reach the following (possibly wrong) conclusions:

- there is a script and language independent canonical form (but automatic normalization is indeed a bad idea)

- ideally, unicode comparisons should follow the rules from http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic for 1.6, if at all...)

- this would indeed mean that it's possible for u == v even though type(u) is type(v) and len(u) != len(v). However, I don't see how this would collapse /F's world, as the two strings are at most semantically equivalent. Their physical difference is real, and still follows the a-string-is-a-sequence-of-characters rule (!).

- there may be additional customized language-specific sorting rules. I currently don't see how to implement that without some global variable.

- the sorting rules are very complicated, and should be implemented by calculating "sort keys". If I understood it correctly, these can take up to 4 bytes per character in its most compact form. Still, for it to be somewhat speed-efficient, they need to be cached...

- u.find() may need an alternative API, which returns a (begin, end) tuple, since the match may not have the same length as the search string... (This is tricky, since you need the begin and end indices in the non-canonical form...)

Just
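[Editor's note: a minimal sketch of the canonical-equivalence point above, using today's unicodedata module rather than anything that existed in Python 1.6; it only illustrates how two differently-spelled but canonically equivalent strings can differ in length yet compare equal after normalization.]

    import unicodedata

    composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE (one code point)
    decomposed = "e\u0301"     # "e" + COMBINING ACUTE ACCENT (two code points)

    assert composed != decomposed            # plain == compares code point sequences
    assert len(composed) != len(decomposed)  # 1 vs 2

    # After normalizing both to a canonical form (NFC here), binary comparison
    # is sufficient to detect the equivalence.
    assert unicodedata.normalize("NFC", composed) == \
           unicodedata.normalize("NFC", decomposed)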
On Wed, 3 May 2000, Just van Rossum wrote:
After quickly browsing through the unicode.org URLs I posted earlier, I reach the following (possibly wrong) conclusions:
- there is a script and language independent canonical form (but automatic normalization is indeed a bad idea)

- ideally, unicode comparisons should follow the rules from http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic for 1.6, if at all...)
I just looked through this document. Indeed, there's a lot of work to be done if we want to compare strings this way.

I thought the most striking feature was that this comparison method does *not* satisfy the common assumption

    a > b implies a + c > b + d    (+ is concatenation)

-- in fact, it is specifically designed to allow for cases where differences in the *later* part of a string can have greater influence than differences in an earlier part of a string. It *does* still guarantee that a + b > a, and of course we can still rely on the most basic rules, such as a > b and b > c implies a > c.

There are sufficiently many significant transformations described in the UTR 10 document that i'm pretty sure it is possible for two things to collate equally but not be equivalent. (Even after Unicode normalization, there is still the possibility of rearrangement in step 1.2.)

This would be another motivation for Python to carefully separate the three types of equality:

    is     identity-equal
    ==     value-equal
    <=>    magnitude-equal

We currently don't distinguish between the last two; the operator "<=>" is my proposal for how to spell "magnitude-equal", and in terms of outward behaviour you can consider (a <=> b) to be (a <= b and a >= b). I suspect we will find ourselves needing it if we do rich comparisons anyway.

(I don't know of any other useful kinds of equality, but if you've run into this before, do pipe up...)

-- ?!ng
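[Editor's note: a rough sketch of the three notions of equality distinguished above. The sort_key() function is a placeholder, not a real implementation: a real version would build UTR 10 collation sort keys (or wrap something like locale.strxfrm), and with this stub magnitude_equal degenerates to ordinary value equality.]

    def sort_key(s):
        # Placeholder: a real implementation would compute UTR 10 sort keys.
        return s

    def identity_equal(a, b):
        return a is b                 # same object

    def value_equal(a, b):
        return a == b                 # same sequence of characters

    def magnitude_equal(a, b):
        # Ping's proposed "a <=> b": neither string sorts before the other.
        ka, kb = sort_key(a), sort_key(b)
        return ka <= kb and ka >= kb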
[Ping]
This would be another motivation for Python to carefully separate the three types of equality:
    is     identity-equal
    ==     value-equal
    <=>    magnitude-equal
We currently don't distinguish between the last two; the operator "<=>" is my proposal for how to spell "magnitude-equal", and in terms of outward behaviour you can consider (a <=> b) to be (a <= b and a >= b). I suspect we will find ourselves needing it if we do rich comparisons anyway.
I don't think that this form of equality deserves its own operator. The Unicode comparison rules are sufficiently hairy that it seems better to implement them separately, either in a separate module or at least as a Unicode-object-specific method, and let the == operator do what it does best: compare the representations. --Guido van Rossum (home page: http://www.python.org/~guido/)
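[Editor's note: a sketch of what the "separate module, leave == alone" approach might look like. It uses today's locale module as a rough stand-in for such a collation module; locale collation is locale-dependent rather than UTR 10 compliant, and the example is only meant to show == keeping its representation-comparison meaning while ordering lives elsewhere.]

    import locale

    # Adopt the user's collation rules for ordering only.
    locale.setlocale(locale.LC_COLLATE, "")

    a, b = "c\u00f4te", "cote"

    print(a == b)                 # False: == still compares the representations
    print(locale.strcoll(a, b))   # negative, zero or positive per collation rules

    # Sorting via precomputed collation keys, in the spirit of the
    # "sort keys should be cached" point earlier in the thread:
    words = ["cote", "c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te"]
    words.sort(key=locale.strxfrm)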
Just van Rossum wrote:
After quickly browsing through the unicode.org URLs I posted earlier, I reach the following (possibly wrong) conclusions:
here's another good paper that covers this, the universe, and everything:

    Character Model for the World Wide Web
    http://www.w3.org/TR/charmod

among many other things, it argues that normalization should be done at the source, and that it should be sufficient to do binary matching to tell if two strings are identical.

...

another very interesting thing from that paper is where they identify four layers of character support:

    Layer 1: Physical representation. This is necessary for APIs that
    expose a physical representation of string data. /.../ To avoid
    problems with duplicates, it is assumed that the data is
    normalized /.../

    Layer 2: Indexing based on abstract codepoints. /.../ This is the
    highest layer of abstraction that ensures interoperability with
    very low implementation effort. To avoid problems with duplicates,
    it is assumed that the data is normalized /.../

    Layer 3: Combining sequences, user-relevant. /.../ While we think
    that an exact definition of this layer should be possible, such a
    definition does not currently exist.

    Layer 4: Depending on language and operation. This layer is least
    suited for interoperability, but is necessary for certain
    operations, e.g. sorting.

until now, this discussion has focussed on the boundary between layer 1 and 2. that as many python strings as possible should be on the second layer has always been obvious to me ("a very low implementation effort" is exactly my style ;-), and leave the rest for the app.

...while Guido and MAL have argued that we should stay on level 1 (apparently because "we've already implemented it" is less effort than "let's change a little bit"). no wonder they never understand what I'm talking about...

it's also interesting to see that MAL's using layer 3 and 4 issues as an argument to keep Python's string support at layer 1. in contrast, the W3 paper thinks that normalization is a non-issue also on the layer 1 level. go figure.

...

btw, how about adopting this paper as the "Character Model for Python"? yes, I'm serious.

</F>

PS. here's my take on Just's normalization points:
- there is a script and language independent canonical form (but automatic normalization is indeed a bad idea)

- ideally, unicode comparisons should follow the rules from http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic for 1.6, if at all...)
note that the W3 paper recommends early normalization, and binary comparison (assuming the same internal representation of the unicode character codes, of course).
- this would indeed mean that it's possible for u == v even though type(u) is type(v) and len(u) != len(v). However, I don't see how this would collapse /F's world, as the two strings are at most semantically equivalent. Their physical difference is real, and still follows the a-string-is-a-sequence-of-characters rule (!).
yes, but on layer 3 instead of layer 2.
- there may be additional customized language-specific sorting rules. I currently don't see how to implement that without some global variable.
layer 4.
- the sorting rules are very complicated, and should be implemented by calculating "sort keys". If I understood it correctly, these can take up to 4 bytes per character in its most compact form. Still, for it to be somewhat speed-efficient, they need to be cached...
layer 4.
- u.find() may need an alternative API, which returns a (begin, end) tuple, since the match may not have the same length as the search string... (This is tricky, since you need the begin and end indices in the non-canonical form...)
layer 3.
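[Editor's note: picking up the layer 2 / layer 3 distinction and the find() point from the exchange above, a small sketch using today's unicodedata module. The begin/end values are hand-picked for the example, not computed by any real equivalence-aware search.]

    import unicodedata

    text = "re\u0301sume\u0301"   # "resume" with decomposed accents: 8 code points
    needle = "\u00e9"             # precomposed e-acute: 1 code point

    # Layer 2: indexing and len() count abstract code points.
    print(len(text))              # 8, although a user perceives 6 characters

    # Layer 3: the canonically equivalent match in `text` spans two code
    # points even though the search string is a single code point -- which
    # is why a find() that understands equivalence would want to report a
    # (begin, end) pair in the original, non-canonical text.
    begin, end = 1, 3             # span of "e" + combining acute
    assert unicodedata.normalize("NFC", text[begin:end]) == \
           unicodedata.normalize("NFC", needle)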
here's another good paper that covers this, the universe, and everything:
There are a lot of useful pointers being flung around. Could someone with more spare cycles than I currently have perhaps collect these and produce a little write-up, "further reading on Unicode comparison and normalization" (or perhaps a more comprehensive title if warranted), to be added to the i18n-sig's home page? --Guido van Rossum (home page: http://www.python.org/~guido/)
participants (4)
- Fredrik Lundh
- Guido van Rossum
- Just van Rossum
- Ka-Ping Yee