Just van Rossum wrote:
After quickly browsing through the unicode.org URLs I posted earlier, I reach the following (possibly wrong) conclusions:
here's another good paper that covers this, the universe, and everything:

    Character Model for the World Wide Web
    http://www.w3.org/TR/charmod

among many other things, it argues that normalization should be done at the source, and that it should be sufficient to do binary matching to tell if two strings are identical.

...

another very interesting thing from that paper is where they identify four layers of character support:

    Layer 1: Physical representation. This is necessary for APIs that expose a physical representation of string data. /.../ To avoid problems with duplicates, it is assumed that the data is normalized /.../

    Layer 2: Indexing based on abstract codepoints. /.../ This is the highest layer of abstraction that ensures interoperability with very low implementation effort. To avoid problems with duplicates, it is assumed that the data is normalized /.../

    Layer 3: Combining sequences, user-relevant. /.../ While we think that an exact definition of this layer should be possible, such a definition does not currently exist.

    Layer 4: Depending on language and operation. This layer is least suited for interoperability, but is necessary for certain operations, e.g. sorting.

until now, this discussion has focussed on the boundary between layer 1 and 2. that as many python strings as possible should be on the second layer has always been obvious to me ("a very low implementation effort" is exactly my style ;-), leaving the rest to the app.

...while Guido and MAL have argued that we should stay on level 1 (apparently because "we've already implemented it" is less effort than "let's change a little bit"). no wonder they never understand what I'm talking about...

it's also interesting to see that MAL's using layer 3 and 4 issues as an argument to keep Python's string support at layer 1. in contrast, the W3 paper thinks that normalization is a non-issue also on the layer 1 level. go figure.

...

btw, how about adopting this paper as the "Character Model for Python"? yes, I'm serious.

</F>

PS. here's my take on Just's normalization points:
- there is a script and language independent canonical form (but automatic normalization is indeed a bad idea)

- ideally, unicode comparisons should follow the rules from http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic for 1.6, if at all...)
note that the W3 paper recommends early normalization, and binary comparison (assuming the same internal representation of the unicode character codes, of course).
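fwiw, in today's Python the W3 recipe looks something like this (unicodedata.normalize is a much later addition, so this is a sketch of the idea, not of any 1.6 API):

    # early normalization: bring both strings to a canonical form once,
    # at the source; after that, a plain binary comparison is enough.
    import unicodedata

    a = "caf\u00e9"     # precomposed: U+00E9, e with acute
    b = "cafe\u0301"    # decomposed: "e" + U+0301 combining acute accent

    print(a == b)       # False -- binary comparison on the raw codepoints
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))   # True -- equal after normalization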
- this would indeed mean that it's possible for u == v even though type(u) is type(v) and len(u) != len(v). However, I don't see how this would collapse /F's world, as the two strings are at most semantically equivalent. Their physical difference is real, and still follows the a-string-is-a-sequence-of-characters rule (!).
yes, but on layer 3 instead of layer 2.
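to make that concrete (again with today's unicodedata module, illustration only): two canonically equivalent strings can have different lengths at the codepoint level, so treating them as equal is a layer 3 notion, not a layer 2 one:

    import unicodedata

    u = "\u00f6"      # one codepoint: precomposed o with diaeresis
    v = "o\u0308"     # two codepoints: "o" + combining diaeresis

    print(len(u), len(v))    # 1 2
    print(u == v)            # False: codepoint-for-codepoint comparison (layer 2)
    print(unicodedata.normalize("NFD", u) ==
          unicodedata.normalize("NFD", v))   # True: canonical equivalence (layer 3)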
- there may be additional customized language-specific sorting rules. I currently don't see how to implement that without some global variable.
layer 4.
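the standard locale module illustrates both the point and the problem: the language-specific collation rules exist, but you select them through process-wide state -- exactly the global variable Just is worried about. sketch in modern Python, and it assumes a Swedish locale is installed on the box:

    import locale

    words = ["zebra", "\u00e5r", "apple"]    # "år" sorts after "z" in Swedish

    # the collation rules are picked via global, per-process state:
    locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
    print(sorted(words, key=locale.strxfrm))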
- the sorting rules are very complicated, and should be implemented by calculating "sort keys". If I understood it correctly, these can take up to 4 bytes per character in their most compact form. Still, for this to be somewhat speed-efficient, they need to be cached...
layer 4.
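a cache along these lines would do; locale.strxfrm stands in for a real TR10 sort key here (modern Python, sketch only):

    import locale

    _key_cache = {}

    def sort_key(s):
        # compute the expensive sort key once per distinct string
        key = _key_cache.get(s)
        if key is None:
            key = _key_cache[s] = locale.strxfrm(s)
        return key

    def usort(strings):
        return sorted(strings, key=sort_key)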
- u.find() may need an alternative API, which returns a (begin, end) tuple, since the match may not have the same length as the search string... (This is tricky, since you need the begin and end indices in the non-canonical form...)
layer 3.
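a deliberately naive sketch of what such an API could return (modern Python, made-up name, quadratic and slow -- illustration only): it compares candidate slices of the original string against the needle under NFC, and hands back the (begin, end) indices into the non-canonical form:

    import unicodedata

    def _nfc(s):
        return unicodedata.normalize("NFC", s)

    def ufind(haystack, needle):
        # return (begin, end) such that haystack[begin:end] is canonically
        # equivalent to needle, or None if there is no match.
        target = _nfc(needle)
        for begin in range(len(haystack)):
            for end in range(begin + 1, len(haystack) + 1):
                if _nfc(haystack[begin:end]) == target:
                    return begin, end
        return None

    print(ufind("cafe\u0301 au lait", "caf\u00e9"))   # (0, 5): the match is 5 codepoints long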