[Python-Dev] Normalizing unicode? (was: Re: test_unicode_file failing on Mac OS X)

Wed Dec 10 13:32:22 EST 2003

Scott David Daniels wrote:
> I naïvely wrote:
>  >Could we perhaps use a comparison that, in effect, did:
>  >     def uni_equal(first, second):
>  >         if first == second:
>  >             return True
>  >         return first.normalize() == second.normalize()
>  >That is, take advantage of the fact that normalization is often
>  >unnecessary for "trivial" reasons.
> 
> [...]

Before we start considering how to make unicode.__eq__ act
encoding-insensitively[1], I think we need to consider whether
that's really the behavior we want.  In some ways, this seems like
case-insensitive equality to me: it's certainly a useful operation, but
I don't think it should be the object's builtin notion of equality:
   - I think people will be confused if s1==s2 but s1[0]!=s2[0].
   - Sometimes you might *want* to distinguish different encodings of
     the "same" string; a "normalized" equality test makes that very
     difficult.
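
To make the first point concrete, here is a minimal sketch.  It assumes
a modern Python where text is str, uses unicodedata.normalize() where
the quoted pseudocode calls a .normalize() method, and picks NFC as the
normalization form (the choice of form is my assumption, not something
the thread settled):

```python
import unicodedata


def uni_equal(first, second):
    """Normalization-insensitive comparison, as in the quoted proposal."""
    if first == second:
        return True
    # NFC is an assumed choice of form; NFD would work the same way here.
    return unicodedata.normalize("NFC", first) == unicodedata.normalize("NFC", second)


composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # plain 'e' followed by a combining acute accent

whole_strings_equal = uni_equal(composed, decomposed)
first_items_equal = uni_equal(composed[0], decomposed[0])
lengths = (len(composed), len(decomposed))
```

The two strings compare equal as wholes, but their first items do not,
and their lengths differ as well.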

And if you *do* want unicode objects to act normalized, then I think 
that the right way to do it is to normalize them at creation time.  Then 
all the right hash/eq/cmp stuff just falls out.
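
A rough illustration of "normalize at creation time", again assuming
NFC and a hypothetical helper named normalized().  Once both strings
have been through it, the ordinary == and hash() agree without any
special comparison logic:

```python
import unicodedata


def normalized(text, form="NFC"):
    """Hypothetical helper: return text in the given normalization form."""
    return unicodedata.normalize(form, text)


composed = normalized("caf\u00e9")     # precomposed e-acute
decomposed = normalized("cafe\u0301")  # 'e' + combining acute accent

# Plain dict lookups now treat both spellings as the same key.
counts = {}
for word in (composed, decomposed):
    counts[word] = counts.get(word, 0) + 1
```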

But since some people may want to distinguish different encodings
of the same string, I think that the most sensible alternative is to add
a new subclass of unicode -- something like "normalized_unicode."  It
would normalize itself at construction time; and when combined with
other unicode strings (e.g. by +), the result would be normalized (so
unicode+normalized_unicode -> normalized_unicode).  It's possible that
the normalized unicode class would be more useful to people (and
therefore more widely used?), but the non-normalized version would still
be available for people who want it.
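
A sketch of what such a subclass might look like.  The class name
NormalizedUnicode, the FORM attribute, and the choice of NFC are all
assumptions for illustration; it is written against str in a modern
Python rather than the unicode type, and it only covers construction
and +, not every string method:

```python
import unicodedata


class NormalizedUnicode(str):
    """Hypothetical str subclass that is always stored in normalized form."""

    FORM = "NFC"  # assumed normalization form

    def __new__(cls, value=""):
        return super().__new__(cls, unicodedata.normalize(cls.FORM, str(value)))

    def __add__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # Re-normalize: concatenation can create a new composable pair
        # at the boundary between the two operands.
        return type(self)(str.__add__(self, other))

    def __radd__(self, other):
        # Handles plain_str + NormalizedUnicode.
        if not isinstance(other, str):
            return NotImplemented
        return type(self)(str.__add__(other, self))


accent = NormalizedUnicode("\u0301")         # lone combining acute accent
joined = "e" + accent                        # plain str on the left
appended = NormalizedUnicode("caf") + "e\u0301"
```

Because __eq__ and __hash__ are inherited unchanged from str, instances
compare and hash like ordinary strings that happen to be normalized.
Methods such as slicing, join, or replace would still return plain str
in this sketch and would need the same treatment.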

(or we could just leave things as they are now, and force people to do 
any normalization themselves. :) )

-Edward

[1] I don't think that "encoding" is the right technical term here, but 
I'm not sure what the right term is.  I mean insensitive to the 
difference between separated diacritics & unified diacritics.