[Python-Dev] Normalizing unicode? (was: Re: test_unicode_file failing on Mac OS X)

Wed Dec 10 12:39:16 EST 2003

> Before we start considering how it's possible to make unicode.__eq__ 
> act encoding-insensitively[1], I think we need to consider whether 
> that's really the behavior we want.  In some ways, this seems like 
> case-insensitive equality to me: it's certainly a useful operation, but 
> I don't think it should be the object's builtin notion of equality.
>    - I think people will be confused if s1==s2 but s1[0]!=s2[0].
>    - Sometimes you might *want* to distinguish different encodings of
>      the "same" string; a "normalized" equality test makes that very
>      difficult.

Right.  Couldn't have said it better myself.
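
To make the two quoted points concrete, here is a minimal sketch. It
assumes a modern Python 3 interpreter, where `str` plays the role of the
`unicode` type discussed here, and it uses `unicodedata.normalize` from
the standard library. The helper name `normalized_equal` and the choice
of the NFC form are illustrative assumptions, not something proposed in
this thread.

```python
import unicodedata

# Two encodings of the "same" visible string:
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def normalized_equal(s1, s2, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)


# Builtin equality compares code points, so the two strings differ.
plain_equal = composed == decomposed

# A separate, explicit operation can treat them as the same string.
loose_equal = normalized_equal(composed, decomposed)

# If == itself were normalizing, the strings would compare equal even
# though their first items differ and their lengths disagree.
first_items_differ = composed[0] != decomposed[0]
lengths = (len(composed), len(decomposed))
```

Keeping the normalizing comparison as a separate function leaves the
builtin == free to distinguish the two encodings when that is wanted.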

> And if you *do* want unicode objects to act normalized, then I think 
> that the right way to do it is to normalize them at creation time.  Then 
> all the right hash/eq/cmp stuff just falls out.

Exactly.

> But since some people may want to distinguish different encodings 
> of the same string, I think that the most sensible alternative is to add 
> a new subclass to unicode -- something like "normalized_unicode."  It 
> would normalize itself at construction time; and when combined with 
> other unicode strings (eg by +), the result would be normalized (so 
> unicode+normalized_unicode -> normalized_unicode).  It's possible that 
> the normalized unicode class would be more useful to people (and 
> therefore more widely used?), but the non-normalized version would still 
> be available for people who want it.

Works for me.  I recommend that someone try this approach as a user
subclass first -- this should be easy enough, right?
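
A user subclass along these lines might look roughly like the sketch
below. It assumes a modern Python 3 interpreter (subclassing `str`
rather than `unicode`); the class name `NormalizedStr`, the `FORM`
attribute, and the choice of NFC are hypothetical and are not taken
from this thread.

```python
import unicodedata


class NormalizedStr(str):
    """A str that normalizes itself at construction time (sketch only)."""

    FORM = "NFC"  # assumed normalization form; NFD/NFKC/NFKD also exist

    def __new__(cls, value=""):
        # Normalize once, at creation; hash/eq then fall out of str.
        return super().__new__(cls, unicodedata.normalize(cls.FORM, str(value)))

    def __add__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # normalized + anything -> normalized (renormalize the result,
        # since concatenation can create new combining sequences)
        return type(self)(str(self) + str(other))

    def __radd__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # plain + normalized -> normalized
        return type(self)(str(other) + str(self))


a = NormalizedStr("e\u0301")          # stored as the single code point U+00E9
b = "e" + NormalizedStr("\u0301")     # plain + normalized
c = NormalizedStr("caf") + "e\u0301"  # normalized + plain
```

Only construction and + are covered here. Other inherited methods such
as slicing, upper() or join() still return plain str objects, so a
fuller version would need to wrap those as well.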

> (or we could just leave things as they are now, and force people to do 
> any normalization themselves. :) )

Do we even have normalization code in core Python?

--Guido van Rossum (home page: http://www.python.org/~guido/)