[Python-Dev] Normalizing unicode? (was: Re: test_unicode_file failing on Mac OS X)
Guido van Rossum
guido at python.org
Wed Dec 10 12:39:16 EST 2003
> Before we start considering how it's possible to make unicode.__eq__
> act encoding-insensitively[1], I think we need to consider whether
> that's really the behavior we want. In some ways, this seems like
> case-insensitive equality to me: it's certainly a useful operation, but
> I don't think it should be the object's builtin notion of equality.
> - I think people will be confused if s1==s2 but s1[0]!=s2[0].
> - Sometimes you might *want* to distinguish different encodings of
> the "same" string; a "normalized" equality test makes that very
> difficult.
Right. Couldn't have said it better myself.
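To make the first quoted bullet concrete, here is a minimal sketch. It assumes modern Python 3, where str is the Unicode type (the thread itself predates that). The helper name normalized_equal is hypothetical, and NFC is chosen arbitrarily as the normalization form.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

def normalized_equal(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(composed == decomposed)                  # False: built-in equality sees different code points
print(normalized_equal(composed, decomposed))  # True: equal once both are normalized
print(composed[0] == decomposed[0])            # False: the first elements still differ
print(len(composed), len(decomposed))          # 1 2
```

If the second comparison were the built-in notion of equality, the two strings would compare equal while their first elements and lengths did not, which is the confusion described above.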
> And if you *do* want unicode objects to act normalized, then I think
> that the right way to do it is to normalize them at creation time. Then
> all the right hash/eq/cmp stuff just falls out.
Exactly.
> But since some people may want to distinguish different encodings
> of the same string, I think that the most sensible alternative is to add
> a new subclass to unicode -- something like "normalized_unicode." It
> would normalize itself at construction time; and when combined with
> other unicode strings (e.g. by +), the result would be normalized (so
> unicode+normalized_unicode -> normalized_unicode). It's possible that
> the normalized unicode class would be more useful to people (and
> therefore more widely used?), but the non-normalized version would still
> be available for people who want it.
Works for me. I recommend that someone try this approach as a user
subclass first -- this should be easy enough, right?
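A rough sketch of what such a user subclass might look like, written for modern Python 3 (an assumption; there str plays the role of unicode). The class name NormalizedStr, the FORM attribute, and the choice of NFC are all illustrative and not taken from the thread.

```python
import unicodedata

class NormalizedStr(str):
    """Hypothetical str subclass that normalizes itself at construction."""

    FORM = "NFC"  # assumed normalization form; any of NFC/NFD/NFKC/NFKD would do

    def __new__(cls, value=""):
        # Normalize once, up front; hash/eq are then inherited from str
        # and operate on the already-normalized contents.
        return super().__new__(cls, unicodedata.normalize(cls.FORM, str(value)))

    def __add__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # Re-normalize: concatenation can create new combining sequences.
        return type(self)(str.__add__(self, other))

    def __radd__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # Plain str + NormalizedStr also yields a NormalizedStr.
        return type(self)(str.__add__(other, self))

a = NormalizedStr("e\u0301")            # decomposed input
print(a == "\u00e9", len(a))            # True 1
joined = NormalizedStr("e") + "\u0301"  # the accent combines across the join
print(type(joined).__name__, len(joined))  # NormalizedStr 1
```

Only + is handled here. Other operations such as slicing, join, and formatting would still return plain str objects, so a fuller version would need to wrap those as well.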
> (or we could just leave things as they are now, and force people to do
> any normalization themselves. :) )
Do we even have normalization code in core Python?
--Guido van Rossum (home page: http://www.python.org/~guido/)