[Python-Dev] Normalizing unicode? (was: Re: test_unicode_file
failing on Mac OS X)
Edward Loper
edloper at gradient.cis.upenn.edu
Wed Dec 10 13:32:22 EST 2003
Scott David Daniels wrote:
> I naïvely wrote:
> >Could we perhaps use a comparison that, in effect, did:
> > def uni_equal(first, second):
> >     if first == second:
> >         return True
> >     return first.normalize() == second.normalize()
> >That is, take advantage of the fact that normalization is often
> >unnecessary for "trivial" reasons.
>
> [...]
Before we start considering how it's possible to make unicode.__eq__
act encoding-insensitively[1], I think we need to consider whether
that's really the behavior we want. In some ways, this seems like
case-insensitive equality to me: it's certainly a useful operation, but
I don't think it should be the object's built-in notion of equality.
- I think people will be confused if s1==s2 but s1[0]!=s2[0].
- Sometimes you might *want* to distinguish different encodings of
  the "same" string; a "normalized" equality test makes that very
  difficult.
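
To make the first point concrete, here is a minimal sketch of the
confusion. It assumes a modern Python where str is the unicode type and
normalization is done with unicodedata.normalize (there is no
.normalize() method on strings); the choice of NFC as the form is also
an assumption, and uni_equal is just the helper quoted above rewritten
for illustration.

```python
import unicodedata

# Two spellings of the "same" text: a precomposed e-acute, and a plain
# "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

def uni_equal(first, second, form="NFC"):
    """Normalization-insensitive comparison (illustrative helper)."""
    if first == second:
        return True
    return (unicodedata.normalize(form, first)
            == unicodedata.normalize(form, second))

raw_equal = composed == decomposed                 # built-in equality
norm_equal = uni_equal(composed, decomposed)       # normalized equality
first_chars_equal = composed[0] == decomposed[0]   # compare s1[0], s2[0]
```

If uni_equal were the built-in ==, the two strings would compare equal
even though their lengths and their first characters differ.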
And if you *do* want unicode objects to act normalized, then I think
that the right way to do it is to normalize them at creation time. Then
all the right hash/eq/cmp stuff just falls out.
But since some people may want to distinguish different encodings
of the same string, I think that the most sensible alternative is to add
a new subclass to unicode -- something like "normalized_unicode." It
would normalize itself at construction time; and when combined with
other unicode strings (e.g. by +), the result would be normalized (so
unicode+normalized_unicode -> normalized_unicode). It's possible that
the normalized unicode class would be more useful to people (and
therefore more widely used?), but the non-normalized version would still
be available for people who want it.
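
A rough sketch of what such a subclass might look like, written against
a modern Python where str is the unicode type. The name NormalizedStr
and the choice of NFC are hypothetical, and only construction and + are
handled; other methods (slicing, join, and so on) would still return
plain, possibly unnormalized strings.

```python
import unicodedata

class NormalizedStr(str):
    """Hypothetical str subclass that keeps its contents in NFC form."""

    def __new__(cls, value=""):
        # Normalize once, at construction time, so that the inherited
        # hash and equality operate on the normalized text.
        return super().__new__(cls, unicodedata.normalize("NFC", value))

    def __add__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # Re-normalize: concatenation can create new combining sequences.
        return NormalizedStr(str.__add__(self, other))

    def __radd__(self, other):
        # Covers plain_str + NormalizedStr, so the result is normalized.
        if not isinstance(other, str):
            return NotImplemented
        return NormalizedStr(str.__add__(other, self))

a = NormalizedStr("e\u0301")           # stored as the precomposed form
b = NormalizedStr("e") + "\u0301"      # normalized again after +
c = "caf" + NormalizedStr("e\u0301")   # plain str on the left
```

Because the text is normalized when the object is built, two
NormalizedStr objects made from different spellings compare equal and
hash the same without any change to ==. Comparing one against a plain,
unnormalized string still uses ordinary equality.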
(or we could just leave things as they are now, and force people to do
any normalization themselves. :) )
-Edward
[1] I don't think that "encoding" is the right technical term here, but
I'm not sure what the right term is. I mean insensitive to the
difference between separated diacritics & unified diacritics.