[Python-Dev] PEP 277 (unicode filenames): please review

Jack Jansen Jack.Jansen@oratrix.com
Wed, 14 Aug 2002 21:52:00 +0200


On Wednesday, August 14, 2002, at 02:13, Guido van Rossum wrote:
> Note that normalization doesn't belong in the codecs (except perhaps
> as a separate Unicode->Unicode codec, since codecs seem to be useful
> for all string->string transformations).  It's a separate step that
> the application has to request; only the app knows whether a
> particular Unicode string is already normalized or not, and whether
> the expense is useful for the app, or not.

I don't like this, I don't like it at all.

Python jumps through hoops to make 'jack' and u'jack' compare
identical and be interchangeable in dict keys and what have you,
and now suddenly I find out that there's two ways to say u'jäck'
and they won't compare equal. Not good.
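
To make the mismatch concrete, here is a minimal sketch, assuming a Python 3
interpreter and the standard unicodedata module (the variable names are
illustrative only). It spells the same visible word once with a precomposed
character and once with a base letter plus a combining mark, then shows that
the comparison only succeeds after an explicit normalization step.

```python
import unicodedata

# Two spellings of the same visible text (illustrative values).
composed = "j\u00e4ck"     # U+00E4, a single precomposed code point
decomposed = "ja\u0308ck"  # 'a' followed by U+0308 COMBINING DIAERESIS

# The raw strings differ in both content and length.
print(composed == decomposed)
print(len(composed), len(decomposed))

# Normalizing both sides to the same form (NFC here) makes them compare equal.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
print(nfc_decomposed == composed)
```

Whether that normalization happens implicitly or has to be requested by the
application is exactly the point under discussion.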

I sympathise with the fact that this is difficult (although I
still don't understand why: whereas when you want to create the
decomposed version I can imagine there's N! ways to notate a
character with N combining chars, I would think there's one and
only one way to write a combined character), but that shouldn't
stop us at least planning to fix this.
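
The ordering concern can be sketched as follows, again assuming Python 3 and
the standard unicodedata module, with illustrative names. Two combining marks
are attached to the same base letter in different orders; the raw strings
differ, but both normalization forms map the two orderings to a single
sequence, because the marks belong to different combining classes and get
sorted into a canonical order.

```python
import unicodedata

# Same base letter, same two marks, written in two different orders.
# U+0308 is COMBINING DIAERESIS (above), U+0323 is COMBINING DOT BELOW.
order_a = "a\u0308\u0323"
order_b = "a\u0323\u0308"

# As raw code point sequences the two spellings are different strings.
print(order_a == order_b)

# Canonical decomposition (NFD) sorts the marks into one agreed order.
nfd_a = unicodedata.normalize("NFD", order_a)
nfd_b = unicodedata.normalize("NFD", order_b)
print(nfd_a == nfd_b)

# Canonical composition (NFC) likewise yields one result for both inputs.
nfc_a = unicodedata.normalize("NFC", order_a)
nfc_b = unicodedata.normalize("NFC", order_b)
print(nfc_a == nfc_b)
```

This only covers marks of different combining classes; marks of the same
class are not reordered, since swapping them can change how the text renders.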

And I don't think the burden should fall on the application.
That same reasoning could have been followed for making ascii
and unicode-ascii-subset compare equal: the application will
know it has to convert ascii to unicode before comparing.
--
- Jack Jansen        <Jack.Jansen@oratrix.com>        http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -