[Python-3000] String comparison

Josiah Carlson jcarlson at uci.edu
Wed Jun 6 22:05:37 CEST 2007


Bill Janssen <janssen at parc.com> wrote:
> 
> > Hear me out for a moment.  People type what they want.
> 
> I do a lot of Pythonic processing of UTF-8, which is not "typed by
> people", but instead extracted from documents by automated processing.
> Text is also data -- an important thing to keep in mind.

Right, but (and this is a big but) you are reading data in from a file.
That is different from source code identifiers and embedded strings.  If
you *want* normalization to happen on your data, that is perfectly
reasonable, and you can do it explicitly (explicit is better than
implicit).  But if someone didn't want normalization and Python did it
anyway, that would be an error passing silently.
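
As a minimal sketch of the explicit route, assuming a current Python 3
interpreter: the helper name normalize_lines and the sample data are
hypothetical, and only unicodedata.normalize comes from the standard
library.  Nothing is changed until the caller asks for it.

```python
import unicodedata

def normalize_lines(lines, form="NFC"):
    """Return each line normalized to the requested Unicode form."""
    return [unicodedata.normalize(form, line) for line in lines]

# Hypothetical data as it might arrive from a file.  The second entry
# spells the same name with "o" + U+0308 COMBINING DIAERESIS instead of
# the precomposed U+00F6.
raw = ["L\u00F6wis", "Lo\u0308wis"]

# Untouched data keeps its code points, so the lengths differ.
raw_lengths = [len(s) for s in raw]

# Explicit request: after NFC both entries are the same sequence.
clean = normalize_lines(raw)
print(raw_lengths, clean[0] == clean[1])
```

Code that needs the original byte-for-byte content simply never calls
the helper, so there is no silent change to undo.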


> As far as normalization goes, I agree with you about identifiers, and
> I use "unicodedata.normalize" extensively in the cases where I care
> about normalization of data strings.  The big issue is string literals.
> I think I agree with Stephen here:
> 
>     u"L\u00F6wis" == u"Lo\u0308wis"
> 
> should be True (assuming he typed it correctly in the first place :-),
> because they are the same Unicode string.  I don't understand Guido's
> objection here -- it's a lexer issue, right?  The underlying character
> string will still be the same in both cases.

It's the Unicode character versus code point issue.  I personally prefer
code points, as a code point approach does exactly what I want it to do
by default: nothing.  If it *does* something without me asking, then
that would seem to be magic to me, and I'm a minimal-magic kind of guy.
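
To make the distinction concrete, here is a small sketch, assuming a
current Python 3 interpreter, where == on str compares code points.
The helper canonically_equal is a hypothetical name; it illustrates a
"character" comparison by normalizing both sides to NFD first.

```python
import unicodedata

precomposed = "L\u00F6wis"  # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "Lo\u0308wis"  # "o" followed by U+0308 COMBINING DIAERESIS

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Code point view: the two sequences differ, so == does nothing clever.
code_point_equal = precomposed == decomposed

# Character view: the strings are canonically equivalent.
character_equal = canonically_equal(precomposed, decomposed)

print(code_point_equal, character_equal)
```

The first comparison is the "do nothing" default; the second only
happens because the caller asked for it.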

 - Josiah


