[Python-3000] String comparison

Thu Jun 7 23:53:32 CEST 2007

On 6/7/07, Rauli Ruohonen <rauli.ruohonen at gmail.com> wrote:

> ... I will use XML character references to denote code points here.
> Wherever you see such a thing in this e-mail, replace it in your
> mind with the corresponding code point *immediately*. E.g.
> len(r'&#00c5;') == 1, but len(r'\u00c5') == 6.

> In the following code == should be false:

> if "L\u00F6wis" == "Lo\u0308wis":
>     print "Python is Unicode conforming in this respect."

> On 6/7/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > I think the default case should be that text operations produce the
> > expected result in the text domain, even at the expense of array
> > invariants.

(There was confusion -- an explicit escape such as \u probably stands
out enough to signal the non-default case.  But even there, it would
also be reasonable to say "use something other than text.")

> > People who need arrays of code points have several ways to
> > get them, and the usual comparison operators will work on them
> > as desired.

> But regexps and other string operations won't, and those are the
> whole point of strings,

(I was thinking that regexps would actually take a buffer interface, but...)

How would you expect them to work on arrays of code points?  What sort
of answer should the following produce?

    # matches by codepoints, but doesn't look like it
    "Lo&#0308;wis".startswith("Lo")

    # if the above did match, then people will assume ö folds to o
    "L&#00F6;wis".startswith("Lo")

    # looks like it matches.  Matches as text.  Does not match as bytes.
    "Lo&#0308;wis".startswith("L&#00F6;")
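
To make the three cases concrete, here is a minimal sketch written for
Python 3 and its standard unicodedata module. The helper name
nfc_startswith is hypothetical, and treating "matches as text" as
"matches after NFC normalization of both sides" is an assumption made
for illustration, not something this thread settles.

```python
import unicodedata

decomposed = "Lo\u0308wis"   # 'o' followed by U+0308 COMBINING DIAERESIS
precomposed = "L\u00F6wis"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS


def nfc_startswith(text, prefix):
    """Hypothetical helper: compare after NFC-normalizing both sides."""
    norm_text = unicodedata.normalize("NFC", text)
    norm_prefix = unicodedata.normalize("NFC", prefix)
    return norm_text.startswith(norm_prefix)


# Plain str.startswith compares code point by code point.
by_codepoints = [
    decomposed.startswith("Lo"),        # matches, though it doesn't look like it
    precomposed.startswith("Lo"),       # no match
    decomposed.startswith("L\u00F6"),   # no match, though it looks like one
]

# Under the NFC assumption, the answers line up with what the text looks like.
by_nfc = [
    nfc_startswith(decomposed, "Lo"),
    nfc_startswith(precomposed, "Lo"),
    nfc_startswith(decomposed, "L\u00F6"),
]

print(by_codepoints)  # [True, False, False]
print(by_nfc)         # [False, False, True]
```

Under this reading the first case stops matching, so nobody is led to
assume that ö folds to o, and the third case matches as it appears to.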

-jJ